CN103489445B - Method and device for identifying human voice in audio - Google Patents

Method and device for identifying human voice in audio

Info

Publication number
CN103489445B
CN103489445B CN201310429920.0A
Authority
CN
China
Prior art keywords
frame
audio
human voice
audio data
LPC
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310429920.0A
Other languages
Chinese (zh)
Other versions
CN103489445A (en)
Inventor
田彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Taile Culture Technology Co.,Ltd.
Original Assignee
Beijing Yinzhibang Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yinzhibang Culture Technology Co Ltd
Priority to CN201310429920.0A
Publication of CN103489445A
Application granted
Publication of CN103489445B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method and a device for identifying human voice in audio. The method comprises: framing the audio data; analyzing each frame of the framed audio data with linear predictive coding (LPC) of order P and extracting audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum; forming a (P+3)-dimensional feature vector from the audio features; training on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine; and identifying, according to the support vector machine, whether each frame of audio data contains human voice. The invention achieves high-precision, high-confidence identification of human voice in audio and provides a basic service for song-content analysis, enabling functions such as lyric synchronization, song classification, and song recommendation.

Description

Method and device for identifying human voice in audio
Technical field
The present invention relates to the field of multimedia information, in particular to audio signal analysis, and more particularly to a method and device for identifying human voice in audio.
Background technology
As multimedia technology develops, audio/video information plays an ever greater role in people's work, study, and entertainment. For example, the major music websites on the Internet classify songs and recommend songs, so that each user can find songs quickly or be recommended songs.
At present, song classification, song recommendation, and similar work at the major music sites are mostly based on text analysis and collaborative filtering of user behavior; applications that go deep into audio-content analysis are rarely seen. Audio-content analysis classifies audio according to extracted audio features, allowing users to retrieve the desired audio more accurately and even enabling retrieval based on the audio content itself. A song generally contains accompaniment sections and vocal sections, and accurately detecting the positions of vocal sections in audio is fundamental work in the field of audio-content analysis, but it is difficult and challenging. There is some prior research on vocal detection in songs, but its precision is not high and its accuracy is low.
Summary of the invention
The object of the present invention is to provide a method and device for identifying human voice in audio. Effective audio features, namely the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum, are extracted to form a feature vector, and machine learning is used to identify human voice in audio. This solves the problem of low precision and low accuracy in human-voice recognition research and achieves high-precision, high-confidence identification of human voice in audio.
In a first aspect, an embodiment of the present invention provides a method for identifying human voice in audio, the method comprising:
framing the audio data;
analyzing each frame of the framed audio data with linear predictive coding (LPC) of order P and extracting audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum;
forming a (P+3)-dimensional feature vector from the audio features;
training on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine;
identifying, according to the support vector machine, whether each frame of audio data contains human voice.
In a second aspect, an embodiment of the present invention also provides a device for identifying human voice in audio, the device comprising: a framing module, an audio-feature extraction module, a feature-vector module, a support vector machine training module, and an identification module,
wherein the framing module is configured to frame the audio data;
the audio-feature extraction module is configured to analyze each frame of the framed audio data with linear predictive coding (LPC) of order P and extract audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum;
the feature-vector module is configured to form a (P+3)-dimensional feature vector from the audio features;
the support vector machine training module is configured to train on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine;
and the identification module is configured to identify, according to the support vector machine, whether each frame of audio data contains human voice.
By extracting effective audio features, namely the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum, to form a feature vector, and by using machine learning to identify human voice in audio, the present invention achieves high-precision, high-confidence identification of human voice in audio and provides a basic service for song-content analysis, enabling functions such as lyric synchronization, song classification, and song recommendation.
Accompanying drawing explanation
Fig. 1 is a flowchart of the method for identifying human voice in audio in the first embodiment of the present invention.
Fig. 2 is a detailed flowchart of step 101 in the first embodiment of the present invention.
Fig. 3 is a detailed flowchart of step 102 in the first embodiment of the present invention.
Fig. 4 is a structural diagram of the device for identifying human voice in audio in the second embodiment of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire structure.
The first embodiment of the present invention is shown in Figure 1.
Fig. 1 shows the method for identifying human voice in audio in the first embodiment; the flow 100 is detailed as follows:
In step 101, the audio data is framed.
Step 101 (as shown in Fig. 2) specifically comprises:
Step 1011: detect whether the audio is stereo or multichannel.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel can be extracted, or all channels can be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to the preset frame length.
Step 1012: when the audio is stereo or multichannel, merge all channels into one channel and extract the audio data.
Step 1013: frame the audio sample sequence in the audio data according to the preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the audio sample sequence in the audio data is framed according to the preset frame length. Because the characteristics of an audio signal as a whole, and the parameters describing them, vary over time, the signal cannot be analyzed with digital signal-processing methods intended for stationary signals. Within a short time range (generally taken to be 10 to 30 ms), however, its characteristics remain essentially unchanged, i.e. relatively stable, so the signal can be regarded as a quasi-stationary process. This is "short-time analysis": the audio signal is divided into frames, and the characteristic parameters of each frame are analyzed.
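For illustration, a minimal NumPy sketch of the downmix and framing steps might look as follows; the 20 ms frame length and 50% overlap are taken from the preferred implementation described later, and the function names and the 16 kHz example rate are illustrative assumptions, not part of the patent:

```python
import numpy as np

def to_mono(x: np.ndarray) -> np.ndarray:
    """Downmix stereo/multichannel audio (shape: samples x channels) to mono."""
    return x if x.ndim == 1 else x.mean(axis=1)

def frame_audio(samples: np.ndarray, sr: int,
                frame_ms: float = 20.0, overlap: float = 0.5) -> np.ndarray:
    """Split a mono sample sequence into overlapping short-time frames."""
    frame_len = int(sr * frame_ms / 1000)   # e.g. 20 ms -> 320 samples at 16 kHz
    hop = int(frame_len * (1.0 - overlap))  # 50% overlap between adjacent frames
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```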
In step 102, each frame of the framed audio data is analyzed with linear predictive coding (LPC) of order P and the audio features are extracted; the audio features comprise the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum.
Step 102 (as shown in Fig. 3) specifically comprises:
Step 1021: for each frame of audio data, extract the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal in the frame crosses zero level.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal in a frame crosses zero level. It can distinguish unvoiced from voiced sound, because the zero-crossing rate of the high-frequency band of an audio signal is higher while that of the low-frequency band is lower. The short-time zero-crossing rate is computed with the following formula:
$$Z_i = \frac{1}{2}\sum_{n=1}^{N}\left|\operatorname{sgn}(S_n) - \operatorname{sgn}(S_{n-1})\right|$$

where the natural number $n$ indexes the audio samples in the $i$-th frame, with maximum value $N$; $S_n$ is the $n$-th sample value, and $\operatorname{sgn}(\cdot)$ is the sign function, equal to 1 for positive sample values and -1 for negative or zero sample values, i.e.

$$\operatorname{sgn}(S_n) = \begin{cases} 1, & S_n > 0 \\ -1, & S_n \le 0 \end{cases}$$
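A direct NumPy rendering of this formula (a sketch under the same assumptions as the framing helper above) might be:

```python
def short_time_zcr(frame: np.ndarray) -> float:
    """Z_i = 1/2 * sum_n |sgn(S_n) - sgn(S_{n-1})|, with sgn(s) = 1 for s > 0
    and -1 for s <= 0: the number of zero-level crossings in the frame."""
    sgn = np.where(frame > 0, 1, -1)
    return 0.5 * float(np.abs(np.diff(sgn)).sum())
```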
Step 1022: perform linear predictive coding (LPC) analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients and the LPC prediction residual.
In this embodiment, linear predictive coding (LPC) derives the parameters of the vocal-tract excitation and transfer function by analyzing the audio waveform; coding these parameters instead of the sound waveform itself greatly reduces the data volume. The current value of a speech sample can be approximated by a weighted linear combination of several past samples, and the weights in this combination are the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model for a linear, time-invariant, causal, stable system and estimates the model parameters of the known speech signal $s(n)$ under a mean-squared-error criterion. If P past samples are used for prediction, this is order-P linear prediction. Suppose the current sample $s(n)$ is predicted from the weighted sum of the past P samples $\{s(n-1), s(n-2), \ldots, s(n-P)\}$; the predicted signal $\hat{s}(n)$ is then:
$$\hat{s}(n) = \sum_{k=1}^{P} a_k\, s(n-k)$$

where the weighting coefficients $a_k$ are called the LPC prediction coefficients. The prediction error $e(n)$ is then:

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$$

For the prediction to be optimal, the short-time mean-squared prediction error $\varepsilon$ must be minimized:

$$\varepsilon = E\left[e^2(n)\right] = \min, \qquad \frac{\partial E\left[e^2(n)\right]}{\partial a_k} = 0, \quad 1 \le k \le P$$

Let $\varphi(i,k) = E\left[s(n-i)\,s(n-k)\right]$; the minimum error can then be expressed as:

$$\varepsilon_{\min} = \varphi(0,0) - \sum_{k=1}^{P} a_k\, \varphi(0,k)$$

The closer the error is to zero, the more accurate the linear prediction in the minimum-mean-squared-error sense; the prediction coefficients can be computed from these relations.
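In practice the coefficients are often obtained from the frame autocorrelation with the Levinson-Durbin recursion. The following NumPy/SciPy sketch is one assumed implementation, not the patent's own; it returns the inverse-filter polynomial $A(z) = 1 - \sum_k a_k z^{-k}$, so filtering the frame with it yields the prediction residual $e(n)$:

```python
from scipy.signal import lfilter

def lpc_coeffs(frame: np.ndarray, p: int = 10) -> np.ndarray:
    """Levinson-Durbin recursion on the frame autocorrelation.
    Returns A = [1, c_1, ..., c_P]; the text's prediction coefficients
    are a_k = -c_k."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + p]  # lags 0..P
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0] + 1e-12  # small guard so silent frames do not divide by zero
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1 : i + 1] = a[1 : i + 1] + k * a[i - 1 :: -1][:i]
        err *= (1.0 - k * k)
    return a

def lpc_residual(frame: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Inverse-filter the frame with A(z): e(n) = s(n) - s_hat(n)."""
    return lfilter(a, [1.0], frame)
```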
Step 1023: perform a fast Fourier transform on the LPC prediction residual corresponding to each frame of audio data to obtain the corresponding LPC prediction-residual magnitude spectrum, and compute the skewness and kurtosis of the LPC prediction-residual magnitude spectrum.
In this embodiment, the LPC prediction residual is Fourier-transformed to obtain the LPC prediction-residual magnitude spectrum $X(k,i)$, and the skewness and kurtosis of the magnitude spectrum are then computed with the following formulas:
$$v_{SSk}(i) = \frac{2\sum_{k=0}^{\Gamma/2-1}\left(|X(k,i)| - \mu_{|X|}\right)^3}{\Gamma \cdot \sigma_{|X|}^{3}}$$

$$v_{SK}(i) = \frac{2\sum_{k=0}^{\Gamma/2-1}\left(|X(k,i)| - \mu_{|X|}\right)^4}{\Gamma \cdot \sigma_{|X|}^{4}} - 3$$

where $X(k,i)$ is the LPC prediction-residual magnitude spectrum of the $i$-th frame, $k$ is the frequency index, $\mu_{|X|}$ is the mean of the magnitude spectrum, $\sigma_{|X|}$ is its standard deviation, and $\Gamma$ is the Fourier-transform length.
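These two statistics can be computed, for example, with SciPy; in this sketch the FFT length of 512 is an assumed value for $\Gamma$:

```python
from scipy.stats import skew, kurtosis

def residual_spectrum_stats(residual: np.ndarray, n_fft: int = 512):
    """Skewness and excess kurtosis of the LPC-residual magnitude spectrum,
    taken over the first n_fft/2 bins as in the formulas above."""
    mag = np.abs(np.fft.fft(residual, n=n_fft))[: n_fft // 2]
    return float(skew(mag)), float(kurtosis(mag))  # kurtosis() subtracts 3
```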
In step 103, a (P+3)-dimensional feature vector is formed from the audio features.
In this embodiment, the (P+3)-dimensional feature vector is formed from the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum obtained in step 102. By LPC analysis, N frames of audio yield N groups of LPC parameters, so N feature vectors of dimension P+3 can be formed.
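Reusing the helper sketches above, each frame then yields one (P+3)-dimensional vector (13-dimensional for P = 10); negating the returned polynomial coefficients recovers the $a_k$ convention used in the text:

```python
def frame_feature_vector(frame: np.ndarray, p: int = 10,
                         n_fft: int = 512) -> np.ndarray:
    """[ZCR, a_1 ... a_P, skewness, kurtosis]: P + 3 values per frame."""
    A = lpc_coeffs(frame, p)
    e = lpc_residual(frame, A)
    sk, ku = residual_spectrum_stats(e, n_fft)
    return np.concatenate(([short_time_zcr(frame)], -A[1:], [sk, ku]))
```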
In step 104, the feature vectors are trained on with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the SVM algorithm is used to train on the feature vectors formed in step 103 and obtain the corresponding support vector machine.
The support vector machine (SVM) algorithm maps input vectors into a high-dimensional feature space through a nonlinear transformation and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions at present are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM algorithm is as follows:
Let the sample set be $(x_i, y_i)$, $i = 1, \ldots, l$, $x \in R^d$, with category labels $y \in \{-1, 1\}$. When the two classes are linearly separable, the hyperplane separating them is given by the equation:
$$\omega \cdot x + b = 0$$

The discriminant function in $d$-dimensional space is then $g(x) = \omega \cdot x + b$. After normalizing the discriminant function, the optimal-hyperplane problem becomes:

$$\min\ \varphi(\omega) = \frac{1}{2}(\omega \cdot \omega)$$

$$\text{s.t.}\quad y_i\left[(\omega \cdot x_i) + b\right] \ge 1, \quad i = 1, \ldots, l$$
In the linearly inseparable case, a nonlinear transformation $\Phi$ is needed to map the given samples into some high-dimensional feature space, and the separating hyperplane is constructed in that high-dimensional space.
The two-class linearly inseparable problem can then be solved by finding the optimal hyperplane of the formula above, so that the two classes are separated with as few errors as possible while the classification margin between them is maximal. The mathematical form of this problem is:
$$\min\ \varphi(\omega, \xi) = \frac{1}{2}(\omega \cdot \omega) + C \sum_{i=1}^{l} \xi_i$$

$$\text{s.t.}\quad y_i\left[(\omega \cdot x_i) + b\right] \ge 1 - \xi_i, \quad i = 1, \ldots, l$$

$$\xi_i \ge 0, \quad i = 1, \ldots, l$$

where $\xi_i$ are slack variables and $C$ is the penalty factor; changing the penalty factor trades off the generalization ability of the classifier against the misclassification rate.
The dual form of the above problem is:

$$\max\ W(a) = \sum_{i=1}^{l} a_i - \frac{1}{2} \sum_{i,j=1}^{l} a_i a_j y_i y_j K(x_i, x_j)$$

$$\text{s.t.}\quad 0 \le a_i \le C, \quad i = 1, \ldots, l$$

$$\sum_{i=1}^{l} a_i y_i = 0$$

where $a = [a_1, a_2, \ldots, a_l]$, the $a_i$ are Lagrange multipliers, and $K(x_i, x_j)$ is called the kernel function.
Solving the above yields the $a_i$, and the classification function $f(x)$ can be expressed as:

$$f(x) = \operatorname{sgn}\left[\sum_{i=1}^{l} a_i y_i K(x_i, x) + b\right]$$

The training samples whose $a_i$ are nonzero are the support vectors, and these support vectors satisfy

$$y_i\left[(\omega \cdot x_i) + b\right] = 1$$

from which the constant $b$ can be determined. The support vector machine above can only solve two-class classification problems; for an $m$-class problem, $m$ binary classifiers can be designed, each distinguishing one class from the rest.
The key to the SVM algorithm is the choice of kernel function; the main types are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The most widely used of these is the RBF kernel: it is applicable regardless of whether the sample set is small or large and whether the dimensionality is high or low, and it has the following advantages over the other kernels: 1) the RBF kernel can map samples into a higher-dimensional space, and the linear kernel is a special case of the Gaussian kernel; 2) compared with the polynomial kernel, the RBF kernel has fewer parameters to determine, and the number of kernel parameters directly affects the complexity of the function; moreover, when the polynomial order is high, the elements of the kernel matrix tend toward infinity or zero, whereas the RBF kernel keeps the values in a numerically manageable range, reducing computational difficulty; 3) for some parameter settings, the RBF kernel performs similarly to the sigmoid kernel.
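With scikit-learn, training such an RBF-kernel classifier on labeled frames could look like the sketch below; the placeholder training data stands in for real labeled frame features, and the C and gamma values are assumptions that would be tuned in practice:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder training data: in practice X_train holds the (P+3)-dimensional
# feature vectors of labeled frames and y_train the per-frame labels
# (1 = frame contains human voice, 0 = accompaniment only).
X_train = np.random.randn(1000, 13)
y_train = np.random.randint(0, 2, size=1000)

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
```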
In step 105, whether each frame of audio data contains human voice is identified according to the support vector machine.
In this embodiment, the support vector machine obtained in step 104 is compared with a preset support vector machine to identify whether an audio frame contains human voice. When the support vector machine matches the preset support vector machine, the frame of audio data is judged to contain human voice. Each detected frame of audio data falls into one of three cases: 1) accompaniment only; 2) accompaniment and human voice together; 3) human voice only. Cases 2) and 3) are both regarded as human voice being present, while case 1) is regarded as no human voice. For example, if the input audio is a whole song, identifying every frame of the song yields the distribution of human voice across the entire song.
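A sketch of this identification step, reusing the feature helper and the trained classifier clf from the sketches above:

```python
def detect_vocals(frames: np.ndarray, clf, p: int = 10) -> np.ndarray:
    """Classify every frame of a song: 1 marks frames judged to contain human
    voice (cases 2 and 3 above), 0 marks accompaniment-only frames (case 1)."""
    X = np.stack([frame_feature_vector(f, p) for f in frames])
    return clf.predict(X)  # per-frame vocal distribution across the song
```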
In a preferred implementation of this embodiment, the preset frame length in step 101 is 20 ms, with 50% data overlap between adjacent frames.
In another preferred implementation of this embodiment, the order P in step 102 is 10.
The first embodiment of the present invention extracts effective audio features, namely the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum, to form a feature vector, and uses machine learning to identify human voice in audio. It achieves high-precision, high-confidence identification of vocal sections in audio and provides a basic service for song-content analysis, enabling functions such as lyric synchronization, song classification, and song recommendation.
The second embodiment of the present invention is shown in Figure 4.
Fig. 4 shows the device for identifying human voice in audio in the second embodiment. The device comprises: a framing module 401, an audio-feature extraction module 402, a feature-vector module 403, a support vector machine training module 404, and an identification module 405.
The framing module 401 is configured to frame the audio data.
The framing module 401 comprises an audio detection unit 4011, a channel-audio-data extraction unit 4012, and a framing unit 4013. The audio detection unit 4011 is configured to detect whether the audio is stereo or multichannel; the channel-audio-data extraction unit 4012 is configured to, when the audio is stereo or multichannel, merge all channels into one channel and extract the audio data; the framing unit 4013 is configured to frame the audio sample sequence in the audio data according to a preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel can be extracted, or all channels can be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to the preset frame length. The frame length is generally 20 ms, with 50% data overlap between adjacent frames.
The audio-feature extraction module 402 is configured to analyze each frame of the framed audio data with linear predictive coding (LPC) of order P and extract the audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum.
The audio-feature extraction module 402 comprises a short-time zero-crossing rate extraction unit 4021, a linear predictive coding (LPC) unit 4022, and an LPC prediction-residual magnitude-spectrum analysis unit 4023. The short-time zero-crossing rate extraction unit 4021 is configured to extract, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal in the frame crosses the zero value (the horizontal coordinate axis); the linear predictive coding (LPC) unit 4022 is configured to perform LPC analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients a and the LPC prediction residual; the LPC prediction-residual magnitude-spectrum analysis unit 4023 is configured to perform a fast Fourier transform on the LPC prediction residual corresponding to each frame of audio data to obtain the corresponding LPC prediction-residual magnitude spectrum, and to compute the skewness and kurtosis of the LPC prediction-residual magnitude spectrum.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal in a frame crosses the zero value (the horizontal coordinate axis). The current value of a speech sample can be approximated by a weighted linear combination of several past samples, whose weights are the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model for a linear, time-invariant, causal, stable system and estimates the model parameters of the known speech signal s(n) under a mean-squared-error criterion. With 10 past samples used for prediction, this is order-10 linear prediction. The LPC prediction residual is Fourier-transformed to obtain the LPC prediction-residual magnitude spectrum X(k, i), and the skewness and kurtosis of the magnitude spectrum are then computed.
The feature-vector module 403 is configured to form a (P+3)-dimensional feature vector from the audio features.
In this embodiment, a 13-dimensional feature vector is formed from the short-time zero-crossing rate, the 10 LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum obtained above. By LPC analysis, N frames of audio yield N groups of LPC parameters, so N 13-dimensional feature vectors can be formed.
The support vector machine training module 404 is configured to train on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the SVM algorithm is used to train on the feature vectors obtained by the feature-vector module 403 and obtain the corresponding support vector machine.
The SVM algorithm maps input vectors into a high-dimensional feature space through a nonlinear transformation and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM algorithm here selects the linear kernel or the RBF kernel.
The identification module 405 is configured to identify human voice in each frame of audio data according to the support vector machine.
The identification module 405 (as shown in Fig. 4) comprises a comparing unit 4051 and a recognition unit 4052. The comparing unit 4051 is configured to compare the support vector machine of each audio frame with a preset support vector machine; the recognition unit 4052 is configured to, when the support vector machine matches the preset support vector machine, judge that the frame of audio data contains human voice.
In this embodiment, the support vector machine obtained by the support vector machine training module 404 from the feature vectors is compared with a preset support vector machine to identify whether an audio frame contains human voice. Each detected frame of audio data falls into one of three cases: 1) accompaniment only; 2) accompaniment and human voice together; 3) human voice only. Cases 2) and 3) are both regarded as human voice being present, while case 1) is regarded as no human voice. For example, if the input audio is a whole song, identifying every frame of the song yields the distribution of human voice across the entire song.
In the second embodiment of the present invention, the framing module frames the audio; the audio-feature extraction module extracts effective audio features from the framed audio; the feature-vector module forms feature vectors from the extracted features; the support vector machine training module trains on the feature vectors to obtain the corresponding support vector machine; and the identification module identifies human voice by comparing the obtained support vector machine with the preset one. This achieves high-precision, high-confidence identification of vocal sections in audio and provides a basic service for song-content analysis, enabling functions such as lyric synchronization, song classification, and song recommendation.
Those of ordinary skill in the art will appreciate that all or part of the steps in the above method embodiment can be carried out by hardware under the instruction of a program, and the program may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here; various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in some detail through the above embodiments, it is not limited to them, and it may include other equivalent embodiments without departing from the inventive concept; its scope is determined by the appended claims.

Claims (14)

1. A method for identifying human voice in audio, characterized by comprising:
framing the audio data;
analyzing each frame of the framed audio data with linear predictive coding (LPC) of order P and extracting audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum;
forming a (P+3)-dimensional feature vector from the audio features;
training on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine;
identifying, according to the support vector machine, whether each frame of audio data contains human voice.
2. The method for identifying human voice in audio according to claim 1, characterized in that framing the audio data comprises:
detecting whether the audio is stereo or multichannel;
when the audio is stereo or multichannel, merging all channels into one channel and extracting the audio data;
framing the audio sample sequence in the audio data according to a preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
3. The method for identifying human voice in audio according to claim 1, characterized in that the order P is 10.
4. the method for voice in identification audio frequency according to claim 1, is characterized in that, each the frame voice data after the linear predictive coding lpc analysis sub-frame processing that described use exponent number is P also extracts audio frequency characteristics and comprises:
Respectively each frame voice data is extracted to the short-time zero-crossing rate of this frame voice data, described short-time zero-crossing rate is the number of times of sound signal in the frame through zero level;
Respectively linear predictive coding lpc analysis is carried out to each frame voice data and obtain corresponding P rank LPC predictive coefficient a and LPC prediction residual;
Described LPC prediction residual corresponding to each frame voice data is respectively carried out Fast Fourier Transform (FFT) and is obtained corresponding LPC prediction residual amplitude spectrum, calculates the degree of bias and the kurtosis of described LPC prediction residual amplitude spectrum.
5. the method for voice in identification audio frequency according to claim 4, it is characterized in that, described short-time zero-crossing rate is according to following formulae discovery:
Z i = 1 2 Σ n = 1 N | sgn ( S n ) - sgn ( S n - 1 ) |
Wherein, natural number n is the sequence number of the audio sample value in the i-th frame, and its maximal value is N, S nbe the n-th sampled value, sgn () is sign function, audio sample value S nsign function for positive number is 1, audio sample value S nfor the sign function of negative and 0 is all-1, namely sgn ( S n ) = 1 , S n > 0 - 1 , S n ≤ 0 .
6. the method for voice in identification audio frequency according to claim 4, it is characterized in that, the degree of bias of described LPC prediction residual amplitude spectrum and kurtosis are according to following formulae discovery:
v S S k ( i ) = 2 Σ k = 0 κ / 2 - 1 ( | X ( k , i ) | - μ | X | ) 3 Γ · σ | X | 3 v S K ( i ) = 2 Σ k = 0 κ / 2 - 1 ( | X ( k , i ) | - μ | X | ) 4 Γ · σ | X | 3 - 3
Wherein, X (k, i) is the LPC prediction residual amplitude spectrum of the i-th frame, and k is frequency, μ | X|for amplitude spectrum average, for amplitude spectrum variance, Γ is Fourier transform length.
7. The method for identifying human voice in audio according to claim 1, characterized in that identifying according to the support vector machine whether each frame of audio data contains human voice comprises:
comparing the support vector machine with a preset support vector machine;
when the support vector machine matches the preset support vector machine, judging that the frame of audio data contains human voice.
8. A device for identifying human voice in audio, characterized in that the device comprises: a framing module, an audio-feature extraction module, a feature-vector module, a support vector machine training module, and an identification module;
wherein the framing module is configured to frame the audio data;
the audio-feature extraction module is configured to analyze each frame of the framed audio data with linear predictive coding (LPC) of order P and extract audio features, the audio features comprising the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction-residual magnitude spectrum;
the feature-vector module is configured to form a (P+3)-dimensional feature vector from the audio features;
the support vector machine training module is configured to train on the feature vectors with a support vector machine (SVM) algorithm to obtain a corresponding support vector machine;
and the identification module is configured to identify, according to the support vector machine, whether each frame of audio data contains human voice.
9. The device for identifying human voice in audio according to claim 8, characterized in that the framing module comprises an audio detection unit, a channel-audio-data extraction unit, and a framing unit;
wherein the audio detection unit is configured to detect whether the audio is stereo or multichannel;
the channel-audio-data extraction unit is configured to, when the audio is stereo or multichannel, merge all channels into one channel and extract the audio data;
and the framing unit is configured to frame the audio sample sequence in the audio data according to a preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
10. The device for identifying human voice in audio according to claim 8, characterized in that the order P is 10.
11. The device for identifying human voice in audio according to claim 8, characterized in that the audio-feature extraction module comprises a short-time zero-crossing rate extraction unit, a linear predictive coding (LPC) unit, and an LPC prediction-residual magnitude-spectrum analysis unit;
wherein the short-time zero-crossing rate extraction unit is configured to extract, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal in the frame crosses zero level;
the linear predictive coding (LPC) unit is configured to perform LPC analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients a and the LPC prediction residual;
and the LPC prediction-residual magnitude-spectrum analysis unit is configured to perform a fast Fourier transform on the LPC prediction residual corresponding to each frame of audio data to obtain the corresponding LPC prediction-residual magnitude spectrum, and to compute the skewness and kurtosis of the LPC prediction-residual magnitude spectrum.
12. The device for identifying human voice in audio according to claim 11, characterized in that the short-time zero-crossing rate is computed with the following formula:

$$Z_i = \frac{1}{2}\sum_{n=1}^{N}\left|\operatorname{sgn}(S_n) - \operatorname{sgn}(S_{n-1})\right|$$

where the natural number $n$ indexes the audio samples in the $i$-th frame, with maximum value $N$; $S_n$ is the $n$-th sample value, and $\operatorname{sgn}(\cdot)$ is the sign function, equal to 1 for positive sample values and -1 for negative or zero sample values, i.e. $\operatorname{sgn}(S_n) = \begin{cases} 1, & S_n > 0 \\ -1, & S_n \le 0 \end{cases}$.
13. The device for identifying human voice in audio according to claim 11, characterized in that the skewness and kurtosis of the LPC prediction-residual magnitude spectrum are computed with the following formulas:

$$v_{SSk}(i) = \frac{2\sum_{k=0}^{\Gamma/2-1}\left(|X(k,i)| - \mu_{|X|}\right)^3}{\Gamma \cdot \sigma_{|X|}^{3}}, \qquad v_{SK}(i) = \frac{2\sum_{k=0}^{\Gamma/2-1}\left(|X(k,i)| - \mu_{|X|}\right)^4}{\Gamma \cdot \sigma_{|X|}^{4}} - 3$$

where $X(k,i)$ is the LPC prediction-residual magnitude spectrum, $k$ is the frequency index, $\mu_{|X|}$ is the mean of the magnitude spectrum, $\sigma_{|X|}$ is its standard deviation, and $\Gamma$ is the Fourier-transform length.
14. The device for identifying human voice in audio according to claim 8, characterized in that the identification module comprises a comparing unit and a recognition unit;
wherein the comparing unit is configured to compare the support vector machine of each audio frame with a preset support vector machine;
and the recognition unit is configured to, when the support vector machine matches the preset support vector machine, judge that the frame of audio data contains human voice.
CN201310429920.0A 2013-09-18 2013-09-18 Method and device for identifying human voice in audio Active CN103489445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310429920.0A CN103489445B (en) 2013-09-18 2013-09-18 Method and device for identifying human voice in audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310429920.0A CN103489445B (en) 2013-09-18 2013-09-18 Method and device for identifying human voice in audio

Publications (2)

Publication Number Publication Date
CN103489445A CN103489445A (en) 2014-01-01
CN103489445B (en) 2016-03-30

Family

ID=49829625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310429920.0A Active CN103489445B (en) 2013-09-18 2013-09-18 Method and device for identifying human voice in audio

Country Status (1)

Country Link
CN (1) CN103489445B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104916288B (en) * 2014-03-14 2019-01-18 深圳Tcl新技术有限公司 The method and device of the prominent processing of voice in a kind of audio
CN106571150B (en) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 Method and system for recognizing human voice in music
CN107886956B (en) * 2017-11-13 2020-12-11 广州酷狗计算机科技有限公司 Audio recognition method and device and computer storage medium
CN108417204A (en) * 2018-02-27 2018-08-17 四川云淞源科技有限公司 Information security processing method based on big data
US11341983B2 (en) 2018-09-17 2022-05-24 Honeywell International Inc. System and method for audio noise reduction
CN109545191B (en) * 2018-11-15 2022-11-25 电子科技大学 Real-time detection method for initial position of human voice in song
CN109410968B (en) * 2018-11-15 2022-12-09 电子科技大学 Efficient detection method for initial position of voice in song
CN109994126A (en) * 2019-03-11 2019-07-09 北京三快在线科技有限公司 Audio message segmentation method, device, storage medium and electronic equipment
CN112309352A (en) * 2020-01-15 2021-02-02 北京字节跳动网络技术有限公司 Audio information processing method, apparatus, device and medium
CN113257242A (en) * 2021-04-06 2021-08-13 杭州远传新业科技有限公司 Voice broadcast suspension method, device, equipment and medium in self-service voice service
CN114220427A (en) * 2021-10-29 2022-03-22 深圳市锐明技术股份有限公司 Method and device for identifying call command, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1405693A (en) * 2001-08-08 2003-03-26 南京北极星软件有限公司 Computer human-sound identifying method and telephone communication system with human-sound identifying function
CN101149923A (en) * 2006-09-22 2008-03-26 富士通株式会社 Speech recognition method, speech recognition apparatus and computer program
CN101236742A (en) * 2008-03-03 2008-08-06 中兴通讯股份有限公司 Music/ non-music real-time detection method and device
CN102708859A (en) * 2012-06-20 2012-10-03 太仓博天网络科技有限公司 Real-time music voice identification system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100930584B1 (en) * 2007-09-19 2009-12-09 한국전자통신연구원 Speech discrimination method and apparatus using voiced sound features of human speech

Also Published As

Publication number Publication date
CN103489445A (en) 2014-01-01

Similar Documents

Publication Publication Date Title
CN103489445B (en) Method and device for identifying human voice in audio
Soumaya et al. The detection of Parkinson disease using the genetic algorithm and SVM classifier
CN101506874B (en) Feeling detection method, and feeling detection device
US20070131095A1 (en) Method of classifying music file and system therefor
Semwal et al. Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models
CN105702251A (en) Speech emotion identifying method based on Top-k enhanced audio bag-of-word model
Vuppala et al. Improved consonant–vowel recognition for low bit‐rate coded speech
Park et al. Voice Activity Detection in Noisy Environments Based on Double‐Combined Fourier Transform and Line Fitting
Hossain et al. Emovoice: Finding my mood from my voice signal
Ghosal et al. Automatic male-female voice discrimination
Krey et al. Music and timbre segmentation by recursive constrained K-means clustering
Krishnamoorthy et al. Hierarchical audio content classification system using an optimal feature selection algorithm
Elnagar et al. Automatic classification of reciters of quranic audio clips
Salhi et al. Robustness of auditory teager energy cepstrum coefficients for classification of pathological and normal voices in noisy environments
Shaik et al. Sentiment analysis with word-based Urdu speech recognition
Mezghani et al. Multifeature speech/music discrimination based on mid-term level statistics and supervised classifiers
Xie et al. Investigation of acoustic and visual features for frog call classification
Büker et al. Angular margin softmax loss and its variants for double compressed amr audio detection
Markov et al. Music genre classification using Gaussian process models
Yang et al. Combining auditory perception and visual features for regional recognition of Chinese folk songs
Karlos et al. Speech recognition combining MFCCs and image features
Kaushik et al. Vocalist identification in audio songs using convolutional neural network
Karlos et al. Optimized active learning strategy for audiovisual speaker recognition
Wichern et al. Automatic audio tagging using covariate shift adaptation
Dhakal Novel Architectures for Human Voice and Environmental Sound Recognitionusing Machine Learning Algorithms

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
C41 Transfer of patent application or patent right or utility model
GR01 Patent grant
TA01 Transfer of patent application right

Effective date of registration: 20160309

Address after: 100085, Beijing, Haidian District Qinghe Anning East Road No. 23, building two, 18, 2108

Applicant after: BEIJING YINZHIBANG CULTURE TECHNOLOGY Co.,Ltd.

Address before: 100085 Beijing, Haidian District, No. ten on the ground floor, No. 10 Baidu building, layer three

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20220414

Address after: 518057 3305, floor 3, building 1, aerospace building, No. 51, Gaoxin South ninth Road, high tech Zone community, Yuehai street, Nanshan District, Shenzhen, Guangdong

Patentee after: Shenzhen Taile Culture Technology Co.,Ltd.

Address before: 2108, floor 2, building 23, No. 18, anningzhuang East Road, Qinghe, Haidian District, Beijing 100085

Patentee before: BEIJING YINZHIBANG CULTURE TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right