Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the accompanying drawings show only the parts related to the present invention rather than the entire structure.
Figure 1 illustrates the first embodiment of the present invention.
Fig. 1 shows a method for identifying voice in audio according to the first embodiment of the present invention. The flow 100 is implemented as follows:
In step 101, the audio data is divided into frames.
Step 101 (as shown in Figure 2) specifically comprises:
Step 1011: detecting whether the audio is two-channel (stereo) or multichannel.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel may be extracted, or all channels may be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to a preset frame length.
Step 1012: when the audio is stereo or multichannel, merging all channels into a single channel and extracting the audio data.
Step 1013: framing the audio sample sequence in the audio data according to the preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the audio sample sequence in the audio data is framed according to the preset frame length. Because the characteristics of an audio signal, and the parameters that characterize its essential features, vary with time over the signal as a whole, digital signal processing techniques intended for stationary signals cannot be applied directly. Although the audio signal is time-varying, within a short interval (generally taken to be 10-30 ms) its characteristics remain essentially unchanged, i.e., relatively stable, so it can be regarded as a quasi-stationary process. This is the basis of "short-time analysis": the audio signal is divided into frames, and the characteristic parameters of each frame are analyzed one by one.
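The channel merging and framing described in steps 1011-1013 can be sketched as follows. This is a minimal illustration, not the claimed implementation; `frame_audio` is a hypothetical name, and the 20 ms frame length and 50% overlap defaults follow the preferred implementation mentioned later in this embodiment.

```python
import numpy as np

def frame_audio(samples, sr, frame_ms=20, overlap=0.5):
    """Down-mix multichannel audio and split it into overlapping frames.

    samples: ndarray of shape (n,) for mono or (n, channels) otherwise.
    """
    if samples.ndim == 2:          # stereo / multichannel: merge all channels
        samples = samples.mean(axis=1)
    frame_len = int(sr * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])
```

For example, 800 samples at 8 kHz with 20 ms frames and 50% overlap yield nine frames of 160 samples each.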
In step 102, each frame of audio data after framing is analyzed using linear predictive coding (LPC) of order P to extract audio features; the audio features comprise the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum.
Step 102 (as shown in Figure 3) specifically comprises:
Step 1021: extracting, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal crosses the zero level within the frame.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal within a frame crosses the zero level. It can help distinguish unvoiced from voiced sounds, because the zero-crossing rate of an audio signal is higher in the high-frequency band and lower in the low-frequency band. The short-time zero-crossing rate is calculated according to the following formula:

Z_i = (1/2) Σ_{n=1}^{N} |sgn(S_n) − sgn(S_{n−1})|

where the natural number n is the index of an audio sample within the i-th frame, its maximum value is N, S_n is the n-th sample, and sgn(·) is the sign function: the sign function of a positive sample S_n is 1, and that of a negative or zero sample S_n is −1, i.e.

sgn(x) = 1 for x > 0, sgn(x) = −1 for x ≤ 0
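As a sketch, the short-time zero-crossing rate of one frame can be computed as follows; `short_time_zcr` is a hypothetical name, and the sign convention (zero grouped with the negatives) follows the definition given above.

```python
import numpy as np

def short_time_zcr(frame):
    """Short-time zero-crossing rate of one audio frame.

    Implements Z = (1/2) * sum_n |sgn(S_n) - sgn(S_{n-1})| with
    sgn(x) = 1 for x > 0 and -1 for x <= 0.
    """
    s = np.where(frame > 0, 1, -1)
    return int(np.sum(np.abs(s[1:] - s[:-1])) // 2)
```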
Step 1022: performing linear predictive coding (LPC) analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients and the LPC prediction residual.
In this embodiment, linear predictive coding (LPC, Linear Predictive Coding) derives the parameters of the vocal-tract excitation and transfer function by analyzing the audio waveform; encoding these parameters instead of the waveform itself greatly reduces the amount of audio data. The current value of a speech sample can be approximated by a weighted linear combination of several past samples, and the weighting coefficients of this linear combination are called the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model of a linear, time-invariant, causal, stable system, and estimates the model parameters from the known speech signal s(n) using the mean-square-error criterion. If P past sample values are used for prediction, this is P-th order linear prediction. Suppose the current sample value s(n) is predicted from a weighted sum of the P past sample values {s(n−1), s(n−2), ..., s(n−P)}; the predicted signal ŝ(n) is then:

ŝ(n) = Σ_{k=1}^{P} a_k s(n−k)

where the weighting coefficients a_k are called the LPC prediction coefficients. The prediction error e(n) is then:

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{P} a_k s(n−k)

For the best prediction, the short-time mean prediction error ε must be minimized:

ε = E[e²(n)] = min

Letting φ(i, k) = E[s(n−i) s(n−k)], the minimum ε can be expressed as:

ε_min = φ(0, 0) − Σ_{k=1}^{P} a_k φ(0, k)

The closer the error is to zero, the more accurate the linear prediction is in the minimum mean-square-error sense, and the prediction coefficients can be calculated accordingly.
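The derivation above can be sketched in code. This is a minimal illustration under stated assumptions, not the claimed implementation: `lpc_coefficients` is a hypothetical name, the normal equations are solved directly with NumPy rather than by the more efficient Levinson-Durbin recursion, and a tiny diagonal term is added for numerical stability.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """P-th order LPC coefficients and the prediction residual.

    Solves the autocorrelation normal equations (least-squares fit of
    each sample from its P predecessors) directly.
    """
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    # residual e(n) = s(n) - sum_k a_k s(n-k)
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return a, frame - pred
```

For a first-order autoregressive signal s(n) = 0.9 s(n−1), the recovered coefficient is close to 0.9 and the residual is close to zero, as the derivation predicts.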
Step 1023: performing a fast Fourier transform (FFT) on the LPC prediction residual of each frame to obtain the corresponding LPC prediction residual magnitude spectrum, and calculating the skewness and kurtosis of that magnitude spectrum.
In this embodiment, the LPC prediction residual is Fourier-transformed to obtain the LPC prediction residual magnitude spectrum X(k, i), and the skewness and kurtosis of the magnitude spectrum are then obtained. The formulas are as follows:

skew_i = (1/Γ) Σ_{k=1}^{Γ} (|X(k, i)| − μ_|X|)³ / σ³

kurt_i = (1/Γ) Σ_{k=1}^{Γ} (|X(k, i)| − μ_|X|)⁴ / σ⁴

where X(k, i) is the LPC prediction residual magnitude spectrum of the i-th frame, k is the frequency index, μ_|X| is the mean of the magnitude spectrum, σ² is the variance of the magnitude spectrum, and Γ is the Fourier transform length.
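The two moment formulas above can be sketched directly; `residual_spectrum_stats` is a hypothetical name, and a real-input FFT is assumed.

```python
import numpy as np

def residual_spectrum_stats(residual):
    """Skewness and kurtosis of the LPC residual magnitude spectrum.

    The FFT of the residual gives |X(k)|; skewness is the normalized
    third central moment and kurtosis the normalized fourth.
    """
    mag = np.abs(np.fft.rfft(residual))
    mu = mag.mean()
    sigma = mag.std()
    skew = np.mean((mag - mu) ** 3) / sigma ** 3
    kurt = np.mean((mag - mu) ** 4) / sigma ** 4
    return skew, kurt
```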
In step 103, a (P+3)-dimensional feature vector is formed from the audio features.
In this embodiment, the (P+3)-dimensional feature vector is formed from the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum obtained in step 102. Through LPC analysis, N frames of audio yield N sets of LPC parameters, i.e., N (P+3)-dimensional feature vectors can be formed.
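Steps 1021-1023 and the vector assembly of step 103 can be combined into one self-contained sketch. All names here are hypothetical illustrations, and the LPC coefficients are obtained by solving the autocorrelation normal equations directly, as one plausible realization.

```python
import numpy as np

def frame_feature_vector(frame, order=10):
    """(P+3)-dim feature vector for one frame: 1 zero-crossing rate,
    P LPC coefficients, and the skewness and kurtosis of the LPC
    residual magnitude spectrum."""
    # 1) short-time zero-crossing rate
    s = np.where(frame > 0, 1, -1)
    zcr = np.sum(np.abs(s[1:] - s[:-1])) / 2
    # 2) LPC coefficients via the autocorrelation normal equations
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    # 3) residual, its magnitude spectrum, and its skewness and kurtosis
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    mag = np.abs(np.fft.rfft(frame - pred))
    mu, sigma = mag.mean(), mag.std()
    skew = np.mean((mag - mu) ** 3) / sigma ** 3
    kurt = np.mean((mag - mu) ** 4) / sigma ** 4
    return np.concatenate([[zcr], a, [skew, kurt]])
```

With P = 10, as in the preferred implementation below, each frame yields a 13-dimensional vector.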
In step 104, the feature vectors are trained using a support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the feature vectors formed in step 103 are trained using the support vector machine (SVM) algorithm to obtain the corresponding support vector machine.
The support vector machine (SVM) algorithm maps the input vectors into a high-dimensional feature space via a nonlinear transformation, and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions at present are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM (Support Vector Machine) algorithm is as follows:
Let the sample set be (x_i, y_i), i = 1, ..., l, with x ∈ R^d and class label y ∈ {−1, 1}. When the two classes are linearly separable, the equation of the hyperplane dividing the two classes is:

ω·x + b = 0

The discriminant function in the d-dimensional space is then g(x) = ω·x + b. Normalizing the discriminant function, the optimal separating hyperplane problem is:

min (1/2)||ω||²  subject to  y_i[(ω·x_i) + b] ≥ 1, i = 1, ..., l
In the linearly non-separable case, a nonlinear transformation Φ is needed to map the given pattern samples to some high-dimensional feature space, and the separating hyperplane is constructed in that high-dimensional feature space.
The two-class problem in the linearly non-separable case can be solved by finding the optimal separating hyperplane of the formula above, so that the two classes are separated with as few errors as possible while the margin between them is maximized. The mathematical form of this problem is:

min (1/2)||ω||² + C Σ_{i=1}^{l} ξ_i
subject to  y_i[(ω·x_i) + b] ≥ 1 − ξ_i, i = 1, ..., l
            ξ_i ≥ 0, i = 1, ..., l

where the ξ_i are slack variables and C is the penalty factor; changing the penalty factor trades off between the generalization ability of the classifier and the misclassification rate.
The dual form of the problem above is:

max Σ_{i=1}^{l} a_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} a_i a_j y_i y_j K(x_i, x_j)
subject to  Σ_{i=1}^{l} a_i y_i = 0,  0 ≤ a_i ≤ C, i = 1, ..., l

where a = [a_1, a_2, ..., a_l], the a_i are Lagrange multipliers, and K(x_i, x_j) = Φ(x_i)·Φ(x_j) is called the kernel function.
Solving the formula above yields the a_i, and the classification function f(x) can be expressed as:

f(x) = sgn( Σ_{i=1}^{l} a_i y_i K(x_i, x) + b )

The training samples whose a_i are nonzero become the support vectors, and these support vectors satisfy

y_i[(ω·x_i) + b] = 1

from which the constant b can be determined. The support vector machine above can only solve two-class classification problems; for an m-class classification problem, m binary classifiers can be designed, each distinguishing one class from all the others.
The key to the SVM algorithm is choosing the type of kernel function; the main choices are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The most widely used of these is the RBF kernel: whether the sample set is small or large, high-dimensional or low-dimensional, the RBF kernel is applicable. Compared with the other kernels it has the following advantages: 1) the RBF kernel can map samples into a higher-dimensional space, and the linear kernel is a special case of the Gaussian kernel; 2) compared with the polynomial kernel, the RBF kernel has fewer parameters to determine, and the number of kernel parameters directly affects the complexity of the function; moreover, when the polynomial order is high, the elements of the kernel matrix tend toward infinity or zero, whereas the RBF kernel keeps the values bounded, reducing numerical difficulties; 3) for certain parameters, the RBF kernel and the sigmoid kernel have similar performance.
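The RBF kernel and the classification function f(x) above can be illustrated with a minimal sketch. The names `rbf_kernel` and `svm_decision` are hypothetical, and the support vectors, multipliers a_i, and bias b are assumed to come from an already trained SVM rather than being computed here.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Decision value f(x) = sum_i a_i y_i K(x_i, x) + b; the sign of
    f(x) gives the predicted class."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b
```

For example, with support vectors at 0 (label +1) and 2 (label −1) and equal multipliers, points near 0 receive a positive decision value and points near 2 a negative one.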
In step 105, whether each frame of audio data contains voice is identified according to the support vector machine.
In this embodiment, the support vector machine obtained in step 104 is compared with a preset support vector machine to identify whether an audio frame contains voice. When the support vector machine matches the preset support vector machine, it is determined that the frame of audio data contains voice. Each detected frame of audio data may be one of the following three cases: 1) containing only accompaniment; 2) containing both accompaniment and voice; 3) containing only voice. Cases 2) and 3) are both regarded as voice being present, while case 1) is regarded as no voice. For example, if the input audio is a whole song, identification over the whole song yields the voice distribution of the whole song.
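The per-frame decisions can be assembled into the song-level voice distribution described above. The following is a minimal sketch assuming 20 ms frames with 50% overlap (hence a 10 ms hop); `voice_segments` is a hypothetical helper name.

```python
def voice_segments(frame_labels, frame_ms=20, hop_ms=10):
    """Turn per-frame decisions into (start, end) time segments.

    frame_labels: iterable of +1 (voice present, cases 2 and 3 above)
    or -1 (accompaniment only, case 1).
    """
    segments, start = [], None
    for i, lab in enumerate(frame_labels):
        t = i * hop_ms / 1000.0
        if lab > 0 and start is None:
            start = t                      # voice segment begins
        elif lab <= 0 and start is not None:
            segments.append((start, t))    # voice segment ends
            start = None
    if start is not None:                  # segment runs to the end
        end = (len(frame_labels) - 1) * hop_ms / 1000.0 + frame_ms / 1000.0
        segments.append((start, end))
    return segments
```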
In a preferred implementation of this embodiment, the preset frame length chosen in step 101 is 20 ms, with a 50% data overlap between adjacent frames.
In another preferred implementation of this embodiment, the order P in step 102 is chosen to be 10.
In the first embodiment of the present invention, effective audio features, namely the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum, are extracted to form feature vectors, and machine learning is used to identify the voice in the audio. This achieves high-precision, high-confidence identification of voice segments in audio, providing a basic service for song content analysis and thereby further enabling functions such as synchronized lyrics, song classification, and song recommendation.
Figure 4 illustrates the second embodiment of the present invention.
Fig. 4 shows a device for identifying voice in audio according to the second embodiment of the present invention. The device comprises: a framing module 401, an audio feature extraction module 402, a feature vector module 403, a support vector machine training module 404, and an identification module 405.
The framing module 401 is configured to divide the audio data into frames.
The framing module 401 comprises an audio detection unit 4011, a channel audio data extraction unit 4012, and a framing unit 4013. The audio detection unit 4011 is configured to detect whether the audio is stereo or multichannel; the channel audio data extraction unit 4012 is configured to, when the audio is stereo or multichannel, merge all channels into a single channel and extract the audio data; the framing unit 4013 is configured to frame the audio sample sequence in the audio data according to the preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel may be extracted, or all channels may be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to the preset frame length. The frame length is generally 20 ms, with a 50% data overlap between adjacent frames.
The audio feature extraction module 402 is configured to analyze each frame of audio data after framing using linear predictive coding (LPC) of order P to extract audio features; the audio features comprise the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum.
The audio feature extraction module 402 comprises a short-time zero-crossing rate extraction unit 4021, a linear predictive coding (LPC) unit 4022, and an LPC prediction residual magnitude spectrum analysis unit 4023. The short-time zero-crossing rate extraction unit 4021 is configured to extract, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal within the frame crosses the zero level, i.e., the horizontal axis; the linear predictive coding (LPC) unit 4022 is configured to perform LPC analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients and the LPC prediction residual; the LPC prediction residual magnitude spectrum analysis unit 4023 is configured to perform a fast Fourier transform on the LPC prediction residual of each frame to obtain the corresponding LPC prediction residual magnitude spectrum, and to calculate the skewness and kurtosis of that magnitude spectrum.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal within a frame crosses the zero level, i.e., the horizontal axis. The current value of a speech sample can be approximated by a weighted linear combination of several past samples, and the weighting coefficients of this linear combination are called the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model of a linear, time-invariant, causal, stable system, and estimates the model parameters from the known speech signal s(n) using the mean-square-error criterion. Using 10 past sample values for prediction gives 10th-order linear prediction. The LPC prediction residual is Fourier-transformed to obtain the LPC prediction residual magnitude spectrum X(k, i), and the skewness and kurtosis of the magnitude spectrum are then obtained.
The feature vector module 403 is configured to form a (P+3)-dimensional feature vector from the audio features.
In this embodiment, a 13-dimensional feature vector is formed from the short-time zero-crossing rate, the 10 LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum obtained as described above. Through LPC analysis, N frames of audio yield N sets of LPC parameters, i.e., N 13-dimensional feature vectors can be formed.
The support vector machine training module 404 is configured to train the feature vectors using the support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the feature vectors obtained by the feature vector module 403 are trained using the support vector machine (SVM) algorithm to obtain the corresponding support vector machine.
The support vector machine (SVM) algorithm maps the input vectors into a high-dimensional feature space via a nonlinear transformation, and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions at present are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM algorithm here selects the linear kernel or the radial basis function kernel.
The identification module 405 is configured to identify the voice in each frame of audio data according to the support vector machine.
The identification module 405 (as shown in Figure 4) comprises a comparison unit 4051 and a recognition unit 4052. The comparison unit 4051 is configured to compare the support vector machine of each audio frame with a preset support vector machine; the recognition unit 4052 is configured to determine, when the support vector machine matches the preset support vector machine, that the frame of audio data contains voice.
In this embodiment, the support vector machine obtained by the support vector machine training module 404 training the feature vectors is compared with the preset support vector machine to identify whether an audio frame contains voice. Each detected frame of audio data may be one of the following three cases: 1) containing only accompaniment; 2) containing both accompaniment and voice; 3) containing only voice. Cases 2) and 3) are both regarded as voice being present, while case 1) is regarded as no voice. For example, if the input audio is a whole song, identification over the whole song yields the voice distribution of the whole song.
In the second embodiment of the present invention, the framing module divides the audio into frames; the audio feature extraction module then extracts effective audio features from the framed audio; the feature vector module forms the obtained audio features into feature vectors; the support vector machine training module trains the formed feature vectors to obtain a corresponding support vector machine; and the identification module identifies voice by comparing the obtained support vector machine with the preset support vector machine. This achieves high-precision, high-confidence identification of voice segments in audio, providing a basic service for song content analysis and thereby further enabling functions such as synchronized lyrics, song classification, and song recommendation.
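The module composition of the device can be sketched as a pipeline of the five modules 401-405; the class and parameter names below are hypothetical illustrations of one possible structure, not the claimed device.

```python
class VoiceRecognitionDevice:
    """Pipeline mirroring modules 401-405 of Fig. 4."""

    def __init__(self, framer, extractor, vectorizer, trainer, recognizer):
        self.framer = framer            # framing module 401
        self.extractor = extractor      # audio feature extraction module 402
        self.vectorizer = vectorizer    # feature vector module 403
        self.trainer = trainer          # SVM training module 404
        self.recognizer = recognizer    # identification module 405

    def run(self, audio):
        frames = self.framer(audio)
        features = [self.extractor(f) for f in frames]
        vectors = self.vectorizer(features)
        model = self.trainer(vectors)
        return self.recognizer(model, vectors)
```

Each stage can be swapped independently, which matches the unit-by-unit decomposition described for modules 401 and 402 above.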
Those of ordinary skill in the art will appreciate that all or part of the steps in the method embodiment above may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, the storage medium referred to here being, for example, ROM/RAM, a magnetic disk, or an optical disc.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here; various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments; without departing from the concept of the present invention, it may also include other equivalent embodiments, and the scope of the present invention is determined by the appended claims.