Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the accompanying drawings show only the parts related to the present invention rather than the entire structure.
Figure 1 illustrates the first embodiment of the present invention.
Fig. 1 shows a method for identifying voice in audio according to the first embodiment of the present invention. The flow 100 is implemented as follows:
In step 101, the audio data is divided into frames.
Step 101 (as shown in Figure 2) specifically comprises:
Step 1011: detecting whether the audio is two-channel (stereo) or multichannel.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel may be extracted, or all channels may be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to a preset frame length.
Step 1012: when the audio is stereo or multichannel, merging all channels into a single channel and extracting the audio data.
Step 1013: framing the audio sample sequence in the audio data according to the preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the audio sample sequence in the audio data is framed according to the preset frame length. Because the characteristics of an audio signal, and the parameters that characterize its essential features, vary with time over the signal as a whole, digital signal processing techniques intended for stationary signals cannot be applied directly. Although the audio signal is time-varying, within a short interval (generally taken to be 10-30 ms) its characteristics remain essentially unchanged, i.e., relatively stable, so it can be regarded as a quasi-stationary process. This is the basis of "short-time analysis": the audio signal is divided into frames, and the characteristic parameters of each frame are analyzed one by one.
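The channel merging and framing described in steps 1011-1013 can be sketched as follows. This is a minimal illustration, not the claimed implementation; `frame_audio` is a hypothetical name, and the 20 ms frame length and 50% overlap defaults follow the preferred implementation mentioned later in this embodiment.

```python
import numpy as np

def frame_audio(samples, sr, frame_ms=20, overlap=0.5):
    """Down-mix multichannel audio and split it into overlapping frames.

    samples: ndarray of shape (n,) for mono or (n, channels) otherwise.
    """
    if samples.ndim == 2:          # stereo / multichannel: merge all channels
        samples = samples.mean(axis=1)
    frame_len = int(sr * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])
```

For example, 800 samples at 8 kHz with 20 ms frames and 50% overlap yield nine frames of 160 samples each.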
In step 102, each frame of audio data after framing is analyzed using linear predictive coding (LPC) of order P to extract audio features; the audio features comprise the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum.
Step 102 (as shown in Figure 3) specifically comprises:
Step 1021: extracting, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal crosses the zero level within the frame.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal within a frame crosses the zero level. It can help distinguish unvoiced from voiced sounds, because the zero-crossing rate of an audio signal is higher in the high-frequency band and lower in the low-frequency band. The short-time zero-crossing rate is calculated according to the following formula:

Z_i = (1/2) Σ_{n=1}^{N} |sgn(S_n) − sgn(S_{n−1})|

where the natural number n is the index of an audio sample within the i-th frame, its maximum value is N, S_n is the n-th sample, and sgn(·) is the sign function: the sign function of a positive sample S_n is 1, and that of a negative or zero sample S_n is −1, i.e.

sgn(x) = 1 for x > 0, sgn(x) = −1 for x ≤ 0
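As a sketch, the short-time zero-crossing rate of one frame can be computed as follows; `short_time_zcr` is a hypothetical name, and the sign convention (zero grouped with the negatives) follows the definition given above.

```python
import numpy as np

def short_time_zcr(frame):
    """Short-time zero-crossing rate of one audio frame.

    Implements Z = (1/2) * sum_n |sgn(S_n) - sgn(S_{n-1})| with
    sgn(x) = 1 for x > 0 and -1 for x <= 0.
    """
    s = np.where(frame > 0, 1, -1)
    return int(np.sum(np.abs(s[1:] - s[:-1])) // 2)
```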
Step 1022: performing linear predictive coding (LPC) analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients and the LPC prediction residual.
In this embodiment, linear predictive coding (LPC, Linear Predictive Coding) derives the parameters of the vocal-tract excitation and transfer function by analyzing the audio waveform; encoding these parameters instead of the waveform itself greatly reduces the amount of audio data. The current value of a speech sample can be approximated by a weighted linear combination of several past samples, and the weighting coefficients of this linear combination are called the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model of a linear, time-invariant, causal, stable system, and estimates the model parameters from the known speech signal s(n) using the mean-square-error criterion. If P past sample values are used for prediction, this is P-th order linear prediction. Suppose the current sample value s(n) is predicted from a weighted sum of the P past sample values {s(n−1), s(n−2), ..., s(n−P)}; the predicted signal ŝ(n) is then:

ŝ(n) = Σ_{k=1}^{P} a_k s(n−k)

where the weighting coefficients a_k are called the LPC prediction coefficients. The prediction error e(n) is then:

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{P} a_k s(n−k)

For the best prediction, the short-time mean prediction error ε must be minimized:

ε = E[e²(n)] = min

Letting φ(i, k) = E[s(n−i) s(n−k)], the minimum ε can be expressed as:

ε_min = φ(0, 0) − Σ_{k=1}^{P} a_k φ(0, k)

The closer the error is to zero, the more accurate the linear prediction is in the minimum mean-square-error sense, and the prediction coefficients can be calculated accordingly.
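The derivation above can be sketched in code. This is a minimal illustration under stated assumptions, not the claimed implementation: `lpc_coefficients` is a hypothetical name, the normal equations are solved directly with NumPy rather than by the more efficient Levinson-Durbin recursion, and a tiny diagonal term is added for numerical stability.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """P-th order LPC coefficients and the prediction residual.

    Solves the autocorrelation normal equations (least-squares fit of
    each sample from its P predecessors) directly.
    """
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    # residual e(n) = s(n) - sum_k a_k s(n-k)
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    return a, frame - pred
```

For a first-order autoregressive signal s(n) = 0.9 s(n−1), the recovered coefficient is close to 0.9 and the residual is close to zero, as the derivation predicts.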
Step 1023: performing a fast Fourier transform (FFT) on the LPC prediction residual of each frame to obtain the corresponding LPC prediction residual magnitude spectrum, and calculating the skewness and kurtosis of that magnitude spectrum.
In this embodiment, the LPC prediction residual is Fourier-transformed to obtain the LPC prediction residual magnitude spectrum X(k, i), and the skewness and kurtosis of the magnitude spectrum are then obtained. The formulas are as follows:

skew_i = (1/Γ) Σ_{k=1}^{Γ} (|X(k, i)| − μ_|X|)³ / σ³

kurt_i = (1/Γ) Σ_{k=1}^{Γ} (|X(k, i)| − μ_|X|)⁴ / σ⁴

where X(k, i) is the LPC prediction residual magnitude spectrum of the i-th frame, k is the frequency index, μ_|X| is the mean of the magnitude spectrum, σ² is the variance of the magnitude spectrum, and Γ is the Fourier transform length.
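The two moment formulas above can be sketched directly; `residual_spectrum_stats` is a hypothetical name, and a real-input FFT is assumed.

```python
import numpy as np

def residual_spectrum_stats(residual):
    """Skewness and kurtosis of the LPC residual magnitude spectrum.

    The FFT of the residual gives |X(k)|; skewness is the normalized
    third central moment and kurtosis the normalized fourth.
    """
    mag = np.abs(np.fft.rfft(residual))
    mu = mag.mean()
    sigma = mag.std()
    skew = np.mean((mag - mu) ** 3) / sigma ** 3
    kurt = np.mean((mag - mu) ** 4) / sigma ** 4
    return skew, kurt
```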
In step 103, a (P+3)-dimensional feature vector is formed from the audio features.
In this embodiment, the (P+3)-dimensional feature vector is formed from the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum obtained in step 102. Through LPC analysis, N frames of audio yield N sets of LPC parameters, i.e., N (P+3)-dimensional feature vectors can be formed.
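Steps 1021-1023 and the vector assembly of step 103 can be combined into one self-contained sketch. All names here are hypothetical illustrations, and the LPC coefficients are obtained by solving the autocorrelation normal equations directly, as one plausible realization.

```python
import numpy as np

def frame_feature_vector(frame, order=10):
    """(P+3)-dim feature vector for one frame: 1 zero-crossing rate,
    P LPC coefficients, and the skewness and kurtosis of the LPC
    residual magnitude spectrum."""
    # 1) short-time zero-crossing rate
    s = np.where(frame > 0, 1, -1)
    zcr = np.sum(np.abs(s[1:] - s[:-1])) / 2
    # 2) LPC coefficients via the autocorrelation normal equations
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    # 3) residual, its magnitude spectrum, and its skewness and kurtosis
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]
    mag = np.abs(np.fft.rfft(frame - pred))
    mu, sigma = mag.mean(), mag.std()
    skew = np.mean((mag - mu) ** 3) / sigma ** 3
    kurt = np.mean((mag - mu) ** 4) / sigma ** 4
    return np.concatenate([[zcr], a, [skew, kurt]])
```

With P = 10, as in the preferred implementation below, each frame yields a 13-dimensional vector.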
In step 104, the feature vectors are trained using a support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the feature vectors formed in step 103 are trained using the support vector machine (SVM) algorithm to obtain the corresponding support vector machine.
The support vector machine (SVM) algorithm maps the input vectors into a high-dimensional feature space via a nonlinear transformation, and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions at present are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM (Support Vector Machine) algorithm is as follows:
Let the sample set be (x_i, y_i), i = 1, ..., l, with x ∈ R^d and class label y ∈ {−1, 1}. When the two classes are linearly separable, the equation of the hyperplane dividing the two classes is:

ω·x + b = 0

The discriminant function in the d-dimensional space is then g(x) = ω·x + b. Normalizing the discriminant function, the optimal separating hyperplane problem is:

min (1/2)||ω||²  subject to  y_i[(ω·x_i) + b] ≥ 1, i = 1, ..., l
In the linearly non-separable case, a nonlinear transformation Φ is needed to map the given pattern samples to some high-dimensional feature space, and the separating hyperplane is constructed in that high-dimensional feature space.
The two-class problem in the linearly non-separable case can be solved by finding the optimal separating hyperplane of the formula above, so that the two classes are separated with as few errors as possible while the margin between them is maximized. The mathematical form of this problem is:

min (1/2)||ω||² + C Σ_{i=1}^{l} ξ_i
subject to  y_i[(ω·x_i) + b] ≥ 1 − ξ_i, i = 1, ..., l
            ξ_i ≥ 0, i = 1, ..., l

where the ξ_i are slack variables and C is the penalty factor; changing the penalty factor trades off between the generalization ability of the classifier and the misclassification rate.
The dual form of the problem above is:

max Σ_{i=1}^{l} a_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} a_i a_j y_i y_j K(x_i, x_j)
subject to  Σ_{i=1}^{l} a_i y_i = 0,  0 ≤ a_i ≤ C, i = 1, ..., l

where a = [a_1, a_2, ..., a_l], the a_i are Lagrange multipliers, and K(x_i, x_j) = Φ(x_i)·Φ(x_j) is called the kernel function.
Solving the formula above yields the a_i, and the classification function f(x) can be expressed as:

f(x) = sgn( Σ_{i=1}^{l} a_i y_i K(x_i, x) + b )

The training samples whose a_i are nonzero become the support vectors, and these support vectors satisfy

y_i[(ω·x_i) + b] = 1

from which the constant b can be determined. The support vector machine above can only solve two-class classification problems; for an m-class classification problem, m binary classifiers can be designed, each distinguishing one class from all the others.
The key to the SVM algorithm is choosing the type of kernel function; the main choices are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The most widely used of these is the RBF kernel: whether the sample set is small or large, high-dimensional or low-dimensional, the RBF kernel is applicable. Compared with the other kernels it has the following advantages: 1) the RBF kernel can map samples into a higher-dimensional space, and the linear kernel is a special case of the Gaussian kernel; 2) compared with the polynomial kernel, the RBF kernel has fewer parameters to determine, and the number of kernel parameters directly affects the complexity of the function; moreover, when the polynomial order is high, the elements of the kernel matrix tend toward infinity or zero, whereas the RBF kernel keeps the values bounded, reducing numerical difficulties; 3) for certain parameters, the RBF kernel and the sigmoid kernel have similar performance.
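The RBF kernel and the classification function f(x) above can be illustrated with a minimal sketch. The names `rbf_kernel` and `svm_decision` are hypothetical, and the support vectors, multipliers a_i, and bias b are assumed to come from an already trained SVM rather than being computed here.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Decision value f(x) = sum_i a_i y_i K(x_i, x) + b; the sign of
    f(x) gives the predicted class."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b
```

For example, with support vectors at 0 (label +1) and 2 (label −1) and equal multipliers, points near 0 receive a positive decision value and points near 2 a negative one.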
In step 105, whether each frame of audio data contains voice is identified according to the support vector machine.
In this embodiment, the support vector machine obtained in step 104 is compared with a preset support vector machine to identify whether an audio frame contains voice. When the support vector machine matches the preset support vector machine, it is determined that the frame of audio data contains voice. Each detected frame of audio data may be one of the following three cases: 1) containing only accompaniment; 2) containing both accompaniment and voice; 3) containing only voice. Cases 2) and 3) are both regarded as voice being present, while case 1) is regarded as no voice. For example, if the input audio is a whole song, identification over the whole song yields the voice distribution of the whole song.
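The per-frame decisions can be assembled into the song-level voice distribution described above. The following is a minimal sketch assuming 20 ms frames with 50% overlap (hence a 10 ms hop); `voice_segments` is a hypothetical helper name.

```python
def voice_segments(frame_labels, frame_ms=20, hop_ms=10):
    """Turn per-frame decisions into (start, end) time segments.

    frame_labels: iterable of +1 (voice present, cases 2 and 3 above)
    or -1 (accompaniment only, case 1).
    """
    segments, start = [], None
    for i, lab in enumerate(frame_labels):
        t = i * hop_ms / 1000.0
        if lab > 0 and start is None:
            start = t                      # voice segment begins
        elif lab <= 0 and start is not None:
            segments.append((start, t))    # voice segment ends
            start = None
    if start is not None:                  # segment runs to the end
        end = (len(frame_labels) - 1) * hop_ms / 1000.0 + frame_ms / 1000.0
        segments.append((start, end))
    return segments
```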
In a preferred implementation of this embodiment, the preset frame length chosen in step 101 is 20 ms, with a 50% data overlap between adjacent frames.
In another preferred implementation of this embodiment, the order P in step 102 is chosen to be 10.
In the first embodiment of the present invention, effective audio features, namely the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum, are extracted to form feature vectors, and machine learning is used to identify the voice in the audio. This achieves high-precision, high-confidence identification of voice segments in audio, providing a basic service for song content analysis and thereby further enabling functions such as synchronized lyrics, song classification, and song recommendation.
Figure 4 illustrates the second embodiment of the present invention.
Fig. 4 shows a device for identifying voice in audio according to the second embodiment of the present invention. The device comprises: a framing module 401, an audio feature extraction module 402, a feature vector module 403, a support vector machine training module 404, and an identification module 405.
The framing module 401 is configured to divide the audio data into frames.
The framing module 401 comprises an audio detection unit 4011, a channel audio data extraction unit 4012, and a framing unit 4013. The audio detection unit 4011 is configured to detect whether the audio is stereo or multichannel; the channel audio data extraction unit 4012 is configured to, when the audio is stereo or multichannel, merge all channels into a single channel and extract the audio data; the framing unit 4013 is configured to frame the audio sample sequence in the audio data according to the preset frame length, dividing the audio sample sequence into a sequence of audio data frames.
In this embodiment, the input audio may be mono, stereo, or multichannel. If the audio is detected to be stereo or multichannel, the left channel may be extracted, or all channels may be merged into one, before framing; if the audio is detected to be mono, the audio sample sequence is framed directly according to the preset frame length. The frame length is generally 20 ms, with a 50% data overlap between adjacent frames.
The audio feature extraction module 402 is configured to analyze each frame of audio data after framing using linear predictive coding (LPC) of order P to extract audio features; the audio features comprise the short-time zero-crossing rate, the P LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum.
The audio feature extraction module 402 comprises a short-time zero-crossing rate extraction unit 4021, a linear predictive coding (LPC) unit 4022, and an LPC prediction residual magnitude spectrum analysis unit 4023. The short-time zero-crossing rate extraction unit 4021 is configured to extract, for each frame of audio data, the short-time zero-crossing rate of that frame, the short-time zero-crossing rate being the number of times the audio signal within the frame crosses the zero level, i.e., the horizontal axis; the linear predictive coding (LPC) unit 4022 is configured to perform LPC analysis on each frame of audio data to obtain the corresponding P LPC prediction coefficients and the LPC prediction residual; the LPC prediction residual magnitude spectrum analysis unit 4023 is configured to perform a fast Fourier transform on the LPC prediction residual of each frame to obtain the corresponding LPC prediction residual magnitude spectrum, and to calculate the skewness and kurtosis of that magnitude spectrum.
In this embodiment, the short-time zero-crossing rate is the number of times the audio signal within a frame crosses the zero level, i.e., the horizontal axis. The current value of a speech sample can be approximated by a weighted linear combination of several past samples, and the weighting coefficients of this linear combination are called the linear prediction coefficients (LPC coefficients). LPC analysis builds an all-pole model of a linear, time-invariant, causal, stable system, and estimates the model parameters from the known speech signal s(n) using the mean-square-error criterion. Using 10 past sample values for prediction gives 10th-order linear prediction. The LPC prediction residual is Fourier-transformed to obtain the LPC prediction residual magnitude spectrum X(k, i), and the skewness and kurtosis of the magnitude spectrum are then obtained.
The feature vector module 403 is configured to form a (P+3)-dimensional feature vector from the audio features.
In this embodiment, a 13-dimensional feature vector is formed from the short-time zero-crossing rate, the 10 LPC prediction coefficients, and the skewness and kurtosis of the LPC prediction residual magnitude spectrum obtained as described above. Through LPC analysis, N frames of audio yield N sets of LPC parameters, i.e., N 13-dimensional feature vectors can be formed.
The support vector machine training module 404 is configured to train the feature vectors using the support vector machine (SVM) algorithm to obtain a corresponding support vector machine.
In this embodiment, the feature vectors obtained by the feature vector module 403 are trained using the support vector machine (SVM) algorithm to obtain the corresponding support vector machine.
The support vector machine (SVM) algorithm maps the input vectors into a high-dimensional feature space via a nonlinear transformation, and then finds the optimal separating hyperplane in that space; the nonlinear transformation is realized by defining a suitable kernel function. The main kernel functions at present are the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. The SVM algorithm here selects the linear kernel or the radial basis function kernel.
The identification module 405 is configured to identify the voice in each frame of audio data according to the support vector machine.
The identification module 405 (as shown in Figure 4) comprises a comparison unit 4051 and a recognition unit 4052. The comparison unit 4051 is configured to compare the support vector machine of each audio frame with a preset support vector machine; the recognition unit 4052 is configured to determine, when the support vector machine matches the preset support vector machine, that the frame of audio data contains voice.
In this embodiment, the support vector machine obtained by the support vector machine training module 404 training the feature vectors is compared with the preset support vector machine to identify whether an audio frame contains voice. Each detected frame of audio data may be one of the following three cases: 1) containing only accompaniment; 2) containing both accompaniment and voice; 3) containing only voice. Cases 2) and 3) are both regarded as voice being present, while case 1) is regarded as no voice. For example, if the input audio is a whole song, identification over the whole song yields the voice distribution of the whole song.
In the second embodiment of the present invention, the framing module divides the audio into frames; the audio feature extraction module then extracts effective audio features from the framed audio; the feature vector module forms the obtained audio features into feature vectors; the support vector machine training module trains the formed feature vectors to obtain a corresponding support vector machine; and the identification module identifies voice by comparing the obtained support vector machine with the preset support vector machine. This achieves high-precision, high-confidence identification of voice segments in audio, providing a basic service for song content analysis and thereby further enabling functions such as synchronized lyrics, song classification, and song recommendation.
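The module composition of the device can be sketched as a pipeline of the five modules 401-405; the class and parameter names below are hypothetical illustrations of one possible structure, not the claimed device.

```python
class VoiceRecognitionDevice:
    """Pipeline mirroring modules 401-405 of Fig. 4."""

    def __init__(self, framer, extractor, vectorizer, trainer, recognizer):
        self.framer = framer            # framing module 401
        self.extractor = extractor      # audio feature extraction module 402
        self.vectorizer = vectorizer    # feature vector module 403
        self.trainer = trainer          # SVM training module 404
        self.recognizer = recognizer    # identification module 405

    def run(self, audio):
        frames = self.framer(audio)
        features = [self.extractor(f) for f in frames]
        vectors = self.vectorizer(features)
        model = self.trainer(vectors)
        return self.recognizer(model, vectors)
```

Each stage can be swapped independently, which matches the unit-by-unit decomposition described for modules 401 and 402 above.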
Those of ordinary skill in the art will appreciate that all or part of the steps in the method embodiment above may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, the storage medium referred to here being, for example, ROM/RAM, a magnetic disk, or an optical disc.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described here; various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments; without departing from the concept of the present invention, it may also include other equivalent embodiments, and the scope of the present invention is determined by the appended claims.