CN102708861A - Poor speech recognition method based on support vector machine - Google Patents


Info

Publication number
CN102708861A
CN102708861A
Authority
CN
China
Prior art keywords
bad
frame
voice
speech
recognition method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101973771A
Other languages
Chinese (zh)
Inventor
傅政军
姚金良
王小华
黄金海
周建政
周渝清
严俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Tiange Technology (hangzhou) Co Ltd
Hangzhou Dianzi University
Hangzhou Electronic Science and Technology University
Original Assignee
JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Tiange Technology (hangzhou) Co Ltd
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD, Tiange Technology (hangzhou) Co Ltd, Hangzhou Electronic Science and Technology University filed Critical JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Priority to CN2012101973771A priority Critical patent/CN102708861A/en
Publication of CN102708861A publication Critical patent/CN102708861A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an objectionable speech recognition method based on a support vector machine. The method comprises the steps of: first acquiring an input voice stream, decoding it into an original speech signal, and performing preprocessing; applying windowing and framing to the preprocessed speech data; extracting shifted-delta-cepstral (SDC) features from each speech frame; classifying the SDC features with a Gaussian mixture model; then re-classifying the candidate frames labelled as objectionable speech with a support vector machine (SVM) classifier to confirm the final objectionable speech frames; and extracting and storing an objectionable speech segment according to the number of objectionable speech frames within a given time window. The Gaussian mixture model enables fast classification for extracting candidate objectionable frames, while the support vector machine classifier improves classification accuracy.

Description

Objectionable speech recognition method based on a support vector machine
Technical field
The invention belongs to the field of intelligent audio processing, and specifically relates to an objectionable speech recognition method based on a support vector machine.
Background technology
Objectionable speech recognition is the automatic detection, from a real-time voice stream, of segments that contain objectionable speech, where objectionable speech refers to various kinds of pornographic audio. With the arrival of the Web 2.0 era, and because the publication of web content lacks an effective monitoring mechanism, a large amount of pornographic information has appeared on the Internet. Effectively suppressing the spread of pornographic information on the network is therefore an important task. According to the government principle of "whoever operates is responsible, whoever provides access is responsible", objectionable information must be filtered, and a large number of Web 2.0 sites face the problem of filtering pornographic information efficiently. Automatic recognition of objectionable speech and video therefore has broad application and industrialization prospects. Objectionable speech recognition can be combined with objectionable video detection to identify objectionable multimedia information, and from a technical standpoint it is an important means of promoting the healthy development of the network environment.
There are many published results on objectionable image and video recognition, but fewer on objectionable speech recognition. The main existing methods are the following:
(1) The first method extracts the audio track from a video file, applies Hamming windowing to divide it into 0.02-second short-time processing frames, and extracts features such as MFCC coefficients from each frame. Short-time energy is then used to separate silent frames from non-silent frames; a single Gaussian model further divides the non-silent frames into four classes: music, speech, mixed speech-music, and ambient sound; finally, a hidden Markov model identifies, among the remaining speech and mixed speech-music frames, the audio frames that may contain pornographic content. (Ji Pengyu, Audio-assisted pornographic video identification, Beijing University of Posts and Telecommunications, Master's thesis, 2011.)
(2) The second method proposes, on top of the MFCC coefficients, a repeated curve-like spectrum feature that captures the constant repetition of speech frequencies, uses it as the feature for objectionable speech recognition, and recognizes objectionable speech with an SVM classifier. (JaeDeok Lim et al., Classification and Detection of Objectionable Sounds Using Repeated Curve-like Spectrum Feature, 2011 International Conference on Information Science and Applications (ICISA), pp. 1-5, 2011.)
The key technologies of objectionable speech recognition are the extraction of objectionable speech features and the choice of classifier. Because an objectionable speech recognition system cannot know in advance under what conditions the input speech was recorded, much objectionable speech contains heavy background sound, such as music; it is therefore of paramount importance that the feature extraction be robust to various kinds of noise. MFCC coefficients are widely used in speech recognition, but they are not the most effective speech feature, and many newer, more robust speech features have been proposed. Regarding the choice of classifier, a single Gaussian model can be used to recognize objectionable speech, but it can only model a single kind of objectionable speech, whereas in practice there are many kinds. Applying a support vector machine directly to objectionable speech recognition faces the difficulty of high computational complexity. A hidden Markov model improves recognition accuracy by modeling the relation between adjacent speech frames, but the conditional probability of normal speech turning into objectionable speech is difficult to model realistically.
Summary of the invention
The object of the invention is to address the low robustness of existing objectionable speech recognition methods, and to provide a method with low algorithmic complexity and better precision that detects objectionable speech segments in a live network voice stream.
The steps of the inventive method are as follows:
Step (1): acquire the input voice stream, decode it into an original speech signal, and perform preprocessing; the preprocessing mainly comprises the following steps:
1) if the input audio is a stereo speech signal, perform mono mixdown, i.e. mix the stereo channels into mono speech;
2) if the sampling rate of the input audio is inconsistent with the method's predefined sampling rate, adjust the sampling rate, i.e. convert the original sampling rate of the audio into the predefined one;
3) if the quantization bit depth of the input audio is inconsistent with the method's predefined bit depth, requantize, i.e. convert the original quantized values into new ones.
Step (2): apply windowing and framing to the preprocessed speech data;
Step (3): extract shifted-delta-cepstral features from every speech frame;
Step (4): classify the shifted-delta-cepstral features with a Gaussian mixture model, whose training samples include objectionable speech segments of many kinds;
Step (5): classify the candidate frames labelled as objectionable speech with a support vector machine, and confirm the final objectionable speech frames;
Step (6): extract and store an objectionable speech segment according to the number of objectionable speech frames within a given time span.
The present invention is an objectionable speech recognition method based on a support vector machine that uses a Gaussian mixture model and a support vector machine together. The Gaussian mixture model enables fast classification to extract candidate objectionable speech frames, while the support vector machine classifier improves classification accuracy. Compared with a single Gaussian model, a Gaussian mixture model can effectively model multiple kinds of objectionable speech. The Gaussian-mixture classifier learns its model parameters from samples, and the support vector machine classifier obtains its support vectors through sample learning. Finally, objectionable speech information is extracted according to the number of objectionable speech frames within a period of time. Tested on a sample speech corpus, the recall of the method exceeds 70%.
Description of drawings
Fig. 1 is a flowchart of the inventive method.
Embodiment
Embodiments of the invention are described in detail below with reference to the accompanying drawing.
Fig. 1 is a flow block diagram representing the processing flow of the objectionable speech recognition method of the invention.
The speech signal handled by this method can come from the audio track of a decoded video stream, or be an independent voice stream. The voice stream may be in any of several coded formats, for example WAV or MP3, as long as audio in that format can be decoded. The method also supports various sampling rates, various quantization bit depths, and stereo input. The present embodiment is described in three parts: frame feature extraction, classifier learning, and real-time recognition. Frame feature extraction is the basis of both classifier learning and real-time classification, and real-time recognition operates once the classifier parameters have been learned.
Preprocessing is a prerequisite step of the recognition; it mainly makes the sampling rate, quantization bit depth, and channel type of the audio consistent with the method's predefined values. In the present embodiment, the predefined sampling rate of the processable speech signal is 16 kHz, the quantization depth is 16 bits, and the channel type is mono. For an input audio signal, the method therefore first checks whether it is mono speech; for a stereo signal, the multiple channels are mixed down into a single-channel audio signal. The method then checks whether the quantization depth of the speech signal is 16 bits: audio with more than 16 bits is compressed to 16 bits, and audio with fewer than 16 bits is expanded to 16 bits. Finally, audio with an inconsistent sampling rate is resampled to obtain an audio signal with a sampling rate of 16 kHz. Preprocessing keeps the audio to be detected consistent with the audio used for classifier training in sampling rate, quantization depth, and channel type, which improves recall.
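The preprocessing described above (mono mixdown, resampling to 16 kHz, requantization to 16 bits) can be sketched as follows. The function name and the linear-interpolation resampler are illustrative assumptions, not part of the patent:

```python
import numpy as np

def preprocess(samples: np.ndarray, rate: int, target_rate: int = 16000) -> np.ndarray:
    """Mono mixdown, resampling and 16-bit requantization (illustrative sketch)."""
    x = samples.astype(np.float64)
    # 1) stereo -> mono: average the channels
    if x.ndim == 2:
        x = x.mean(axis=1)
    # 2) sampling-rate adjustment: simple linear-interpolation resampler
    if rate != target_rate:
        n_out = int(round(len(x) * target_rate / rate))
        t_in = np.arange(len(x)) / rate
        t_out = np.arange(n_out) / target_rate
        x = np.interp(t_out, t_in, x)
    # 3) requantization: scale into the signed 16-bit range
    peak = np.abs(x).max() or 1.0   # guard against all-zero input
    return np.clip(x / peak * 32767, -32768, 32767).astype(np.int16)
```

A production system would use a proper anti-aliased resampler, but the interface matches the three preprocessing sub-steps of Step (1).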
The windowing-and-framing part slices the voice stream: an audio signal of fixed length is cut from the stream at fixed intervals as one frame of the audio stream, and feature extraction and classification are then applied to each extracted frame. To improve the detection rate, the windows partially overlap. The present embodiment cuts a 4-second audio signal every 2 seconds, so adjacent frames overlap by half.
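The half-overlapping framing can be sketched as follows (a hypothetical helper; the window and hop defaults come from the embodiment):

```python
import numpy as np

def split_frames(x: np.ndarray, rate: int = 16000,
                 win_s: float = 4.0, hop_s: float = 2.0) -> list:
    """Cut the stream into fixed-length frames: a 4 s window taken
    every 2 s, so adjacent frames overlap by half."""
    win, hop = int(win_s * rate), int(hop_s * rate)
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
```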
The shifted-delta-cepstral extraction part computes, on the basis of the framing step, the shifted-delta-cepstral (SDC) parameters of every frame. The feature is formed by concatenating the delta cepstra of several speech frames computed on top of the MFCC features (Torres-Carrasquillo P. A. et al., Approaches to Language Identification Using Gaussian Mixture Models and Shifted Delta Cepstral Features, ICSLP-2002, pp. 89-92). It is computed as follows.
Let the N-dimensional MFCC feature of frame t be
c(t) = [c_1(t), c_2(t), ..., c_N(t)].
The delta of the i-th block of frame t is computed as
Δc(t, i) = c(t + iP + d) − c(t + iP − d),
where P is the frame shift between adjacent delta-cepstrum blocks, d is the frame spacing used in the delta computation, and k is the number of delta-cepstrum blocks.
According to the formula above, the k block deltas are concatenated into one shifted-delta-cepstral feature (SDC):
SDC(t) = [Δc(t, 0), Δc(t, 1), ..., Δc(t, k−1)].
In the present embodiment, the parameters N, d, P, k are 7, 1, 3, 7 respectively, giving a 49-dimensional SDC feature; appending the 7-dimensional MFCC feature yields a feature vector of 56 dimensions in total.
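A minimal sketch of the SDC computation, assuming an MFCC matrix is already available (the patent does not specify the MFCC implementation; the function name is illustrative). With N=7, d=1, P=3, k=7, each output row has the 56 dimensions stated above:

```python
import numpy as np

def sdc(mfcc: np.ndarray, d: int = 1, P: int = 3, k: int = 7) -> np.ndarray:
    """Shifted-delta-cepstral features from a (T, N) MFCC matrix.
    Returns (T', N*(k+1)): the static MFCCs plus k stacked deltas,
    computed only for frames where every shifted index is valid."""
    T, N = mfcc.shape
    t_max = T - ((k - 1) * P + d)   # last frame with a full right context
    feats = []
    for t in range(d, t_max):
        # delta of block i: c(t + iP + d) - c(t + iP - d)
        deltas = [mfcc[t + i * P + d] - mfcc[t + i * P - d]
                  for i in range(k)]
        feats.append(np.concatenate([mfcc[t]] + deltas))
    return np.asarray(feats)
```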
The classifier learning part comprises two tasks: learning the Gaussian mixture model, and obtaining the support vectors of the SVM classifier.
1) Gaussian mixture model part
A Gaussian mixture model is a weighted mixture of several Gaussian components. A single Gaussian model is
N(x; μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)),
where μ is the mean vector, Σ is the covariance matrix, and x is the observed feature vector. The Gaussian mixture model is
p(x | θ) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i),
where w_i is the weight of the i-th Gaussian component. To use the Gaussian mixture model as a classifier for objectionable speech, the parameters
θ = {w_i, μ_i, Σ_i}, i = 1, ..., M
must be learned from objectionable speech samples. The expectation-maximization (EM) algorithm is commonly used for this. The idea of EM is to start from a given parameter set θ and estimate a new θ under which the likelihood of the data is larger; the new θ then becomes the current θ and the computation iterates. Before the algorithm starts, the number of Gaussian components must be fixed; in the present embodiment the component count M is 4. The core of the algorithm is computing the new θ from the current θ, and the update formula of each parameter is as follows.
Let the posterior probability of the i-th Gaussian for sample x_t be
γ_i(x_t) = w_i N(x_t; μ_i, Σ_i) / Σ_{j=1}^{M} w_j N(x_t; μ_j, Σ_j).
The update formula of the component weight is
w_i = (1/T) Σ_{t=1}^{T} γ_i(x_t).
The update formula of the mean vector is
μ_i = Σ_{t} γ_i(x_t) x_t / Σ_{t} γ_i(x_t).
The update formula of the covariance matrix is
Σ_i = Σ_{t} γ_i(x_t) (x_t − μ_i)(x_t − μ_i)^T / Σ_{t} γ_i(x_t).
These update formulas guarantee a monotone increase of the model likelihood; the updates stop once a convergence criterion is reached, yielding the final model parameters θ.
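The EM updates above can be sketched with diagonal covariances (a simplification; the patent does not state the covariance structure, and all names here are illustrative):

```python
import numpy as np

def em_gmm(X: np.ndarray, M: int = 4, iters: int = 50, seed: int = 0):
    """EM fit of a diagonal-covariance GMM with M components,
    following the weight / mean / covariance update formulas."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)
    mu = X[rng.choice(T, M, replace=False)]          # init means at samples
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))      # init with data variance
    for _ in range(iters):
        # E-step: posterior gamma[t, i] of component i for sample t
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2 / var).sum(-1)
                         + np.log(var).sum(-1) + D * np.log(2 * np.pi))
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)    # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        Nk = gamma.sum(axis=0)
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

At detection time, the mixture likelihood p(x | θ) of a frame's feature vector under the objectionable-speech GMM is compared against a threshold.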
2) SVM classifier part
The SVM classifier is currently among the most effective machine-learning classification methods, particularly when the sample size is not very large. Its basic principle is to map the feature points of a low-dimensional space into a high-dimensional feature space in which the points become linearly separable according to their labels. The decision plane of the SVM classifier depends mainly on the support vectors estimated from the samples.
In the present embodiment, the open-source LibSVM library is used to train the SVM classifier. The adopted feature vector consists of the shifted-delta-cepstral parameters, and a normalization operation is applied both in training and in classification. The positive training samples are objectionable speech frames obtained from objectionable speech; they cover various types of objectionable speech as well as objectionable speech frames recorded under diverse settings. The negative samples are speech frames cut at random from chat speech and songs. For training, the number of positive samples is 2314 and the number of negative samples is 3457. The parameters used to learn the support-vector model are as follows: the misclassification penalty coefficient is set to 4000, and the kernel function is an RBF; cross-validation then yields a gamma of 1.2453. LibSVM training produces 1711 support vectors, of which 513 are positive, and these support vectors and the SVM classifier parameters are saved to a file. For real-time detection, it suffices to load this model to classify in real time.
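The LibSVM training step can be approximated with scikit-learn's SVC, which wraps LibSVM. The feature arrays below are random stand-ins for the real 56-dimensional SDC features, while C=4000, the RBF kernel, and gamma=1.2453 follow the embodiment:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data: random vectors replace real SDC features.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (200, 56))   # "objectionable" frames
X_neg = rng.normal(-1.0, 1.0, (300, 56))  # chat / song frames
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), np.zeros(300)])

scaler = StandardScaler().fit(X)          # the "normalization operation"
clf = SVC(C=4000, kernel="rbf", gamma=1.2453).fit(scaler.transform(X), y)
print(clf.n_support_)                     # support-vector counts per class
```

Saving `scaler` and `clf` corresponds to storing the support vectors and classifier parameters in a file for later real-time loading.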
The real-time recognition part operates after the classifier models learned in advance have been loaded. The input speech is preprocessed and its shifted-delta-cepstral features are extracted, and the features are then fed to the classifiers. From the input SDC feature vector, the Gaussian mixture model computes, according to the mixture formula p(x | θ) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i), the probability that the feature vector belongs to the objectionable-speech Gaussian mixture model.
The present embodiment uses a probability threshold to decide whether an input speech frame is objectionable speech information.
After the Gaussian-mixture classification, the frames that are candidates for objectionable speech are classified again by the SVM classifier, which decides whether each input SDC feature vector is objectionable speech. The method then counts the number of objectionable speech frames confirmed by the SVM classifier within a given time span, and uses a threshold (th) to decide whether the speech segment is objectionable. In the present embodiment, the given time span is set to 30 seconds and the threshold th to 8. If the segment is objectionable, it is encoded in WAV format and stored for subsequent manual verification.
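The final segment decision (counting SVM-positive frames within a 30-second span against the threshold th=8) might look like this; the 15-frame window assumes the 2-second hop from the framing step, and the function name is illustrative:

```python
from collections import deque

def segment_decision(frame_flags, window_frames: int = 15, th: int = 8) -> list:
    """Sliding count of SVM-positive frames: with a 2 s frame hop,
    15 frames cover the 30 s span of the embodiment. A segment is
    flagged when at least th of the last window_frames frames were
    judged objectionable. `frame_flags` is an iterable of booleans."""
    window = deque(maxlen=window_frames)
    alarms = []
    for flag in frame_flags:
        window.append(bool(flag))
        alarms.append(sum(window) >= th)
    return alarms
```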

Claims (5)

1. An objectionable speech recognition method based on a support vector machine, characterized in that the method comprises the following steps:
Step 1: acquiring an input voice stream, decoding the voice stream into an original speech signal, and performing preprocessing operations;
Step 2: applying windowing and framing to the preprocessed speech data;
Step 3: extracting shifted-delta-cepstral features from every speech frame;
Step 4: classifying the shifted-delta-cepstral features with a Gaussian mixture model;
Step 5: classifying the candidate frames labelled as objectionable speech with an SVM classifier, and confirming the final objectionable speech frames;
Step 6: extracting and storing an objectionable speech segment according to the number of objectionable speech frames within a given time span.
2. The objectionable speech recognition method of claim 1, characterized in that the preprocessing operations comprise mono mixdown, sampling-rate adjustment, and requantization.
3. The objectionable speech recognition method of claim 1, characterized in that the features used by the Gaussian mixture model in step 4 are the shifted-delta-cepstral parameters.
4. The objectionable speech recognition method of claim 1, characterized in that the feature vectors used by the SVM classifier in step 5 are the shifted-delta-cepstral parameters.
5. The objectionable speech recognition method of claim 1, characterized in that the number of objectionable speech frames in step 6 is counted over a number of frames preceding the current detection position.
CN2012101973771A 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine Pending CN102708861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101973771A CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101973771A CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Publications (1)

Publication Number Publication Date
CN102708861A true CN102708861A (en) 2012-10-03

Family

ID=46901563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101973771A Pending CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Country Status (1)

Country Link
CN (1) CN102708861A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038804A (en) * 2013-03-05 2014-09-10 三星电子(中国)研发中心 Subtitle synchronization device and subtitle synchronization method based on speech recognition
CN104616658A (en) * 2015-01-14 2015-05-13 重庆金美通信有限责任公司 Echo canceling implementing method supporting a plurality of voice coding systems
CN106231409A (en) * 2016-08-05 2016-12-14 黄新勇 Method for real-time monitoring in the radio network of audio frequency and system
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106782505A (en) * 2017-02-21 2017-05-31 南京工程学院 A kind of method based on electric discharge voice recognition high-tension switch cabinet state
CN109147769A (en) * 2018-10-17 2019-01-04 北京猎户星空科技有限公司 A kind of Language Identification, device, translator, medium and equipment
CN110597139A (en) * 2019-09-25 2019-12-20 珠海格力电器股份有限公司 Self-learning method and system of cooking appliance
CN110853648A (en) * 2019-10-30 2020-02-28 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PEDRO A. TORRES-CARRASQUILLO1,ET AL.: "Approaches to Language Identification using Gaussian Mixture Models and Shifted Delta Cepstral Features", 《PROC.INT’L. CONF. SPOKEN LANGUAGE PROCESSING》 *
姬鹏宇: "音频辅助的色***检测", 《中国科技论文在线》 *
张素文等: "基于SVM和GMM的视频运动对象分割算法", 《武汉理工大学学报•信息与管理工程版》, vol. 31, no. 6, 31 December 2009 (2009-12-31), pages 857 - 860 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038804B (en) * 2013-03-05 2017-09-29 三星电子(中国)研发中心 Captioning synchronization apparatus and method based on speech recognition
CN104038804A (en) * 2013-03-05 2014-09-10 三星电子(中国)研发中心 Subtitle synchronization device and subtitle synchronization method based on speech recognition
CN104616658A (en) * 2015-01-14 2015-05-13 重庆金美通信有限责任公司 Echo canceling implementing method supporting a plurality of voice coding systems
CN106297770B (en) * 2016-08-04 2019-11-22 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106231409A (en) * 2016-08-05 2016-12-14 黄新勇 Method for real-time monitoring in the radio network of audio frequency and system
CN106782505A (en) * 2017-02-21 2017-05-31 南京工程学院 A kind of method based on electric discharge voice recognition high-tension switch cabinet state
CN109147769A (en) * 2018-10-17 2019-01-04 北京猎户星空科技有限公司 A kind of Language Identification, device, translator, medium and equipment
CN110597139A (en) * 2019-09-25 2019-12-20 珠海格力电器股份有限公司 Self-learning method and system of cooking appliance
CN110853648A (en) * 2019-10-30 2020-02-28 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110853648B (en) * 2019-10-30 2022-05-03 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121003