CN102708861A - Poor speech recognition method based on support vector machine - Google Patents


Info

Publication number
CN102708861A
CN102708861A
Authority
CN
China
Prior art keywords
bad
frame
voice
speech
recognition method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101973771A
Other languages
Chinese (zh)
Inventor
傅政军
姚金良
王小华
黄金海
周建政
周渝清
严俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Tiange Technology (hangzhou) Co Ltd
Hangzhou Dianzi University
Hangzhou Electronic Science and Technology University
Original Assignee
JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Tiange Technology (hangzhou) Co Ltd
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD, Tiange Technology (hangzhou) Co Ltd, Hangzhou Electronic Science and Technology University filed Critical JINHUA JIUYUEWOBA NETWORK TECHNOLOGY CO LTD
Priority to CN2012101973771A priority Critical patent/CN102708861A/en
Publication of CN102708861A publication Critical patent/CN102708861A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an objectionable speech recognition method based on a support vector machine. The method comprises the steps of: first acquiring an input voice stream, decoding it into an original speech signal, and performing preprocessing; applying windowing and framing to the preprocessed speech data; extracting shifted-delta-cepstral (SDC) features from each speech frame; classifying the SDC features with a Gaussian mixture model; then re-classifying the candidate frames labelled as objectionable speech with a support vector machine (SVM) classifier to confirm the final objectionable speech frames; and extracting and storing an objectionable speech segment according to the number of objectionable speech frames within a given time window. The Gaussian mixture model enables fast classification for extracting candidate objectionable frames, while the support vector machine classifier improves classification accuracy.

Description

Objectionable speech recognition method based on a support vector machine
Technical field
The invention belongs to the field of intelligent audio processing, and specifically relates to an objectionable speech recognition method based on a support vector machine.
Background technology
Objectionable speech recognition is the automatic detection, from a real-time voice stream, of segments that contain objectionable speech, where objectionable speech refers to various kinds of pornographic audio. With the arrival of the Web 2.0 era, and because the publication of web content lacks an effective monitoring mechanism, a large amount of pornographic information has appeared on the Internet. Effectively suppressing the spread of pornographic information on the network is therefore an important task. According to the government principle of "whoever operates is responsible, whoever provides access is responsible", objectionable information must be filtered, and a large number of Web 2.0 sites face the problem of filtering pornographic information efficiently. Automatic recognition of objectionable speech and video therefore has broad application and industrialization prospects. Objectionable speech recognition can be combined with objectionable video detection to identify objectionable multimedia information, and from a technical standpoint it is an important means of promoting the healthy development of the network environment.
There are many published results on objectionable image and video recognition, but fewer on objectionable speech recognition. The main existing methods are the following:
(1) The first method extracts the audio track from a video file, applies Hamming windowing to divide it into 0.02-second short-time processing frames, and extracts features such as MFCC coefficients from each frame. Short-time energy is then used to separate silent frames from non-silent frames; a single Gaussian model further divides the non-silent frames into four classes: music, speech, mixed speech-music, and ambient sound; finally, a hidden Markov model identifies, among the remaining speech and mixed speech-music frames, the audio frames that may contain pornographic content. (Ji Pengyu, Audio-assisted pornographic video identification, Beijing University of Posts and Telecommunications, Master's thesis, 2011.)
(2) The second method proposes, on top of the MFCC coefficients, a repeated curve-like spectrum feature that captures the constant repetition of speech frequencies, uses it as the feature for objectionable speech recognition, and recognizes objectionable speech with an SVM classifier. (JaeDeok Lim et al., Classification and Detection of Objectionable Sounds Using Repeated Curve-like Spectrum Feature, 2011 International Conference on Information Science and Applications (ICISA), pp. 1-5, 2011.)
The key technologies of objectionable speech recognition are the extraction of objectionable speech features and the choice of classifier. Because an objectionable speech recognition system cannot know in advance under what conditions the input speech was recorded, much objectionable speech contains heavy background sound, such as music; it is therefore of paramount importance that the feature extraction be robust to various kinds of noise. MFCC coefficients are widely used in speech recognition, but they are not the most effective speech feature, and many newer, more robust speech features have been proposed. Regarding the choice of classifier, a single Gaussian model can be used to recognize objectionable speech, but it can only model a single kind of objectionable speech, whereas in practice there are many kinds. Applying a support vector machine directly to objectionable speech recognition faces the difficulty of high computational complexity. A hidden Markov model improves recognition accuracy by modeling the relation between adjacent speech frames, but the conditional probability of normal speech turning into objectionable speech is difficult to model realistically.
Summary of the invention
The object of the invention is to address the low robustness of existing objectionable speech recognition methods, and to provide a method with low algorithmic complexity and better precision that detects objectionable speech segments in a live network voice stream.
The steps of the inventive method are as follows:
Step (1): acquire the input voice stream, decode it into an original speech signal, and perform preprocessing; the preprocessing mainly comprises the following steps:
1) if the input audio is a stereo speech signal, perform mono mixdown, i.e. mix the stereo channels into mono speech;
2) if the sampling rate of the input audio is inconsistent with the method's predefined sampling rate, adjust the sampling rate, i.e. convert the original sampling rate of the audio into the predefined one;
3) if the quantization bit depth of the input audio is inconsistent with the method's predefined bit depth, requantize, i.e. convert the original quantized values into new ones.
Step (2): apply windowing and framing to the preprocessed speech data;
Step (3): extract shifted-delta-cepstral features from every speech frame;
Step (4): classify the shifted-delta-cepstral features with a Gaussian mixture model, whose training samples include objectionable speech segments of many kinds;
Step (5): classify the candidate frames labelled as objectionable speech with a support vector machine, and confirm the final objectionable speech frames;
Step (6): extract and store an objectionable speech segment according to the number of objectionable speech frames within a given time span.
The present invention is an objectionable speech recognition method based on a support vector machine that uses a Gaussian mixture model and a support vector machine together. The Gaussian mixture model enables fast classification to extract candidate objectionable speech frames, while the support vector machine classifier improves classification accuracy. Compared with a single Gaussian model, a Gaussian mixture model can effectively model multiple kinds of objectionable speech. The Gaussian-mixture classifier learns its model parameters from samples, and the support vector machine classifier obtains its support vectors through sample learning. Finally, objectionable speech information is extracted according to the number of objectionable speech frames within a period of time. Tested on a sample speech corpus, the recall of the method exceeds 70%.
Description of drawings
Fig. 1 is a flowchart of the inventive method.
Embodiment
Embodiments of the invention are described in detail below with reference to the accompanying drawing.
Fig. 1 is a flow block diagram representing the processing flow of the objectionable speech recognition method of the invention.
The speech signal handled by this method can come from the audio track of a decoded video stream, or be an independent voice stream. The voice stream may be in any of several coded formats, for example WAV or MP3, as long as audio in that format can be decoded. The method also supports various sampling rates, various quantization bit depths, and stereo input. The present embodiment is described in three parts: frame feature extraction, classifier learning, and real-time recognition. Frame feature extraction is the basis of both classifier learning and real-time classification, and real-time recognition operates once the classifier parameters have been learned.
Preprocessing is a prerequisite step of the recognition; it mainly makes the sampling rate, quantization bit depth, and channel type of the audio consistent with the method's predefined values. In the present embodiment, the predefined sampling rate of the processable speech signal is 16 kHz, the quantization depth is 16 bits, and the channel type is mono. For an input audio signal, the method therefore first checks whether it is mono speech; for a stereo signal, the multiple channels are mixed down into a single-channel audio signal. The method then checks whether the quantization depth of the speech signal is 16 bits: audio with more than 16 bits is compressed to 16 bits, and audio with fewer than 16 bits is expanded to 16 bits. Finally, audio with an inconsistent sampling rate is resampled to obtain an audio signal with a sampling rate of 16 kHz. Preprocessing keeps the audio to be detected consistent with the audio used for classifier training in sampling rate, quantization depth, and channel type, which improves recall.
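The preprocessing described above (mono mixdown, resampling to 16 kHz, requantization to 16 bits) can be sketched as follows. The function name and the linear-interpolation resampler are illustrative assumptions, not part of the patent:

```python
import numpy as np

def preprocess(samples: np.ndarray, rate: int, target_rate: int = 16000) -> np.ndarray:
    """Mono mixdown, resampling and 16-bit requantization (illustrative sketch)."""
    x = samples.astype(np.float64)
    # 1) stereo -> mono: average the channels
    if x.ndim == 2:
        x = x.mean(axis=1)
    # 2) sampling-rate adjustment: simple linear-interpolation resampler
    if rate != target_rate:
        n_out = int(round(len(x) * target_rate / rate))
        t_in = np.arange(len(x)) / rate
        t_out = np.arange(n_out) / target_rate
        x = np.interp(t_out, t_in, x)
    # 3) requantization: scale into the signed 16-bit range
    peak = np.abs(x).max() or 1.0   # guard against all-zero input
    return np.clip(x / peak * 32767, -32768, 32767).astype(np.int16)
```

A production system would use a proper anti-aliased resampler, but the interface matches the three preprocessing sub-steps of Step (1).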
The windowing-and-framing part slices the voice stream: an audio signal of fixed length is cut from the stream at fixed intervals as one frame of the audio stream, and feature extraction and classification are then applied to each extracted frame. To improve the detection rate, the windows partially overlap. The present embodiment cuts a 4-second audio signal every 2 seconds, so adjacent frames overlap by half.
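The half-overlapping framing can be sketched as follows (a hypothetical helper; the window and hop defaults come from the embodiment):

```python
import numpy as np

def split_frames(x: np.ndarray, rate: int = 16000,
                 win_s: float = 4.0, hop_s: float = 2.0) -> list:
    """Cut the stream into fixed-length frames: a 4 s window taken
    every 2 s, so adjacent frames overlap by half."""
    win, hop = int(win_s * rate), int(hop_s * rate)
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
```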
The shifted-delta-cepstral extraction part computes, on the basis of the framing step, the shifted-delta-cepstral (SDC) parameters of every frame. The feature is formed by concatenating the delta cepstra of several speech frames computed on top of the MFCC features (Torres-Carrasquillo P. A. et al., Approaches to Language Identification Using Gaussian Mixture Models and Shifted Delta Cepstral Features, ICSLP-2002, pp. 89-92). It is computed as follows.
Let the N-dimensional MFCC feature of frame t be
c(t) = [c_1(t), c_2(t), ..., c_N(t)].
The delta of the i-th block of frame t is computed as
Δc(t, i) = c(t + iP + d) − c(t + iP − d),
where P is the frame shift between adjacent delta-cepstrum blocks, d is the frame spacing used in the delta computation, and k is the number of delta-cepstrum blocks.
According to the formula above, the k block deltas are concatenated into one shifted-delta-cepstral feature (SDC):
SDC(t) = [Δc(t, 0), Δc(t, 1), ..., Δc(t, k−1)].
In the present embodiment, the parameters N, d, P, k are 7, 1, 3, 7 respectively, giving a 49-dimensional SDC feature; appending the 7-dimensional MFCC feature yields a feature vector of 56 dimensions in total.
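A minimal sketch of the SDC computation, assuming an MFCC matrix is already available (the patent does not specify the MFCC implementation; the function name is illustrative). With N=7, d=1, P=3, k=7, each output row has the 56 dimensions stated above:

```python
import numpy as np

def sdc(mfcc: np.ndarray, d: int = 1, P: int = 3, k: int = 7) -> np.ndarray:
    """Shifted-delta-cepstral features from a (T, N) MFCC matrix.
    Returns (T', N*(k+1)): the static MFCCs plus k stacked deltas,
    computed only for frames where every shifted index is valid."""
    T, N = mfcc.shape
    t_max = T - ((k - 1) * P + d)   # last frame with a full right context
    feats = []
    for t in range(d, t_max):
        # delta of block i: c(t + iP + d) - c(t + iP - d)
        deltas = [mfcc[t + i * P + d] - mfcc[t + i * P - d]
                  for i in range(k)]
        feats.append(np.concatenate([mfcc[t]] + deltas))
    return np.asarray(feats)
```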
The classifier learning part comprises two tasks: learning the Gaussian mixture model, and obtaining the support vectors of the SVM classifier.
1) Gaussian mixture model part
A Gaussian mixture model is a weighted mixture of several Gaussian components. A single Gaussian model is
N(x; μ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)),
where μ is the mean vector, Σ is the covariance matrix, and x is the observed feature vector. The Gaussian mixture model is
p(x | θ) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i),
where w_i is the weight of the i-th Gaussian component. To use the Gaussian mixture model as a classifier for objectionable speech, the parameters
θ = {w_i, μ_i, Σ_i}, i = 1, ..., M
must be learned from objectionable speech samples. The expectation-maximization (EM) algorithm is commonly used for this. The idea of EM is to start from a given parameter set θ and estimate a new θ under which the likelihood of the data is larger; the new θ then becomes the current θ and the computation iterates. Before the algorithm starts, the number of Gaussian components must be fixed; in the present embodiment the component count M is 4. The core of the algorithm is computing the new θ from the current θ, and the update formula of each parameter is as follows.
Let the posterior probability of the i-th Gaussian for sample x_t be
γ_i(x_t) = w_i N(x_t; μ_i, Σ_i) / Σ_{j=1}^{M} w_j N(x_t; μ_j, Σ_j).
The update formula of the component weight is
w_i = (1/T) Σ_{t=1}^{T} γ_i(x_t).
The update formula of the mean vector is
μ_i = Σ_{t} γ_i(x_t) x_t / Σ_{t} γ_i(x_t).
The update formula of the covariance matrix is
Σ_i = Σ_{t} γ_i(x_t) (x_t − μ_i)(x_t − μ_i)^T / Σ_{t} γ_i(x_t).
These update formulas guarantee a monotone increase of the model likelihood; the updates stop once a convergence criterion is reached, yielding the final model parameters θ.
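The EM updates above can be sketched with diagonal covariances (a simplification; the patent does not state the covariance structure, and all names here are illustrative):

```python
import numpy as np

def em_gmm(X: np.ndarray, M: int = 4, iters: int = 50, seed: int = 0):
    """EM fit of a diagonal-covariance GMM with M components,
    following the weight / mean / covariance update formulas."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(M, 1.0 / M)
    mu = X[rng.choice(T, M, replace=False)]          # init means at samples
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))      # init with data variance
    for _ in range(iters):
        # E-step: posterior gamma[t, i] of component i for sample t
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2 / var).sum(-1)
                         + np.log(var).sum(-1) + D * np.log(2 * np.pi))
                 + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)    # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        Nk = gamma.sum(axis=0)
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

At detection time, the mixture likelihood p(x | θ) of a frame's feature vector under the objectionable-speech GMM is compared against a threshold.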
2) SVM classifier part
The SVM classifier is currently among the most effective machine-learning classification methods, particularly when the sample size is not very large. Its basic principle is to map the feature points of a low-dimensional space into a high-dimensional feature space in which the points become linearly separable according to their labels. The decision plane of the SVM classifier depends mainly on the support vectors estimated from the samples.
In the present embodiment, the open-source LibSVM library is used to train the SVM classifier. The adopted feature vector consists of the shifted-delta-cepstral parameters, and a normalization operation is applied both in training and in classification. The positive training samples are objectionable speech frames obtained from objectionable speech; they cover various types of objectionable speech as well as objectionable speech frames recorded under diverse settings. The negative samples are speech frames cut at random from chat speech and songs. For training, the number of positive samples is 2314 and the number of negative samples is 3457. The parameters used to learn the support-vector model are as follows: the misclassification penalty coefficient is set to 4000, and the kernel function is an RBF; cross-validation then yields a gamma of 1.2453. LibSVM training produces 1711 support vectors, of which 513 are positive, and these support vectors and the SVM classifier parameters are saved to a file. For real-time detection, it suffices to load this model to classify in real time.
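The LibSVM training step can be approximated with scikit-learn's SVC, which wraps LibSVM. The feature arrays below are random stand-ins for the real 56-dimensional SDC features, while C=4000, the RBF kernel, and gamma=1.2453 follow the embodiment:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data: random vectors replace real SDC features.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (200, 56))   # "objectionable" frames
X_neg = rng.normal(-1.0, 1.0, (300, 56))  # chat / song frames
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), np.zeros(300)])

scaler = StandardScaler().fit(X)          # the "normalization operation"
clf = SVC(C=4000, kernel="rbf", gamma=1.2453).fit(scaler.transform(X), y)
print(clf.n_support_)                     # support-vector counts per class
```

Saving `scaler` and `clf` corresponds to storing the support vectors and classifier parameters in a file for later real-time loading.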
The real-time recognition part operates after the classifier models learned in advance have been loaded. The input speech is preprocessed and its shifted-delta-cepstral features are extracted, and the features are then fed to the classifiers. From the input SDC feature vector, the Gaussian mixture model computes, according to the mixture formula p(x | θ) = Σ_{i=1}^{M} w_i N(x; μ_i, Σ_i), the probability that the feature vector belongs to the objectionable-speech Gaussian mixture model.
The present embodiment uses a probability threshold to decide whether an input speech frame is objectionable speech information.
After the Gaussian-mixture classification, the frames that are candidates for objectionable speech are classified again by the SVM classifier, which decides whether each input SDC feature vector is objectionable speech. The method then counts the number of objectionable speech frames confirmed by the SVM classifier within a given time span, and uses a threshold (th) to decide whether the speech segment is objectionable. In the present embodiment, the given time span is set to 30 seconds and the threshold th to 8. If the segment is objectionable, it is encoded in WAV format and stored for subsequent manual verification.
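The final segment decision (counting SVM-positive frames within a 30-second span against the threshold th=8) might look like this; the 15-frame window assumes the 2-second hop from the framing step, and the function name is illustrative:

```python
from collections import deque

def segment_decision(frame_flags, window_frames: int = 15, th: int = 8) -> list:
    """Sliding count of SVM-positive frames: with a 2 s frame hop,
    15 frames cover the 30 s span of the embodiment. A segment is
    flagged when at least th of the last window_frames frames were
    judged objectionable. `frame_flags` is an iterable of booleans."""
    window = deque(maxlen=window_frames)
    alarms = []
    for flag in frame_flags:
        window.append(bool(flag))
        alarms.append(sum(window) >= th)
    return alarms
```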

Claims (5)

1. An objectionable speech recognition method based on a support vector machine, characterized in that the method comprises the following steps:
Step 1: acquiring an input voice stream, decoding the voice stream into an original speech signal, and performing preprocessing operations;
Step 2: applying windowing and framing to the preprocessed speech data;
Step 3: extracting shifted-delta-cepstral features from every speech frame;
Step 4: classifying the shifted-delta-cepstral features with a Gaussian mixture model;
Step 5: classifying the candidate frames labelled as objectionable speech with an SVM classifier, and confirming the final objectionable speech frames;
Step 6: extracting and storing an objectionable speech segment according to the number of objectionable speech frames within a given time span.
2. The objectionable speech recognition method of claim 1, characterized in that the preprocessing operations comprise mono mixdown, sampling-rate adjustment, and requantization.
3. The objectionable speech recognition method of claim 1, characterized in that the features used by the Gaussian mixture model in step 4 are the shifted-delta-cepstral parameters.
4. The objectionable speech recognition method of claim 1, characterized in that the feature vectors used by the SVM classifier in step 5 are the shifted-delta-cepstral parameters.
5. The objectionable speech recognition method of claim 1, characterized in that the number of objectionable speech frames in step 6 is counted over a number of frames preceding the current detection position.
CN2012101973771A 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine Pending CN102708861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101973771A CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012101973771A CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Publications (1)

Publication Number Publication Date
CN102708861A true CN102708861A (en) 2012-10-03

Family

ID=46901563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101973771A Pending CN102708861A (en) 2012-06-15 2012-06-15 Poor speech recognition method based on support vector machine

Country Status (1)

Country Link
CN (1) CN102708861A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038804A (en) * 2013-03-05 2014-09-10 三星电子(中国)研发中心 Subtitle synchronization device and subtitle synchronization method based on speech recognition
CN104616658A (en) * 2015-01-14 2015-05-13 重庆金美通信有限责任公司 Echo canceling implementing method supporting a plurality of voice coding systems
CN106231409A (en) * 2016-08-05 2016-12-14 黄新勇 Method for real-time monitoring in the radio network of audio frequency and system
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106782505A (en) * 2017-02-21 2017-05-31 南京工程学院 A kind of method based on electric discharge voice recognition high-tension switch cabinet state
CN109147769A (en) * 2018-10-17 2019-01-04 北京猎户星空科技有限公司 A kind of Language Identification, device, translator, medium and equipment
CN110597139A (en) * 2019-09-25 2019-12-20 珠海格力电器股份有限公司 Self-learning method and system of cooking appliance
CN110853648A (en) * 2019-10-30 2020-02-28 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PEDRO A. TORRES-CARRASQUILLO1,ET AL.: "Approaches to Language Identification using Gaussian Mixture Models and Shifted Delta Cepstral Features", 《PROC.INT’L. CONF. SPOKEN LANGUAGE PROCESSING》 *
姬鹏宇: "音频辅助的色***检测", 《中国科技论文在线》 *
张素文等: "基于SVM和GMM的视频运动对象分割算法", 《武汉理工大学学报•信息与管理工程版》, vol. 31, no. 6, 31 December 2009 (2009-12-31), pages 857 - 860 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038804B (en) * 2013-03-05 2017-09-29 三星电子(中国)研发中心 Captioning synchronization apparatus and method based on speech recognition
CN104038804A (en) * 2013-03-05 2014-09-10 三星电子(中国)研发中心 Subtitle synchronization device and subtitle synchronization method based on speech recognition
CN104616658A (en) * 2015-01-14 2015-05-13 重庆金美通信有限责任公司 Echo canceling implementing method supporting a plurality of voice coding systems
CN106297770B (en) * 2016-08-04 2019-11-22 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106231409A (en) * 2016-08-05 2016-12-14 黄新勇 Method for real-time monitoring in the radio network of audio frequency and system
CN106782505A (en) * 2017-02-21 2017-05-31 南京工程学院 A kind of method based on electric discharge voice recognition high-tension switch cabinet state
CN109147769A (en) * 2018-10-17 2019-01-04 北京猎户星空科技有限公司 A kind of Language Identification, device, translator, medium and equipment
CN110597139A (en) * 2019-09-25 2019-12-20 珠海格力电器股份有限公司 Self-learning method and system of cooking appliance
CN110853648A (en) * 2019-10-30 2020-02-28 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110853648B (en) * 2019-10-30 2022-05-03 广州多益网络股份有限公司 Bad voice detection method and device, electronic equipment and storage medium
CN110910865A (en) * 2019-11-25 2020-03-24 秒针信息技术有限公司 Voice conversion method and device, storage medium and electronic device
CN111445924A (en) * 2020-03-18 2020-07-24 中山大学 Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121003