CN103021421A

CN103021421A - Multilevel screening detecting recognizing method for shots

Info

Publication number: CN103021421A
Application number: CN2012105740037A
Authority: CN
Inventors: 张涛; 苏春玲; 陈志�; 王晓晨; 蔡晓
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2012-12-24
Filing date: 2012-12-24
Publication date: 2013-04-03

Abstract

Disclosed is a multilevel screening detection and recognition method for gunshots. The method includes: selecting a single-gunshot template signal and dividing it into frames; extracting the MFCC (Mel-frequency cepstral coefficient) feature coefficients of the template signal; selecting a signal under test and dividing it into frames; calculating the short-time energy and short-time average zero-crossing rate of the current frame of the signal under test and judging whether the frame is valid; when the number of consecutive valid frames equals 3/2 of the number of frames in the template signal, taking the first 2/2 of the consecutive valid frames as a target segment and letting the remaining 1/2 take part in the judgment of the next frame; extracting MFCC feature coefficients from the frames in the target segment; and, if the matching distance between the MFCC feature coefficients of the template signal and those of the signal under test is smaller than a threshold obtained by training, judging the target segment to be a target signal, and otherwise judging that it is not. Time-domain feature parameters, MFCC and DTW (dynamic time warping) are combined so that both the computational load of the system and the recognition rate are taken into account.

Description

Multistage screening detection and recognition method for gunshots

Technical field

The present invention relates to a gunshot detection and recognition method, and in particular to a multistage screening detection and recognition method for gunshots.

Background technology

Sound is ubiquitous, and the detection and recognition of sound has always been an important topic in acoustic research. Sound detection and recognition can be divided into two areas: non-speech recognition systems and speech recognition systems. Research on the detection and recognition of speech is more advanced, and relatively mature systems and methods already exist. Research on non-speech sounds can borrow the algorithms and techniques developed for speech; both kinds of system essentially consist of a feature parameter extraction algorithm and a pattern matching algorithm.

For feature parameter extraction, many feature parameters can be used for detection. They can be classified into three groups: time domain, frequency domain, and homomorphic (cepstral). Time-domain feature parameters include the short-time signal energy, the short-time average zero-crossing rate, the short-time autocorrelation function, and the average magnitude difference function. Their extraction algorithms are all simple, but their ability to discriminate between signals is limited; they are mainly used for endpoint detection and speech framing. Frequency-domain feature parameters include the Fourier transform, the discrete cosine transform, and linear prediction analysis. Frequency-domain feature parameters bear some relation to the human auditory system, but they suit additive signals and handle complex multiplicative (product-combined) signals poorly. Homomorphic feature parameters include linear prediction cepstral coefficients and Mel-frequency cepstral coefficients (MFCC). Nonlinear systems are very difficult to analyze directly, so homomorphic analysis is used to convert the nonlinear problem into a linear one.

For pattern matching and model training, the main techniques are dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks. Of these three, DTW is an early pattern matching and model training technique. It uses dynamic programming to solve the problem of comparing speech feature parameter sequences of unequal duration. Its algorithmic complexity is low and its recognition rate is good in certain applications, particularly isolated-word speech recognition.

For detecting the sound of sudden events such as gunshots, the input signal resembles an isolated word in speech, and the system needs few matching templates. For this kind of recognition, the DTW and HMM algorithms give similar recognition results under the same environmental conditions, but the HMM algorithm is considerably more complex. This shows mainly in the training stage, where the HMM algorithm needs a large amount of speech data and obtains its model parameters only through repeated computation, whereas training the DTW algorithm requires almost no additional computation. The DTW algorithm is therefore well suited, in both algorithmic complexity and recognition rate, to recognizing sounds whose input signal is short, resembles an isolated sound, and has few templates, and it can achieve good results.

Summary of the invention

The technical problem to be solved by the present invention is to provide a multistage screening detection and recognition method for gunshots that can detect gunshots in public places quickly and accurately.

The technical solution adopted by the present invention is a multistage screening detection and recognition method for gunshots, comprising the following steps:

1) select a sampling frequency in the range 8 kHz to 48 kHz, choose a template signal of a single gunshot corresponding to this sampling frequency, and divide the template signal into frames;

2) extract the MFCC cepstral feature coefficients of the template signal;

3) choose a signal under test whose sampling frequency is the same as that in step 1), and divide it into frames using the same number of points per frame as for the template signal in step 1);

4) calculate the short-time energy and the short-time average zero-crossing rate of the current frame of the signal under test. If either of the two satisfies its corresponding decision condition, take the current frame as a valid frame, save it, and go to step 5). If neither satisfies its condition but one of the three frames preceding the current frame does, smooth the current frame to a valid frame as well, save it, and go to step 5). If none of the three preceding frames satisfies the condition, the current frame is an invalid frame; go to step 6);

5) when the number of consecutive valid frames equals 3/2 of the number of frames in the template signal, take the first 2/2 of these consecutive valid frames, which has the same number of frames as the template signal, as the target segment and go to step 7); the remaining 1/2 returns to step 4) to take part in the judgment of the next frame;

6) when the number of valid frames saved before this invalid frame satisfies (1/2 of the template frame count) < (number of valid frames) < (3/2 of the template frame count), take these consecutive valid frames as the target segment and go to step 7); otherwise clear the saved data and return to step 4);

7) extract the MFCC cepstral feature coefficients from the frames in the target segment. If the matching distance between the MFCC feature coefficients of the template signal and those of the signal under test is smaller than the threshold obtained by training, judge the target segment to be a target signal; otherwise judge that the target segment is not a target signal.

Each frame of the template signal in step 1) contains 256 to 1024 points.

The corresponding decision condition in step 4) is that the short-time energy of each frame is greater than a set minimum short-time energy, and that the short-time average zero-crossing rate of each frame lies within a set range.

Returning the remaining 1/2 to step 4) to take part in the judgment of the next frame, as described in step 5), means treating the consecutive valid frames of this 1/2 as the frames preceding the current frame in step 4).

By screening in multiple stages and setting multiple decision thresholds, the multistage screening detection and recognition method for gunshots of the present invention combines time-domain feature parameters, cepstral feature parameters, and the DTW algorithm, balancing the computational load of the system against the recognition rate. The detection algorithm of the present invention requires far less computation than the MFCC & DTW algorithm, and its detection accuracy is much higher than that of an algorithm that uses only short-time energy combined with the short-time average zero-crossing rate. The invention can be applied to alarm systems for gunshot detection in public places; its low computational load makes it easy to implement on hardware platforms, and its good robustness ensures accurate and effective detection.

Description of drawings

Fig. 1 is a schematic diagram of partial detection results obtained with the parameters adopted by the present invention;

Fig. 2 is a schematic diagram of partial detection results when the miss rate increases and the false detection rate decreases;

Fig. 3 is a schematic diagram of partial detection results when the miss rate decreases and the false detection rate increases.

In the figures, solid boxes represent manual annotations and dashed boxes represent the detection results of the algorithm.

Embodiment

The multistage screening detection and recognition method for gunshots of the present invention is described in detail below with reference to an embodiment and the accompanying drawings.

The multistage screening detection and recognition method of the present invention detects gunshots in public places. Because gunshots occur rarely in public places, the collected signal can be examined in stages: short-time energy and the short-time average zero-crossing rate are first used for first-stage detection, the results that satisfy the conditions then undergo second-stage detection, and finally the second-stage results undergo third-stage detection.

The multistage screening detection and recognition method for gunshots of the present invention comprises the following steps:

1) select a sampling frequency in the range 8 kHz to 48 kHz, choose a template signal of a single gunshot corresponding to this sampling frequency, and divide the template signal into frames; each frame of the template signal contains 256 to 1024 points.

The template signal is chosen at a sampling frequency fs (48 kHz) with 16-bit quantization, and a fixed number of sampling points (1024) is taken as one frame, so that the template signal is divided into multiple frames.
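The framing described above can be sketched as follows. This is only a minimal illustration in Python, not the patent's implementation: the function name split_into_frames is hypothetical, and the use of non-overlapping frames with no window function, and with any trailing partial frame discarded, is an assumption, since the text specifies only the frame length.

```python
def split_into_frames(samples, frame_len=1024):
    """Split a sample sequence into consecutive frames of frame_len points.

    Assumptions (not stated in the source): frames do not overlap, no window
    function is applied, and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


if __name__ == "__main__":
    # A signal slightly longer than 11 frames of 1024 points yields 11 frames.
    signal = [0] * (11 * 1024 + 5)
    print(len(split_into_frames(signal)))
```

The same function would be applied to the signal under test so that both signals use the same number of points per frame.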

The N-th order MFCC cepstral feature coefficients (N is usually 12) are obtained for each frame of the template signal. In the prior art, MFCC feature coefficients are extracted using the computing methods given in Wang Bingxi, Qu Dan, Peng Xuan. Practical Fundamentals of Speech Recognition [M]. National Defense Industry Press, 2005, and Li Fuhai, Ma Jinwen, Huang Dezhi. MFCC and SVM Based Recognition of Chinese Vowels [C] // CIS 2005, Part II, LNAI 3802. [s.l.]: [s.n.], 2005: 812-819. The extraction process is roughly as follows. First, a discrete Fourier transform is applied to the framed sound signal to obtain its spectral distribution. The squared magnitude of the spectrum then gives the energy spectrum. The energy spectrum is passed through a bank of triangular filters on the Mel scale, the logarithmic energy S(m) of each filter output is computed, and a discrete cosine transform then yields the MFCC feature coefficients.
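The extraction chain just outlined (DFT, energy spectrum, Mel-scale triangular filters, logarithm, DCT) can be sketched in plain Python as below. This is a hedged illustration rather than the method of the cited references: the number of filters (24), the common Mel formula 2595·log10(1 + f/700), the absence of pre-emphasis and windowing, the small floor inside the logarithm, and keeping DCT coefficients 1 to N are all assumptions, and the function names are hypothetical. The naive DFT is used only for clarity; a real system would use an FFT.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Squared DFT magnitude for bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * fs / n_fft  # centre frequency of DFT bin k
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, fs=48000, n_filters=24, n_coeffs=12):
    """Return n_coeffs MFCC values for one frame (assumed parameters)."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), fs)
    # Log energy S(m) of each filter output; the floor avoids log(0).
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10)) for row in bank]
    # DCT-II of the log energies, keeping coefficients 1..n_coeffs.
    return [
        sum(s * math.cos(math.pi * c * (m + 0.5) / n_filters) for m, s in enumerate(log_e))
        for c in range(1, n_coeffs + 1)
    ]
```

Calling mfcc on each frame of the template gives one 12-element vector per frame, and the resulting sequence of vectors is what a later matching stage would compare.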

4) calculate the short-time energy and the short-time average zero-crossing rate of the current frame of the signal under test. If either of the two satisfies its corresponding decision condition, take the current frame as a valid frame, save it, and go to step 5). If neither satisfies its condition but one of the three frames preceding the current frame does, smooth the current frame to a valid frame as well, save it, and go to step 5). If none of the three preceding frames satisfies the condition, the current frame is an invalid frame; go to step 5);

The corresponding decision condition is that the short-time energy of each frame is greater than a set minimum short-time energy, and that the short-time average zero-crossing rate of each frame lies within a set range.

For example, let the short-time energy of each frame be energy and the short-time average zero-crossing rate of each frame be zcr_num; set the minimum threshold of the short-time energy to EN_MIN and the lower and upper thresholds of the short-time average zero-crossing rate to ZCR1 and ZCR2 respectively. When energy > EN_MIN or ZCR1 < zcr_num < ZCR2, the current frame is taken as a valid frame and saved. When neither condition is satisfied but one of the three frames preceding the current frame satisfies the condition, the current frame is also smoothed to a valid frame and saved.
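The first-stage decision with smoothing can be sketched as follows. The names energy, zcr_num, EN_MIN, ZCR1 and ZCR2 come from the text; everything else is an assumption made for illustration. In particular, the energy is taken here as the plain sum of squared samples and the zero-crossing measure as the count of sign changes in the frame, because the text does not give the exact definitions or scaling, and the smoothing looks back at the unsmoothed decisions of the three preceding frames.

```python
def short_time_energy(frame):
    # Assumed definition: sum of squared samples (scaling not given in the text).
    return sum(x * x for x in frame)


def zero_crossing_count(frame):
    # Assumed definition: number of sign changes between adjacent samples.
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def label_valid_frames(frames, en_min, zcr1, zcr2, lookback=3):
    """Return one boolean per frame: True if the frame is kept as valid."""
    raw = []    # frames that satisfy the decision condition themselves
    valid = []  # raw decision plus smoothing
    for frame in frames:
        energy = short_time_energy(frame)
        zcr_num = zero_crossing_count(frame)
        ok = energy > en_min or zcr1 < zcr_num < zcr2
        # Smoothing: any of the previous `lookback` frames met the condition.
        smoothed = any(raw[-lookback:])
        raw.append(ok)
        valid.append(ok or smoothed)
    return valid
```

Looking back at raw decisions rather than smoothed ones keeps the smoothing bounded: one qualifying frame can extend a run by at most three frames, so a run of valid frames ends once the signal stays quiet.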

5) when the number of consecutive valid frames equals 3/2 of the number of frames in the template signal, take the first 2/2 of these consecutive valid frames, which has the same number of frames as the template signal, as the target segment and go to step 6); the remaining 1/2 returns to step 4) to take part in the judgment of the next frame. Returning the remaining 1/2 to step 4) means treating the consecutive valid frames of this 1/2 as the frames preceding the current frame in step 4).

When the number of valid frames saved before an invalid frame satisfies (1/2 of the template frame count) < (number of valid frames) < (3/2 of the template frame count), these consecutive valid frames are taken as the target segment and the method goes to step 6); otherwise the saved data is cleared and the method returns to step 4);

For example, let the number of consecutive valid frames be fra_num, the number of template frames be tem_num, and the minimum threshold on the number of consecutive valid frames be FRA_MIN. When fra_num < FRA_MIN, the corresponding frames are judged invalid and the saved data is cleared. When fra_num reaches tem_num + FRA_MIN, the first tem_num frames are taken as a target segment and passed to the next stage of detection, while the last FRA_MIN frames serve as the leading frames of the next segment. When FRA_MIN < fra_num < tem_num + FRA_MIN, the frames are passed directly to the next stage of analysis as a target segment.
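The second-stage segmentation rule can be sketched as a scan over per-frame validity flags. The names fra_num, tem_num and FRA_MIN follow the text; the function name and the return format (half-open frame index ranges) are hypothetical. Two cases are not specified in the text and are handled here by assumption: a run of exactly FRA_MIN frames is cleared, and the end of the input is treated like an invalid frame.

```python
def find_target_segments(valid, tem_num, fra_min):
    """Return (start, end) frame ranges, end exclusive, passed to the next stage."""
    segments = []
    start = None  # index of the first frame in the current run of valid frames
    for i, is_valid in enumerate(valid):
        if is_valid:
            if start is None:
                start = i
            fra_num = i - start + 1
            if fra_num == tem_num + fra_min:
                # First tem_num frames form a target segment; the last
                # fra_min frames become the leading frames of the next run.
                segments.append((start, start + tem_num))
                start += tem_num
        else:
            if start is not None:
                fra_num = i - start
                if fra_min < fra_num < tem_num + fra_min:
                    segments.append((start, i))
                # Otherwise the saved frames are discarded (assumed to
                # include the unspecified case fra_num == fra_min).
            start = None
    # Assumption: end of input is handled like an invalid frame.
    if start is not None and fra_min < len(valid) - start < tem_num + fra_min:
        segments.append((start, len(valid)))
    return segments


if __name__ == "__main__":
    flags = [False] * 2 + [True] * 17 + [False] * 3 + [True] * 8 + [False]
    print(find_target_segments(flags, tem_num=11, fra_min=6))
```

With tem_num=11 and FRA_MIN=6, a run of 17 valid frames yields one 11-frame segment and leaves 6 frames to seed the next run, which matches the 3/2, 2/2 and 1/2 proportions in the step description only approximately, since 11/2 is not an integer.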

6) extract the MFCC cepstral feature coefficients from the frames in the target segment. If the matching distance between the MFCC feature coefficients of the template signal and those of the signal under test is smaller than the threshold obtained by training, judge the target segment to be a target signal; otherwise judge that the target segment is not a target signal.

That is, let the matching distance between the MFCC feature coefficients of the template signal and the MFCC feature coefficients of the candidate target segment detected in the second stage be dist, and let the threshold obtained by training be GUN_MAX. When dist < GUN_MAX, the segment is judged to be the target event, a gunshot; otherwise it is judged to be a non-target event.
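The third-stage decision can be sketched as a DTW matching distance followed by the threshold test dist < GUN_MAX. The text does not specify the DTW variant, so this sketch assumes a Euclidean local distance, the basic three-direction step pattern, and no path-length normalization or warping window; the function names are hypothetical. Because of these assumptions, the GUN_MAX values quoted in the experiments would not necessarily be on the same scale as the distances computed here.

```python
import math


def dtw_distance(template, segment):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(template), len(segment)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost aligning template[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], segment[j - 1])  # Euclidean local distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def is_gunshot(template_mfcc, segment_mfcc, gun_max):
    """Apply the decision rule dist < GUN_MAX."""
    dist = dtw_distance(template_mfcc, segment_mfcc)
    return dist < gun_max
```

Because DTW aligns sequences of different lengths, the template (tem_num frames) can be compared directly with a target segment that is shorter or longer than the template.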

Because there is a minimum requirement on the number of consecutive valid frames, a smoothing mechanism is used when judging valid frames in step 4) (the first stage) in order to avoid missed detections and reduce the miss rate: a valid frame can smooth the three invalid frames immediately following it into valid frames. This effectively guarantees the length of the target segment, greatly reduces the miss rate, and raises the detection accuracy of the present invention.

The experiment uses a clean single gunshot as the template. The template signal has 11 frames (tem_num=11), the sampling frequency is 48000 Hz, each sample is 16 bits, and each frame contains 1024 sampling points.

The signal under test is a continuous sound signal recorded in a complex environment containing music, speech, car braking and similar sounds. It has 1953 frames and was both manually annotated and processed by the program. The parameters were set to EN_MIN=53, ZCR1=65, ZCR2=100, FRA_MIN=6, GUN_MAX=4525. Partial detection results are shown in Fig. 1. From the statistics of the detection results, the total number of missed frames is 87, so the miss rate is

α = \frac{87}{1953} \times 100 % = 4.45 % .

The total number of falsely detected frames is 237, so the false detection rate is

β = \frac{237}{1953} \times 100 % = 12.14 % .

Different miss rates and false detection rates can be obtained by setting different parameter thresholds. The miss rate and the false detection rate trade off against each other and cannot both be optimal at once; the parameters must be chosen to suit the situation at hand. If EN_MIN=55, ZCR1=68, ZCR2=95, FRA_MIN=6, GUN_MAX=4520, the miss rate α increases and the false detection rate β decreases. Partial detection results are shown in Fig. 2. From the statistics of the detection results, the total number of missed frames is 203, so the miss rate α=10.39%; the total number of falsely detected frames is 152, so the false detection rate β=7.78%. If EN_MIN=50, ZCR1=60, ZCR2=105, FRA_MIN=6, GUN_MAX=4530, the miss rate α decreases and the false detection rate β increases. Partial detection results are shown in Fig. 3. From the statistics of the detection results, the total number of missed frames is 82, so the miss rate α=4.20%; the total number of falsely detected frames is 268, so the false detection rate β=13.72%.
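The percentages quoted for α and β follow from dividing the reported frame counts by the 1953 frames of the signal under test. The short check below reproduces that arithmetic; the function name rate_percent is hypothetical, and the frame counts are the ones stated in the text.

```python
def rate_percent(frame_count, total_frames):
    """Frame count as a percentage of all frames, rounded to two decimals."""
    return round(100.0 * frame_count / total_frames, 2)


TOTAL_FRAMES = 1953

# (missed frames, falsely detected frames) for the three parameter settings
results = {
    "GUN_MAX=4525": (87, 237),
    "GUN_MAX=4520": (203, 152),
    "GUN_MAX=4530": (82, 268),
}

for name, (missed, false_hits) in results.items():
    alpha = rate_percent(missed, TOTAL_FRAMES)
    beta = rate_percent(false_hits, TOTAL_FRAMES)
    print(f"{name}: alpha={alpha:.2f}%  beta={beta:.2f}%")
```

The three settings illustrate the trade-off described above: the stricter thresholds lower β at the cost of a higher α, and the looser thresholds do the opposite.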

As the above experiments show, the present invention requires far less computation than the traditional MFCC & DTW algorithm and, by detecting target segments in the first stage (step 4)) and the second stage (steps 5) and 6)), locates the endpoints of the gunshot well, which makes the matching result more accurate and the recognition rate higher. Because a gunshot is a danger signal, missed detections of this sound have the greater impact on safety, and the experimental results show that the detection results of the present invention lean toward judging non-target signals as target signals rather than the reverse. The present invention is therefore not only easy to port to and implement on hardware such as DSP and ARM, but also has a degree of robustness, ensuring accurate and effective detection.

Claims

1. A multistage screening detection and recognition method for gunshots, characterized in that it comprises the following steps:

2. The multistage screening detection and recognition method for gunshots according to claim 1, characterized in that each frame of the template signal in step 1) contains 256 to 1024 points.

3. The multistage screening detection and recognition method for gunshots according to claim 1, characterized in that the corresponding decision condition in step 4) is that the short-time energy of each frame is greater than a set minimum short-time energy, and that the short-time average zero-crossing rate of each frame lies within a set range.

4. The multistage screening detection and recognition method for gunshots according to claim 1, characterized in that returning the remaining 1/2 to step 4) to take part in the judgment of the next frame, as described in step 5), means treating the consecutive valid frames of this 1/2 as the frames preceding the current frame in step 4).