CN108074585A - A kind of voice method for detecting abnormality based on sound source characteristics - Google Patents
A kind of voice method for detecting abnormality based on sound source characteristics
- Publication number
- CN108074585A CN108074585A CN201810126670.6A CN201810126670A CN108074585A CN 108074585 A CN108074585 A CN 108074585A CN 201810126670 A CN201810126670 A CN 201810126670A CN 108074585 A CN108074585 A CN 108074585A
- Authority
- CN
- China
- Prior art keywords
- voice
- glottis
- sound source
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 36
- 230000005856 abnormality Effects 0.000 title claims abstract description 16
- 210000004704 glottis Anatomy 0.000 claims abstract description 49
- 238000001514 detection method Methods 0.000 claims abstract description 11
- 238000012545 processing Methods 0.000 claims description 9
- 238000001228 spectrum Methods 0.000 claims description 6
- 230000005284 excitation Effects 0.000 claims description 4
- 238000009432 framing Methods 0.000 claims description 4
- 230000004044 response Effects 0.000 claims description 3
- 210000001260 vocal cord Anatomy 0.000 abstract description 8
- 238000004519 manufacturing process Methods 0.000 abstract description 6
- 230000002159 abnormal effect Effects 0.000 abstract description 3
- 238000004458 analytical method Methods 0.000 abstract description 3
- 238000000605 extraction Methods 0.000 abstract description 3
- 238000005516 engineering process Methods 0.000 abstract description 2
- 238000012706 support-vector machine Methods 0.000 description 10
- 230000008569 process Effects 0.000 description 5
- 230000000694 effects Effects 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 3
- 238000012549 training Methods 0.000 description 3
- 210000004556 brain Anatomy 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 208000019901 Anxiety disease Diseases 0.000 description 1
- 230000036506 anxiety Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 230000002950 deficient Effects 0.000 description 1
- 238000000151 deposition Methods 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 230000008909 emotion recognition Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000004615 ingredient Substances 0.000 description 1
- 230000002045 lasting effect Effects 0.000 description 1
- 230000035800 maturation Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000004118 muscle contraction Effects 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 210000000056 organ Anatomy 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 108090000765 processed proteins & peptides Proteins 0.000 description 1
- 230000005855 radiation Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 230000000638 stimulation Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Emergency Alarm Devices (AREA)
Abstract
The invention discloses a voice abnormality detection method based on sound-source characteristics, comprising the following steps: collect voice data in real time through a sensor; pre-process the obtained speech segments; for the voice data of each speech segment, obtain the glottal wave signal by iterative adaptive inverse filtering; extract characteristic parameters from the glottal wave signal, namely the normalized amplitude quotient and the glottal closed-phase ratio; feed the extracted features into a trained SVM model for classification; obtain the classification label, use it to judge the speaker's condition, output the speaker-condition label, and pass it to an execution module for feedback. The characteristic of the invention is that, for stressed speech produced under pressure, it abandons the traditional linear speech production model and recognition methods that extract acoustic parameters lacking physical meaning; instead it establishes a sound-source estimation model and, using the inverse-filtering technique of speech production, analyzes and extracts characteristic parameters of human vocal-fold vibration to detect abnormal speech.
Description
Technical field
The present invention relates to a voice abnormality detection method based on sound-source characteristics, and belongs to the field of intelligent speech technology.
Background technology
Stress is the body's natural reaction to physical, psychological, or emotional stimulation. When we are subjected to such stimulation, the brain releases hormone- and peptide-like substances into the body, causing a nervous stress reaction. This kind of sustained anxiety about work is reflected in the vocal organs, causing changes in a series of parameters such as the audible frequency and the speaking rate. These changes are of great significance in various fields of speech signal processing, such as stressed-speech recognition and emotion recognition.
An important manifestation of stress is the speaker's voice: stress variation is a significant factor influencing speech production. When the surrounding environment or the speaker's own condition changes abnormally, or when the user is concentrating on some task and speech recognition serves as an auxiliary tool for other work, the existence of work pressure subjects the speaker to stress. This strongly affects the interlocutor's pronunciation, produces an abnormal state, and causes speech variation; the abnormal state is embodied in the speaker's voice and forms a speech signal under a stress-induced abnormal state.
However, stressed speech under pressure, especially under multi-task cognitive load, has low acoustic discriminability; general acoustic features cannot classify it correctly and lack stability and robustness. Moreover, during its production, stressed speech differs significantly from normal speech in its sound-source characteristics. Therefore, in the detection process, we use sound-source characteristics to improve the reliability of stressed-speech classification; by improving the labeling efficiency of stressed speech, a foundation is laid for strongly robust speech recognition systems.
Summary of the invention
The problem to be solved by the invention is to detect the stress state from the sound-source perspective of speech production, proposing a stress detection method based on speech-production modeling. The characteristic of the invention is that it abandons the traditional linear speech production model and recognition methods based on acoustic parameters lacking physical meaning; it establishes a sound-source estimation model and, using the inverse-filtering technique of speech production, analyzes and extracts characteristic parameters of human vocal-fold vibration to detect abnormal speech.
The technical solution of the invention is as follows:
A voice abnormality detection method based on sound-source characteristics comprises the following steps:
(1) collect voice data in real time through a sensor;
(2) distinguish the speech segments and noise segments of the voice data by endpoint detection, to decide whether to proceed with the next speech-signal processing step;
(3) frame and window the voice data of the obtained speech segments, and apply high-frequency pre-emphasis to each frame;
(4) for the voice data of the speech segments, obtain the glottal wave signal by iterative adaptive inverse filtering;
(5) extract the glottal-wave characteristic parameters: the normalized amplitude quotient and the glottal closed-phase ratio;
(6) feed the extracted data into a trained SVM model for classification;
(7) obtain the classification label, use it to judge the speaker's condition, output the speaker-condition label, and pass it to the execution module for feedback.
In the above step (3), a Hamming window is used to window each speech frame.
In the above step (3), the high-frequency pre-emphasis boosts the high-frequency part of each frame through a first-order finite impulse response high-pass filter.
In the above step (4), the glottal wave signal is obtained as follows:
(a) establish a vocal-tract model using iterative adaptive inverse filtering; the iterative adaptive inverse filtering removes the influence of the glottal excitation from the spectrum of the original speech signal;
(b) then remove the influence of the formants by inverse filtering; specifically, an acoustic model is established by linear predictive coding and the discrete all-pole model, and the glottal wave signal is finally obtained by inverse filtering.
In the above step (5), the glottal-wave characteristic parameter, the normalized amplitude quotient, is extracted as follows:
NAQ = AQ / T = f_ac / (dpeak · T)    (1)
where NAQ is the normalized amplitude quotient; T is the pitch period; AQ is the amplitude quotient, the ratio of the maximum glottal-wave amplitude to the maximum negative peak of its first derivative:
AQ = f_ac / dpeak    (2)
where f_ac is the maximum peak value of the glottal wave, and dpeak is the maximum negative peak of the first derivative of the glottal wave.
In the above step (5), the glottal closed-phase ratio is extracted as follows:
CPR = CP / O    (3)
where CPR is the glottal closed-phase ratio; CP is the duration of the glottal closed phase; O is the total glottal opening time.
The beneficial effects achieved by the invention are:
Building on research into how the phonatory physiological system varies under the influence of stress, the invention studies the intrinsic link between physiological features and sound-source parameters, and verifies that glottal-wave characteristics can reflect important factors of the stress state, so that the obtained glottal-wave parameters not only have theoretical grounding but also specific physical meaning. By finding glottal-wave parameters that describe the stress-related sound-source characteristics of the phonation system, the invention establishes the intrinsic link between sound-source and physiological characteristics; these features correlate with stress-variation factors, indicate the vibration mode of the vocal folds, and carry physical meaning, so that the detection of abnormal voice states improves the precision and reliability of the speech recognition system.
The invention can be applied to the in-vehicle environment: the stress state of the driver and passengers is judged by detecting their voice data, the status information is fed back to the execution module through a transmission device, and the execution module then takes effective measures automatically, such as reminding the driver to be careful or notifying nearby vehicles to take avoiding action via the Internet of Vehicles, thereby achieving the purpose of protecting life and property.
Description of the drawings
Fig. 1 is the basic flow chart of the invention;
Fig. 2 is the basic flow chart for obtaining the SVM classification model;
Fig. 3 is the iterative adaptive inverse filtering (IAIF) architecture established by the invention;
Fig. 4 shows the ROC curves of the five parameters in embodiment 1;
Fig. 5 lists the ROC-curve statistics of the five features in embodiment 1, where AUC is the area under the curve, SE is the standard error, and CL is the confidence interval;
Fig. 6 shows the average recognition rate of the classifiers obtained over 50 rounds of experiments in embodiment 1.
Specific embodiments
The invention is further described below with reference to the accompanying drawings. The following embodiments are only used to illustrate the technical solution of the invention more clearly, and are not intended to limit its scope of protection.
As shown in Fig. 1, a voice abnormality detection method based on sound-source characteristics comprises the following steps:
(1) collect voice data in real time through a sensor;
(2) distinguish the speech segments and noise segments of the voice data by endpoint detection, to decide whether to proceed with the next speech-signal processing step;
The invention uses a speech endpoint detection method based on energy and short-time zero-crossing rate to distinguish speech segments efficiently. This is an existing, mature detection method and is not elaborated here.
(3) frame and window the voice data of the obtained speech segments, and apply high-frequency pre-emphasis to each frame;
Pre-emphasis: the average power spectrum of the speech signal is shaped by the glottal excitation and mouth/nose radiation, and above roughly 800 Hz it decays by about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component. Therefore, before analyzing the speech signal, its high-frequency part is boosted through a first-order finite impulse response high-pass filter.
Framing: since the speech signal is short-time stationary, the signal can be processed frame by frame. Macroscopically, a frame must be short enough that the signal within it is stationary, i.e. the length of a frame should be less than the length of a phoneme. At a normal speaking rate, a phoneme lasts about 50-200 milliseconds, so the frame length is generally less than 50 milliseconds. Microscopically, a frame must contain enough vibration periods, because the Fourier transform analyzes frequency, and a frequency can only be resolved from sufficiently many repetitions. The fundamental frequency of speech is around 100 Hz for male voices and around 200 Hz for female voices, corresponding to periods of 10 and 5 milliseconds; since a frame should contain multiple periods, a length of at least 20 milliseconds is generally taken.
Windowing: a Hamming window is applied to each speech frame; it has good frequency resolution and also reduces spectral leakage, thereby reducing the influence of the Gibbs effect.
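The pre-emphasis, framing, and Hamming-windowing steps described above can be sketched as below. The 0.97 pre-emphasis coefficient and the 25 ms/10 ms frame layout are common defaults assumed for illustration, not values specified by the patent.

```python
import numpy as np

def preprocess(x, fs=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasize, frame, and Hamming-window a speech signal.
    Returns an array of shape (num_frames, frame_len)."""
    # first-order FIR high-pass: y[n] = x[n] - alpha * x[n-1], boosting high frequencies
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    frame_len = int(fs * frame_ms / 1000)   # 25 ms: several pitch periods, under one phoneme
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(y) - frame_len) // hop)
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n)])
    # the Hamming window tapers frame edges, reducing spectral leakage (Gibbs effect)
    return frames * np.hamming(frame_len)
```

Each windowed frame can then be passed to the inverse-filtering stage of step (4).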
(4) as shown in Fig. 3, for the voice data of the speech segments, obtain the glottal wave signal by iterative adaptive inverse filtering:
(a) establish a vocal-tract model using iterative adaptive inverse filtering (IAIF); the iterative adaptive inverse filtering removes the influence of the glottal excitation from the spectrum of the original speech signal;
(b) then remove the influence of the formants by inverse filtering (IF); specifically, an acoustic model is established by linear predictive coding (LPC) and the discrete all-pole model (DAP), and the glottal wave signal is finally obtained by inverse filtering (IF), as shown in Fig. 2.
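A minimal numpy sketch of LPC-based inverse filtering in the spirit of IAIF is shown below. It is a simplified single-pass variant (coarse glottal-tilt removal, vocal-tract LPC fit, inverse filtering, integration) and omits the repeated refinement iterations and the DAP modeling of the full algorithm; the model orders are illustrative assumptions.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC; returns inverse-filter coefficients [1, -a1, ..., -ap]."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def glottal_wave(frame, fs=16000):
    """Simplified IAIF-style glottal-flow estimate for one windowed frame."""
    g1 = lpc(frame, 1)                              # coarse glottal spectral tilt
    no_tilt = np.convolve(frame, g1)[:len(frame)]   # cancel tilt before the tract fit
    vt = lpc(no_tilt, 2 + fs // 1000)               # vocal-tract all-pole model
    residual = np.convolve(frame, vt)[:len(frame)]  # inverse-filter the vocal tract
    return np.cumsum(residual)                      # integrate: undo lip radiation
```

The returned signal approximates the glottal flow from which the NAQ and CPR parameters of step (5) are computed.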
(5) extract the glottal-wave characteristic parameters: the normalized amplitude quotient and the glottal closed-phase ratio;
Under working-stress conditions, contraction of the vocal-fold muscles randomizes the vocal-fold vibration, which changes the airflow pattern through the glottis and causes the speech signal to vary. These changes in vocal-fold behavior are reflected in the features of the glottal wave, so the glottal wave can reflect working stress to a certain extent. We use the normalized amplitude quotient (NAQ) and the glottal closed-phase ratio (CPR) to characterize the intrinsic properties of the glottal wave; the proposed features have specific physical meaning and reflect the different vibration modes of the vocal folds during speech production.
Glottal-wave characteristic parameter one: the normalized amplitude quotient, which mainly reflects the closing behavior of the vocal folds, is extracted as follows:
NAQ = AQ / T = f_ac / (dpeak · T)    (1)
where NAQ is the normalized amplitude quotient; T is the pitch period; AQ is the amplitude quotient, the ratio of the maximum glottal-wave amplitude to the maximum negative peak of its first derivative:
AQ = f_ac / dpeak    (2)
where f_ac is the maximum peak value of the glottal wave, and dpeak is the maximum negative peak of the first derivative of the glottal wave.
Since the instants of glottal closure and opening need not be measured, AQ is easy to obtain, but its value depends on the measured fundamental frequency (F0) of the signal; therefore, in formula (1), it is normalized by the pitch period to obtain NAQ, eliminating the dependence on the fundamental-frequency measurement.
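Given a glottal-flow estimate and a fundamental-frequency estimate, formulas (1) and (2) translate directly into code; the raised-cosine pulse in the usage example is a synthetic illustration, not data from the patent.

```python
import numpy as np

def naq(g, f0, fs):
    """Normalized amplitude quotient, formulas (1)-(2):
    NAQ = AQ / T = f_ac / (dpeak * T), with T the pitch period in samples."""
    T = fs / f0                      # pitch period in samples
    f_ac = np.max(g)                 # maximum peak value of the glottal wave
    dpeak = -np.min(np.diff(g))      # magnitude of the most negative derivative peak
    return (f_ac / dpeak) / T        # AQ normalized by the pitch period
```

For a raised-cosine test pulse g(k) = 0.5 * (1 - cos(2*pi*k/T)) spanning one period, NAQ evaluates to about 1/pi, independent of f0, which illustrates the F0-normalization that formula (1) provides.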
Glottal-wave characteristic parameter two: the glottal closed-phase ratio (CPR). The CPR parameter reflects the proportion of the glottal closed phase relative to the total glottal opening time, and mainly manifests as the skewness of the glottal signal within the glottal wave.
The glottal closed-phase ratio is extracted as follows:
CPR = CP / O    (3)
where CPR is the glottal closed-phase ratio; CP is the duration of the glottal closed phase; O is the total glottal opening time.
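A sketch of formula (3): in a sampled glottal-flow cycle, the closed phase can be approximated as the samples where the flow stays near its baseline, and the open phase as the rest. The 5% baseline threshold is an assumption for illustration; the patent does not specify how the phase boundaries are located.

```python
import numpy as np

def cpr(g, thresh_frac=0.05):
    """Closed-phase ratio, formula (3): CPR = CP / O, where CP counts the
    samples in the closed phase (flow near baseline) and O the open-phase samples."""
    thresh = g.min() + thresh_frac * (g.max() - g.min())
    cp = int(np.sum(g <= thresh))    # closed-phase duration in samples
    o = len(g) - cp                  # open-phase duration in samples
    return cp / o
```

A more skewed (longer-closed) cycle yields a larger CPR, matching the skewness interpretation given above.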
(6) feed the extracted data into a trained SVM model for classification;
The support vector machine (SVM) has always played an important role in pattern recognition [8]; the so-called support vectors are the training sample points at the edge of the margin. SVM classifies using linear or nonlinear hyperplanes. It is built on the VC-dimension theory of statistical learning theory and the principle of structural risk minimization: based on limited sample information, it seeks the best compromise between model complexity (the learning precision on the given training samples) and learning ability (the ability to recognize arbitrary samples without error), in order to obtain the best generalization ability. The support vector machine is essentially a two-class nonlinear classifier, and is well suited to the distinctive characteristics of stressed-speech recognition: (1) a speaker is not under stress at every moment while speaking — stress appears as short-lived transients in continuous speech, so only a small number of samples can be labeled as stress-varied speech, making stressed-speech recognition a typical small-sample problem; (2) stressed-speech recognition is a typical two-class recognition problem. We establish a speaker-dependent SVM classification model; since the number of samples per subject is relatively small, this is a typical small-sample setting, in which the SVM model achieves relatively good recognition results.
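To keep the sketch self-contained, a tiny linear SVM trained by sub-gradient descent on the hinge loss is shown below, with two-dimensional [NAQ, CPR] feature vectors in mind. In practice a library SVM (possibly with a nonlinear kernel, which the description above allows) would be used instead; all hyperparameters here are illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Linear soft-margin SVM via hinge-loss sub-gradient descent.
    X: (n, d) feature matrix (e.g. one [NAQ, CPR] pair per sample); y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:         # inside the margin: hinge active
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                  # outside: only regularization shrinks w
                w = (1 - lr * lam) * w
    return w, b

def predict(w, b, X):
    """Sign of the decision function; by convention +1 = stressed/abnormal, -1 = normal."""
    return np.sign(X @ w + b)
```

The returned sign plays the role of the classification label of step (7).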
In the invention, the SVM classification model classifies speech under stressed and normal conditions, which evaluates the sensitivity of the proposed sound-source parameters to the stress-varied state and thereby verifies the validity of the proposed method.
(7) obtain the classification label, use it to judge the speaker's condition, output the speaker-condition label, and pass it to the execution module for feedback.
Embodiment 1
We used a database collected by Fujitsu containing the speech samples of 11 speakers (4 male and 7 female). To simulate specific situations that generate psychological stress, three different tasks were set for the speakers, who carried out the tasks while holding a telephone conversation with an operator, simulating stress during a phone call.
The three tasks were: (A) high concentration (the speaker must complete tasks including solving logic puzzles and finding the differences between two pictures); (B) time pressure (the speaker must answer questions under time pressure); (C) risk-taking behavior (a risk-taking task is used to assess the speaker's desire for monetary gain). Each speaker took part in four dialogues with different tasks: in two of the dialogues the speaker was required to complete the tasks within a limited time, while in the other dialogues there was no task and the speaker could chat freely.
The segments intercepted from the speech are the vowels /a/, /i/, /u/, /e/, /o/. The experiments were carried out per speaker, and all results were determined per speaker. The experiments on the 11 chosen subjects were conducted in a speaker-dependent setting; the number of samples depends on the speaker, and the total number of speech samples is 700.
In the invention, the verification data all come from telephone communication data, with 100 subjects (50 male, 50 female) participating in the experiment. In the experiment, the operator chatted with each subject by telephone, with four groups of dialogues per person on average and 10 minutes of chatting per group, recording realistic voice communication data. In the four groups of dialogues, two groups were chatting in a relaxed state, while in the other two groups the subjects were put under different types of stress: (1) multi-task work; (2) time pressure; (3) risk taking — see Table 1 for details. The real speech data of the subjects speaking under stress were recorded for verifying the validity of the stress detection method.
Table 1
To verify the validity of the proposed method, the invention uses the receiver operating characteristic (ROC) curve to evaluate the recognition performance of the different parameters, as shown in Fig. 4 and Fig. 5. The ROC curve is obtained from a series of binary classifications at different cutoff values (decision thresholds), plotting the true positive rate (sensitivity) on the ordinate against the false positive rate (1 - specificity) on the abscissa. The closer the ROC curve lies to the upper-left corner, the larger the area under the curve (AUC), the better the recognition performance of the method, and the higher its accuracy.
True positive rate (TPR): TPR = TP / (TP + FN)
False positive rate (FPR): FPR = FP / (FP + TN)
where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
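The ROC/AUC evaluation can be sketched by sweeping the cutoff over every classifier score; this is the standard construction described above, not code from the patent.

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points and AUC. scores: classifier scores (higher = more abnormal);
    labels: 1 = abnormal (positive), 0 = normal (negative)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    lab = np.asarray(labels)[order]
    P = lab.sum()                                            # number of positives
    N = len(lab) - P                                         # number of negatives
    tpr = np.concatenate(([0.0], np.cumsum(lab) / P))        # sensitivity at each cutoff
    fpr = np.concatenate(([0.0], np.cumsum(1 - lab) / N))    # 1 - specificity at each cutoff
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoid rule
    return fpr, tpr, auc
```

A perfectly separating score list yields AUC = 1.0, matching the upper-left-corner interpretation given above.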
The invention compares the proposed source-model parameters with traditional parameters in terms of average recognition rate for stress detection, showing that the method based on speech-production modeling has a clear advantage among stress detection methods, thereby achieving the goal of distinguishing the normal state from the abnormal state. Three traditional speech parameters — the fundamental frequency, mel-frequency cepstral coefficients, and the parabolic spectral parameter (F0, MFCC, PSP) — are used as the experimental comparison group.
In the classification stage, NAQ and CPR are taken as a two-dimensional vector to build the SVM model; 125 groups of samples are chosen as the training set and 125 groups as the test set. The test selects the speech samples of 7 different speakers (4 male, 3 female) from the database, aiming to eliminate speech-parameter variation caused by individual specificity; meanwhile F0, MFCC, and PSP are each trained in one-dimensional sample form as the experimental comparison group. As shown in Fig. 6, after 50 rounds of experiments, the average recognition rate of each parameter under the SVM classification model is calculated. It can be seen that, compared with the traditional parameters, the NAQ and CPR sound-source features show good recognition performance for abnormal speech under the abnormal state.
The above are only preferred embodiments of the invention. It should be noted that, for those of ordinary skill in the art, several improvements and variations can be made without departing from the technical principles of the invention, and these improvements and variations should also be regarded as within the protection scope of the invention.
Claims (6)
1. A voice abnormality detection method based on sound-source characteristics, characterized by comprising the following steps:
(1) collecting voice data in real time through a sensor;
(2) distinguishing the speech segments and noise segments of the voice data by endpoint detection, to decide whether to proceed with the next speech-signal processing step;
(3) framing and windowing the voice data of the obtained speech segments, and applying high-frequency pre-emphasis to each frame;
(4) for the voice data of the speech segments, obtaining the glottal wave signal by iterative adaptive inverse filtering;
(5) extracting the glottal-wave characteristic parameters: the normalized amplitude quotient and the glottal closed-phase ratio;
(6) feeding the extracted data into a trained SVM model for classification;
(7) obtaining the classification label, using it to judge the speaker's condition, outputting the speaker-condition label, and passing it to the execution module for feedback.
2. The voice abnormality detection method based on sound-source characteristics according to claim 1, characterized in that: in step (3), a Hamming window is used to window each speech frame.
3. The voice abnormality detection method based on sound-source characteristics according to claim 1, characterized in that: in step (3), the high-frequency pre-emphasis boosts the high-frequency part of each frame through a first-order finite impulse response high-pass filter.
4. The voice abnormality detection method based on sound-source characteristics according to claim 1, characterized in that: in step (4), the glottal wave signal is obtained as follows:
(a) establish a vocal-tract model using iterative adaptive inverse filtering; the iterative adaptive inverse filtering removes the influence of the glottal excitation from the spectrum of the original speech signal;
(b) then remove the influence of the formants by inverse filtering; specifically, an acoustic model is established by linear predictive coding and the discrete all-pole model, and the glottal wave signal is finally obtained by inverse filtering.
5. The voice abnormality detection method based on sound-source characteristics according to claim 1, characterized in that: in step (5), the normalized amplitude quotient of the glottal wave is extracted as follows:
NAQ = AQ / T = f_ac / (dpeak · T)    (1)
where NAQ is the normalized amplitude quotient; T is the pitch period; AQ is the amplitude quotient, the ratio of the maximum glottal-wave amplitude to the maximum negative peak of its first derivative:
AQ = f_ac / dpeak    (2)
where f_ac is the maximum peak value of the glottal wave, and dpeak is the maximum negative peak of the first derivative of the glottal wave.
6. The voice abnormality detection method based on sound-source characteristics according to claim 1, characterized in that: in step (5), the glottal closed-phase ratio is extracted as follows:
CPR = CP / O    (3)
where CPR is the glottal closed-phase ratio; CP is the duration of the glottal closed phase; O is the total glottal opening time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810126670.6A CN108074585A (en) | 2018-02-08 | 2018-02-08 | A kind of voice method for detecting abnormality based on sound source characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108074585A true CN108074585A (en) | 2018-05-25 |
Family
ID=62155229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810126670.6A Pending CN108074585A (en) | 2018-02-08 | 2018-02-08 | A kind of voice method for detecting abnormality based on sound source characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108074585A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011149558A2 (en) * | 2010-05-28 | 2011-12-01 | Abelow Daniel H | Reality alternate |
CN102324229A (en) * | 2011-09-08 | 2012-01-18 | 中国科学院自动化研究所 | Method and system for detecting abnormal use of voice input equipment |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
US9338547B2 (en) * | 2012-06-26 | 2016-05-10 | Parrot | Method for denoising an acoustic signal for a multi-microphone audio device operating in a noisy environment |
2018-02-08: Application CN201810126670.6A filed (publication CN108074585A/en), status: Pending
Non-Patent Citations (1)
Title |
---|
Li Ning: "Research on Pathological Noise Classification Based on Acoustic Parameters and Support Vector Machines", East China Normal University * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113824843A (en) * | 2020-06-19 | 2021-12-21 | 大众问问(北京)信息科技有限公司 | Voice call quality detection method, device, equipment and storage medium |
CN113824843B (en) * | 2020-06-19 | 2023-11-21 | 大众问问(北京)信息科技有限公司 | Voice call quality detection method, device, equipment and storage medium |
CN112735386A (en) * | 2021-01-18 | 2021-04-30 | 苏州大学 | Voice recognition method based on glottal wave information |
CN112735386B (en) * | 2021-01-18 | 2023-03-24 | 苏州大学 | Voice recognition method based on glottal wave information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hansen et al. | Speaker recognition by machines and humans: A tutorial review | |
US8428945B2 (en) | Acoustic signal classification system | |
CN109044396B (en) | Intelligent heart sound identification method based on bidirectional long-time and short-time memory neural network | |
CN106941005A (en) | A kind of vocal cords method for detecting abnormality based on speech acoustics feature | |
CN101923855A (en) | Test-irrelevant voice print identifying system | |
Sahoo et al. | Silence removal and endpoint detection of speech signal for text independent speaker identification | |
Vikram et al. | Estimation of Hypernasality Scores from Cleft Lip and Palate Speech. | |
Kim et al. | Hierarchical approach for abnormal acoustic event classification in an elevator | |
CN110265063A (en) | A kind of lie detecting method based on fixed duration speech emotion recognition sequence analysis | |
Subhashree et al. | Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling | |
CN108074585A (en) | A kind of voice method for detecting abnormality based on sound source characteristics | |
Whitehill et al. | Whosecough: In-the-wild cougher verification using multitask learning | |
Iwok et al. | Evaluation of Machine Learning Algorithms using Combined Feature Extraction Techniques for Speaker Identification | |
CN110415707B (en) | Speaker recognition method based on voice feature fusion and GMM | |
Kalimoldayev et al. | Voice verification and identification using i-vector representation | |
Thomas et al. | Data-driven voice source waveform modelling | |
Warule et al. | Hilbert-Huang Transform-Based Time-Frequency Analysis of Speech Signals for the Identification of Common Cold | |
CN113823267B (en) | Automatic depression recognition method and device based on voice recognition and machine learning | |
Islam et al. | Neural-Response-Based Text-Dependent speaker identification under noisy conditions | |
Shofiyah et al. | Voice recognition system for home security keys with mel-frequency cepstral coefficient method and backpropagation artificial neural network | |
Komlen et al. | Text independent speaker recognition using LBG vector quantization | |
Estrebou et al. | Voice recognition based on probabilistic SOM | |
Godino-Llorente et al. | Automatic detection of voice impairments due to vocal misuse by means of gaussian mixture models | |
Arsikere et al. | Speaker recognition via fusion of subglottal features and MFCCs. | |
Pandiaraj et al. | A confidence measure based—Score fusion technique to integrate MFCC and pitch for speaker verification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2018-05-25 |