CN109935241A - Voice information processing method - Google Patents
- Publication number: CN109935241A
- Application number: CN201711363536.XA
- Authority
- CN
- China
- Prior art keywords
- frame
- speech segment
- emotion
- voice
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Telephonic Communication Services (AREA)
Abstract
Embodiments of the invention provide a voice information processing method, a computer device and a computer-readable storage medium, which solve the prior-art problem that the emotional states of agents and customers in a call center system cannot be monitored in real time. The voice information processing method includes: extracting an audio feature vector of a speech segment in an audio stream to be identified, where the speech segment corresponds to one utterance in the audio stream to be identified; matching the audio feature vector of the speech segment against a plurality of emotion feature models, where each emotion feature model corresponds to one of a plurality of emotion classes; and taking the emotion class corresponding to the matching emotion feature model as the emotion class of the speech segment.
Description
Technical field
The present invention relates to the technical field of intelligent interaction, and in particular to a voice information processing method, a computer device and a computer-readable storage medium.
Background
A call center system is an operating system that uses modern communication and computer technology to handle a large volume of diverse inbound and outbound telephone traffic automatically and flexibly in order to deliver services. As the economy develops, the volume of agent-customer interactions in call center systems keeps growing, and tracking and monitoring the emotional states of agents and customers in service calls promptly and effectively is of great significance for enterprises seeking to improve their service quality. At present, most enterprises rely on dedicated quality inspectors who spot-check call recordings for this purpose. This brings additional cost, and because the sampling coverage is uncertain and the human judgment involved carries subjective bias, manual quality inspection has clear limitations. Moreover, quality inspectors can only obtain the recording after a call has ended and evaluate the emotional performance of agent and customer after the fact; it is difficult to monitor their emotional states in real time while the call is in progress, so when an agent or customer shows very negative emotions during a call, the agent cannot be alerted promptly and effectively.
Summary of the invention
In view of this, embodiments of the invention provide a voice information processing method, a computer device and a computer-readable storage medium, solving the prior-art problem that the emotional states of agents and customers in a call center system cannot be monitored in real time.
A voice information processing method provided by an embodiment of the invention includes:

extracting an audio feature vector of a speech segment in an audio stream to be identified, where the speech segment corresponds to one utterance in the audio stream to be identified;

matching the audio feature vector of the speech segment against a plurality of emotion feature models, where each of the emotion feature models corresponds to one of a plurality of emotion classes, and the emotion feature models are established by pre-learning the respective audio feature vectors of a plurality of preset speech segments carrying emotion class labels for those emotion classes; and

taking the emotion class corresponding to the matching emotion feature model as the emotion class of the speech segment. The audio feature vector includes one or more of the following audio features: an energy feature, a voiced-frame-count feature, a pitch frequency feature, a formant feature, a harmonics-to-noise-ratio feature and a mel-frequency cepstral coefficient feature. The speech segment includes an agent speech segment in the audio stream to be identified, and the plurality of emotion classes include: satisfied, calm and annoyed.
Optionally, the pre-learning process includes:

clustering the respective audio feature vectors of the plurality of preset speech segments carrying emotion class labels for the plurality of emotion classes, to obtain a clustering result for the preset emotion classes; and

according to the clustering result, training the audio feature vectors of the preset speech segments in each cluster into one emotion feature model.
Optionally, when the emotion feature models are Gaussian mixture models, matching the audio feature vector of the speech segment against the plurality of emotion feature models includes:

computing the likelihood of the audio feature vector of the speech segment under each of the plurality of emotion feature models;

and taking the emotion class corresponding to the matching emotion feature model as the emotion class of the speech segment includes:

taking the emotion class corresponding to the emotion feature model whose likelihood is greatest and exceeds a preset threshold as the emotion class of the speech segment.
Optionally, before extracting the audio feature vector of the speech segment in the audio stream to be identified, the method further includes:

determining a voice start frame and a voice end frame in the audio stream to be identified; and

extracting the portion of the audio stream between the voice start frame and the voice end frame as the speech segment.

Optionally, determining the voice start frame and the voice end frame in the audio stream to be identified includes:

judging whether each speech frame in the audio stream to be identified is a voiced frame or an unvoiced frame;

after the voice end frame of the previous speech segment, or when no speech segment has been identified yet, once a first preset number of speech frames are consecutively judged to be voiced frames, taking the first of those speech frames as the voice start frame of the current speech segment; and

after the voice start frame of the current speech segment, once a second preset number of speech frames are consecutively judged to be unvoiced frames, taking the first of those speech frames as the voice end frame of the current speech segment.
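The start/end frame rule above can be sketched as a small state machine over per-frame voiced/unvoiced decisions. This is a minimal illustration, not the patented implementation; the threshold counts `n_start` and `n_end` stand in for the first and second preset numbers, which the text does not fix.

```python
def find_segments(frame_is_voiced, n_start=5, n_end=15):
    """Locate (start, end) frame index pairs using the hangover rule described
    above: n_start consecutive voiced frames open a segment, and n_end
    consecutive unvoiced frames close it. Counts are illustrative."""
    segments = []
    start = None
    run = 0
    for i, voiced in enumerate(frame_is_voiced):
        if start is None:
            # waiting for a voice start frame
            run = run + 1 if voiced else 0
            if run == n_start:
                start = i - n_start + 1   # first frame of the voiced run
                run = 0
        else:
            # inside a segment, waiting for a voice end frame
            run = run + 1 if not voiced else 0
            if run == n_end:
                end = i - n_end + 1       # first frame of the unvoiced run
                segments.append((start, end))
                start, run = None, 0
    return segments
```

Because a segment only closes after `n_end` silent frames, short pauses inside one utterance do not split it, which matches the one-utterance-per-segment goal stated earlier.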
Optionally, the energy feature includes a short-time-energy first-order difference and/or the amount of energy below a preset frequency; and/or

the pitch frequency feature includes a pitch frequency and/or a pitch frequency first-order difference; and/or

the formant feature includes one or more of the following: a first formant, a second formant, a third formant, a first-formant first-order difference, a second-formant first-order difference and a third-formant first-order difference; and/or

the mel-frequency cepstral coefficient feature includes orders 1-12 of the mel-frequency cepstral coefficients and/or orders 1-12 of their first-order differences.

Optionally, each audio feature is characterized by one or more of the following statistics: ratio, mean, maximum, median and standard deviation.
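The statistics listed above are functionals applied to the per-frame trajectory of each audio feature. A minimal sketch of that pattern, with illustrative names not taken from the patent:

```python
import numpy as np

def functionals(x):
    """Mean, max, median and standard deviation of one per-frame feature
    trajectory, as in the statistics listed above."""
    x = np.asarray(x, dtype=float)
    return {"mean": x.mean(), "max": x.max(),
            "median": np.median(x), "std": x.std()}

def delta(x):
    """First-order difference of a per-frame trajectory, used for the
    'first-order difference' variants of the features."""
    x = np.asarray(x, dtype=float)
    return x[1:] - x[:-1]
```

Applying `functionals` to each base trajectory and to its `delta` is what turns a variable-length segment into a fixed-length feature vector.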
Optionally, the energy feature includes: the mean, maximum, median and standard deviation of the short-time-energy first-order difference, and/or the ratio of the energy below a preset frequency to the total energy; and/or

the voiced-frame-count feature includes: the ratio of the number of voiced frames to the number of silent frames, and/or the ratio of the number of voiced frames to the total number of frames; and/or

the pitch frequency feature includes: the mean, maximum, median and standard deviation of the pitch frequency, and/or the mean, maximum, median and standard deviation of the pitch frequency first-order difference; and/or

the formant feature includes one or more of the following: the mean, maximum, median and standard deviation of the first formant; of the second formant; of the third formant; of the first-formant first-order difference; of the second-formant first-order difference; and of the third-formant first-order difference; and/or

the mel-frequency cepstral coefficient feature includes the mean, maximum, median and standard deviation of the order-1-12 mel-frequency cepstral coefficients and/or of their first-order differences.
Optionally, the voice information processing method further includes:

displaying the emotion class of the currently identified speech segment; and/or

collecting statistics on the emotion classes of the speech segments identified within a preset time period; and/or

sending an emotion response message corresponding to the emotion class of the identified speech segment.
A computer device provided by an embodiment of the invention includes a memory, a processor, and a computer program stored in the memory and executable by the processor; when the processor executes the computer program, the steps of the voice information processing method described above are carried out.

A computer-readable storage medium provided by an embodiment of the invention has a computer program stored thereon; when the computer program is executed by a processor, the steps of the voice information processing method described above are carried out.
The voice information processing method, computer device and computer-readable storage medium provided by embodiments of the invention extract the audio feature vector of a speech segment in the audio stream to be identified and match the extracted audio feature vector against pre-established emotion feature models, thereby achieving real-time emotion identification of speech segments. In an application scenario such as a call center system, the emotional states of agents and customers can thus be monitored in real time during service calls, which can markedly improve the service quality of the enterprise using the call center system and the service experience of its customers.
Brief description of the drawings

Fig. 1 is a flow diagram of a voice information processing method provided by an embodiment of the invention.

Fig. 2 is a flow diagram of the pre-learning process for establishing emotion feature models in a voice information processing method provided by an embodiment of the invention.

Fig. 3 is a flow diagram of extracting a speech segment in a voice information processing method provided by an embodiment of the invention.

Fig. 4 is a flow diagram of determining the voice start frame and voice end frame in the audio stream to be identified in a voice information processing method provided by an embodiment of the invention.

Fig. 5 is a flow diagram of detecting voiced and unvoiced frames in a voice information processing method provided by an embodiment of the invention.
Detailed description

The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
Fig. 1 is a flow diagram of a voice information processing method provided by an embodiment of the invention. As shown in Fig. 1, the voice information processing method includes:

Step 101: extract the audio feature vector of a speech segment in the audio stream to be identified, where the speech segment corresponds to one utterance in the audio stream to be identified.
The audio feature vector includes at least one audio feature. In effect, a single vector in an at-least-one-dimensional vector space characterizes all of the audio features: each dimension of the space corresponds to one statistic of one audio feature, the direction and magnitude of the audio feature vector can be regarded as the vector-space sum of the different statistics of the various audio features, and each statistic of each audio feature can be regarded as one component of the audio feature vector. Speech segments carrying different emotions necessarily exhibit different audio features; the invention exploits exactly this correspondence between different emotions and different audio features to identify the emotion of a speech segment. Specifically, the audio feature vector may include one or more of the following audio features: an energy feature, a voiced-frame-count feature, a pitch frequency feature, a formant feature, a harmonics-to-noise-ratio feature and a mel-frequency cepstral coefficient feature. In an embodiment of the invention, these audio features may be characterized by one or more of the following statistics: ratio, mean, maximum, median and standard deviation.
The energy feature refers to the power-spectrum characteristic of the speech segment and can be obtained by summing the power spectrum:

E(k) = sum_{j=1..N} P(k, j),

where E denotes the energy, k is the frame index, j is the frequency-bin index, N is the frame length and P is the power spectrum. In an embodiment of the invention, the energy feature may include a short-time-energy first-order difference and/or the amount of energy below a preset frequency. The short-time-energy first-order difference may be computed as:

vE(k) = (-2*E(k-2) - E(k-1) + E(k+1) + 2*E(k+2)) / 3.

The energy below a preset frequency may be measured as a ratio; for example, the ratio of the band energy below 500 Hz to the total energy may be computed as:

p1 = [ sum_{k=k1..k2} sum_{j=1..j500} P(k, j) ] / [ sum_{k=k1..k2} sum_{j=1..N} P(k, j) ],

where j500 is the frequency-bin index corresponding to 500 Hz, k1 is the index of the voice start frame of the speech segment to be identified, and k2 is the index of its voice end frame.
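The three energy quantities above can be computed directly from a power-spectrum matrix. A minimal sketch, assuming `P` is a frames-by-bins array already computed for the segment:

```python
import numpy as np

def frame_energy(P):
    """Short-time energy per frame: E(k) = sum over bins j of P(k, j),
    where P is the power spectrum (frames x frequency bins)."""
    return P.sum(axis=1)

def energy_delta(E, k):
    """First-order difference as in the text:
    vE(k) = (-2*E(k-2) - E(k-1) + E(k+1) + 2*E(k+2)) / 3."""
    return (-2*E[k-2] - E[k-1] + E[k+1] + 2*E[k+2]) / 3

def low_band_ratio(P, j500):
    """Share of the segment's total energy below a preset frequency,
    with j500 the bin index corresponding to that frequency (p1 above)."""
    return P[:, :j500].sum() / P.sum()
```

Note `energy_delta` needs two frames of context on each side, so it is defined only for interior frames of the segment.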
The voiced-frame-count feature refers to the relative number of voiced frames in the speech segment, which can likewise be measured as ratios. For example, if the numbers of voiced frames and silent frames in the speech segment are n1 and n2 respectively, the ratio of voiced frames to silent frames is p2 = n1/n2, and the ratio of voiced frames to the total number of frames is p3 = n1/(n1 + n2).
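The two ratios are straightforward; a one-function sketch with illustrative names:

```python
def frame_count_features(n_voiced, n_silent):
    """Ratios described above: p2 = voiced/silent frames,
    p3 = voiced/total frames."""
    return {"p2": n_voiced / n_silent,
            "p3": n_voiced / (n_voiced + n_silent)}
```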
The pitch frequency feature may be extracted with an algorithm based on the autocorrelation function of the linear prediction (LPC) error signal, and may include the pitch frequency and/or its first-order difference. The pitch algorithm may proceed as follows. First, compute the linear prediction coefficients of a voiced frame x(k) and form the linear prediction estimate x̂(k). Second, compute the autocorrelation function c1 of the error signal e(k) = x(k) - x̂(k):

c1(h) = sum_k e(k) * e(k+h).

Then, within the lag range corresponding to pitch frequencies of 80-500 Hz, find the maximum of the autocorrelation function and record the corresponding lag Δh. The pitch frequency F0 is then F0 = Fs/Δh, where Fs is the sampling frequency.
The formant feature may be extracted with an algorithm based on polynomial rooting of the linear prediction coefficients, and may include the first, second and third formants and the first-order differences of all three. The harmonics-to-noise-ratio (HNR) feature may be extracted with an algorithm based on independent component analysis (ICA). The mel-frequency cepstral coefficient (MFCC) feature may include the order-1-12 mel-frequency cepstral coefficients and their first-order differences, and can be obtained with the usual MFCC computation procedure, which is not repeated here.
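The polynomial-rooting step for formants can be sketched as below: the roots of the LPC polynomial A(z) = 1 - sum_i a_i z^-i that lie in the upper half plane are converted from angles to frequencies, and the lowest three are taken as F1-F3. This is a minimal illustration (no bandwidth filtering of spurious roots, which a practical extractor would add).

```python
import numpy as np

def formants_from_lpc(a, fs):
    """Formant frequencies (Hz) from LPC coefficients a, via rooting of
    A(z) = 1 - sum a_i z^-i; the first three sorted root frequencies
    approximate F1-F3."""
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a))))
    roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[:3]
```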
It should be understood that which audio features are extracted may depend on the demands of the actual scenario; the invention places no limitation on the types, number or vector arrangement of the audio features underlying the audio feature vector. In an embodiment of the invention, however, in order to obtain the best emotion identification results, all six of the audio features above may be extracted: the energy feature, the voiced-frame-count feature, the pitch frequency feature, the formant feature, the harmonics-to-noise-ratio feature and the mel-frequency cepstral coefficient feature. For example, when all six audio features are extracted, the audio feature vector may include the 173 components shown in Table 1 below; using this audio feature vector together with a Gaussian mixture model (GMM) as the emotion feature model, the accuracy of speech emotion identification on the CASIA Chinese emotion corpus can reach 74% to 80%.
Table 1
In an embodiment of the invention, the audio stream to be identified may be a service-interaction audio stream, and a speech segment corresponds to one customer speech segment or one agent speech segment in that stream. Since a service interaction usually takes a question-and-answer form, a customer speech segment may correspond to one question or answer by the customer within an interaction, and an agent speech segment may correspond to one question or answer by the agent. Since a customer or agent is generally considered to express an emotion completely within one question or answer, using a single customer or agent speech segment as the unit of emotion identification guarantees both the completeness of the emotion identification and its real-time character during the service interaction.
Step 102: match the audio feature vector of the speech segment against a plurality of emotion feature models, where each emotion feature model corresponds to one of a plurality of emotion classes.

These emotion feature models can be established by pre-learning the respective audio feature vectors of a plurality of preset speech segments carrying emotion class labels for the plurality of emotion classes. This is equivalent to establishing a correspondence between emotion feature models and emotion classes, with each emotion feature model corresponding to one emotion class. As shown in Fig. 2, the pre-learning process for establishing the emotion feature models may include: first, clustering the respective audio feature vectors of the preset speech segments carrying emotion class labels, to obtain a clustering result for the preset emotion classes (S21); then, according to the clustering result, training the audio feature vectors of the preset speech segments in each cluster into one emotion feature model (S22). Given these emotion feature models, the emotion feature model corresponding to the current speech segment, and in turn its emotion class, can be obtained through a matching process based on the audio feature vector.
In an embodiment of the invention, the emotion feature models may be Gaussian mixture models (GMMs) with a degree of mixing of 5. The emotion feature vectors of the speech samples of each emotion class may first be clustered with the K-means algorithm, and the initial parameters of the Gaussian mixture model computed from the clustering result (the number of iterations may be 50). The E-M algorithm is then used to train the Gaussian mixture model corresponding to each emotion class (with 200 iterations). When these Gaussian mixture models are used in the emotion-class matching process, the likelihoods of the audio feature vector of the current speech segment under the respective emotion feature models may be computed, and the matching emotion feature model determined by comparing these likelihoods: for example, the emotion feature model whose likelihood is greatest and also exceeds a preset threshold is taken as the matching model.
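The training and matching scheme above can be sketched with scikit-learn, whose `GaussianMixture` performs the k-means initialisation and E-M fitting internally, standing in for the hand-rolled K-means/E-M loop described in the text. The class names, the toy 2-D features and the threshold value are all illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_label, n_components=5):
    """One GMM per emotion class, mirroring the scheme above
    (degree of mixing 5, k-means initialisation, EM training)."""
    models = {}
    for label, X in features_by_label.items():
        gmm = GaussianMixture(n_components=n_components,
                              init_params="kmeans", max_iter=200,
                              random_state=0)
        models[label] = gmm.fit(np.asarray(X))
    return models

def classify(models, v, threshold=-1e9):
    """Pick the class whose model gives the highest log-likelihood for the
    segment's feature vector v, provided it exceeds the preset threshold;
    otherwise report no match (None)."""
    scores = {lbl: m.score_samples(v.reshape(1, -1))[0]
              for lbl, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Returning `None` below the threshold reflects the "greater than a preset threshold" condition: a segment whose features fit no emotion model well is left unclassified rather than forced into a class.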
It should be understood that although the description above presents the emotion feature model as a Gaussian mixture model, the emotion feature model may also be realized in other forms, such as a support vector machine (SVM) model, a K-nearest-neighbor (KNN) model, a hidden Markov model (HMM) or an artificial neural network (ANN) model. The invention does not strictly limit the concrete realization of the emotion feature model. It should likewise be understood that the realization of the matching process may be adjusted according to the realization of the emotion feature model, and the invention places no limitation on the concrete form of the matching process either.
In an embodiment of the invention, the plurality of emotion classes may include: satisfied, calm and annoyed, corresponding to the emotional states an agent is likely to exhibit in a service-interaction scenario. In another embodiment, the plurality of emotion classes may include: satisfied, calm, annoyed and angry, corresponding to the emotional states a customer may exhibit in a service-interaction scenario. That is, when the audio stream to be identified is a customer-agent interaction stream, if the current speech segment is an agent speech segment, the emotion classes may include: satisfied, calm and annoyed; if the current speech segment is a customer speech segment, the emotion classes may include: satisfied, calm, annoyed and angry. Classifying customer and agent emotions in this way keeps the scheme simple enough for a call center system, reducing the amount of computation while meeting the emotion recognition needs of the call center system. It should be understood, however, that the types and number of emotion classes may be adjusted according to the actual application scenario; the invention does not strictly limit the types and number of emotion classes either.
Step 103: take the emotion class corresponding to the matching emotion feature model as the emotion class of the speech segment.

As described above, since there is a correspondence between emotion feature models and emotion classes, once the matching emotion feature model has been determined by the matching process of step 102, the emotion class corresponding to that model is the identified emotion class. For example, when the emotion feature models are Gaussian mixture models, the matching process can be realized by computing the likelihoods of the audio feature vector of the current speech segment under the respective emotion feature models, and then taking the emotion class of the model whose likelihood is greatest and exceeds a preset threshold as the emotion class of the speech segment.
It can be seen that the voice information processing method provided by embodiments of the invention extracts the audio feature vector of a speech segment in the audio stream to be identified and matches it against pre-established emotion feature models, thereby achieving real-time emotion identification of speech segments. In an application scenario such as a call center system, the emotional states of agents and customers can thus be monitored in real time during service calls, markedly improving the service quality of the enterprise using the call center system and the service experience of its customers.
It is also understood that the mood classification that voice information processing method based on the embodiment of the present invention is identified,
Specific scene demand can be also further cooperated to realize more flexible secondary applications.It in an embodiment of the present invention, can be real-time
Show the mood classification of the sound bite currently identified, specific real-time display mode can be adjusted according to actual scene demand
It is whole.For example, can be classified with the different colours of signal lamp to characterize different moods, blue lamp represents " satisfaction ", and green light represents " flat
It is quiet ", amber light represents " agitation ", and red light represents " anger ".In this way according to the variation of signal lamp color, customer service can be reminded in real time
Personnel and quality inspection personnel are conversed locating emotional state at present.In another embodiment, the institute in also statistics available preset time period
The mood of the sound bite identified is classified, such as the audio of calling record is numbered, the starting point and end point of sound bite
Timestamp and Emotion identification result record, ultimately form an Emotion identification data bank, and count a period of time
The number and probability that interior various moods occur, make curve graph or table, judge contact staff's clothes in a period of time for enterprise
The reference frame for quality of being engaged in.In another embodiment, it can also send in real time and the classification pair of the mood of the sound bite identified
The mood response message answered, this is applicable to prosthetic machine customer service scene on duty.For example, current call ought be identified in real time
When middle user has been in " anger " state, then automatically reply that user is corresponding with " anger " state to pacify language, to calm down use
Family mood achievees the purpose that continue to link up.It can be by learning in advance as the corresponding relationship between mood classification and mood response message
Habit process pre-establishes.
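As a minimal illustration of the last embodiment (all names and reply texts below are hypothetical, not taken from the patent), the pre-established correspondence between mood classifications and mood response messages can be as simple as a lookup table, with the recognizer supplying the key:

```python
# Hypothetical lookup table linking each mood classification to an
# automatic soothing reply for an unattended machine customer-service
# scenario; class names and texts are illustrative, not from the patent.
RESPONSES = {
    "satisfied": "Glad to hear that! Is there anything else I can help with?",
    "calm": "Understood. Let me walk you through the next step.",
    "agitated": "Sorry for the trouble; let me sort this out right away.",
    "angry": "I sincerely apologize. I am escalating this to a human agent now.",
}

def respond_to_mood(mood: str) -> str:
    """Return the pre-established reply for a recognized mood classification."""
    # Fall back to the neutral reply for an unrecognized classification.
    return RESPONSES.get(mood, RESPONSES["calm"])
```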
In an embodiment of the present invention, before the audio feature vector of a sound bite in the audio stream to be identified is extracted, the sound bite first needs to be extracted from the audio stream to be identified, so that subsequent emotion recognition can be performed in units of sound bites. The extraction process may be performed in real time.
Fig. 3 is a schematic flowchart of extracting a sound bite in the voice information processing method provided by an embodiment of the present invention. As shown in Fig. 3, the method for extracting the sound bite includes:
Step 301: determining a voice start frame and a voice end frame in the audio stream to be identified.
The voice start frame is the start frame of a sound bite, and the voice end frame is the end frame of a sound bite. Once the voice start frame and the voice end frame have been determined, the portion between them is the sound bite to be extracted.
Step 302: extracting the audio stream portion between the voice start frame and the voice end frame as the sound bite.
In an embodiment of the present invention, as shown in Fig. 4, the voice start frame and the voice end frame in the audio stream to be identified may be determined through the following steps:
Step 401: judging whether each speech frame in the audio stream to be identified is a pronunciation frame or a non-vocal frame.
In an embodiment of the present invention, this judgment may be realized based on a voice activity detection (VAD) decision parameter and a power-spectrum mean value, as shown in Fig. 5 and detailed below:
Step 4011: pre-processing the audio stream to be identified by framing, windowing and pre-emphasis. A Hamming window may be used as the window function, and the pre-emphasis coefficient may be 0.97. The pre-processed k-th frame signal is denoted x(k) = [x(k·N), x(k·N+1), ..., x(k·N+N-1)], where N is the frame length, e.g. 256. It should be appreciated, however, that whether pre-processing is needed, and which pre-processing steps are needed, may depend on the actual scenario; the present invention places no limitation on this.
Step 4012: performing a discrete Fourier transform (DFT) on the pre-processed k-th frame signal x(k) and calculating its power spectrum, with the DFT length taken equal to the frame length:
P(k, j) = |FFT(x(k))|², j = 0, 1, ..., N-1;
where j is the index of the frequency bin.
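Steps 4011 and 4012 can be sketched as follows. This is a minimal reconstruction that assumes non-overlapping frames of length N, as the notation x(k) = [x(k·N), ...] suggests; the function name and frame layout are illustrative:

```python
import numpy as np

def frame_power_spectra(signal, frame_len=256, preemph=0.97):
    """Pre-emphasize, split into frames, apply a Hamming window, and return
    the per-frame power spectrum P(k, j) = |DFT(x(k))|**2.

    Frame length and pre-emphasis coefficient follow the values suggested
    in the text (N = 256, coefficient 0.97)."""
    # Pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Non-overlapping frames of length N, as in x(k) = [x(k*N), ..., x(k*N+N-1)]
    n_frames = len(emphasized) // frame_len
    frames = emphasized[:n_frames * frame_len].reshape(n_frames, frame_len)
    windowed = frames * np.hamming(frame_len)
    # DFT length equal to the frame length, as the text specifies
    return np.abs(np.fft.fft(windowed, n=frame_len, axis=1)) ** 2
```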
Step 4013: calculating the posterior signal-to-noise ratio γ and the prior signal-to-noise ratio ξ:
γ(k, j) = P(k, j) / λ(k, j);
ξ(k, j) = α·ξ(k-1, j) + (1-α)·max(γ(k, j) - 1, 0);
Here the coefficient α = 0.98; λ is the background-noise power spectrum, which may be initialized to the arithmetic mean of the power spectra of the first 5 to 10 frames; min(·) and max(·) are the minimum and maximum functions respectively; the prior signal-to-noise ratio ξ(k, j) may be initialized to 0.98.
Step 4014: calculating the likelihood-ratio parameter η; in the standard statistical VAD formulation this per-bin likelihood ratio is
η(k, j) = exp(γ(k, j)·ξ(k, j) / (1 + ξ(k, j))) / (1 + ξ(k, j)).
Step 4015: calculating the VAD decision parameter Γ and the power-spectrum mean value ρ; Γ(k) may be taken as the geometric mean of the per-bin likelihood ratios, Γ(k) = exp((1/N)·Σⱼ log η(k, j)), and ρ(k) as the mean of the power spectrum, ρ(k) = (1/N)·Σⱼ P(k, j). The VAD decision parameter may be initialized to 1.
Step 4016: judging whether the VAD decision parameter Γ(k) of the k-th frame signal is greater than or equal to a first preset VAD threshold, and whether ρ(k) is greater than or equal to a preset power-mean threshold. In an embodiment of the present invention, the first preset VAD threshold may be 5 and the preset power-mean threshold may be 0.01.
Step 4017: if both judgments in step 4016 are affirmative, determining the k-th frame audio signal to be a pronunciation frame.
Step 4018: if at least one of the two judgments in step 4016 is negative, determining the k-th frame audio signal to be a mute (non-vocal) frame, and executing step 4019.
Step 4019: updating the noise power spectrum λ according to the following formula:
λ(k+1, j) = β·λ(k, j) + (1-β)·P(k, j);
where the coefficient β is a smoothing factor and may take the value 0.98.
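Steps 4013 to 4019 can be sketched as a single per-frame update. Since the patent's own formula images for η, Γ and ρ are not reproduced in this text, the likelihood-ratio expressions below follow the standard statistical VAD and should be read as a plausible reconstruction, not the exact claimed formulas:

```python
import numpy as np

ALPHA, BETA = 0.98, 0.98   # smoothing coefficients alpha and beta from the text
VAD_THRESHOLD = 5.0        # first preset VAD threshold
POWER_THRESHOLD = 0.01     # preset power-mean threshold

def vad_step(P_k, lam, xi_prev):
    """One per-frame pronunciation/non-vocal decision (steps 4013-4019).

    P_k: power spectrum of frame k; lam: background-noise estimate lambda;
    xi_prev: prior SNR of the previous frame. Returns (is_speech, lam, xi)."""
    gamma = P_k / lam                                   # step 4013: posterior SNR
    xi = ALPHA * xi_prev + (1 - ALPHA) * np.maximum(gamma - 1.0, 0.0)
    # step 4014: per-bin log likelihood ratio (standard statistical VAD form)
    log_eta = gamma * xi / (1.0 + xi) - np.log1p(xi)
    # step 4015: decision parameter Gamma as the geometric mean over bins,
    # compared in the log domain to avoid overflow; rho is the power mean
    log_vad_param = log_eta.mean()
    rho = P_k.mean()
    # steps 4016-4018: pronunciation frame iff both thresholds are met
    is_speech = log_vad_param >= np.log(VAD_THRESHOLD) and rho >= POWER_THRESHOLD
    if not is_speech:                                   # step 4019: track noise in silence
        lam = BETA * lam + (1 - BETA) * P_k
    return is_speech, lam, xi
```

Calling `vad_step` frame by frame, carrying `lam` and `xi` forward, reproduces the continuous-cycling behavior described for Fig. 5.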
It can be seen that, by continuously cycling through the method steps shown in Fig. 5, the pronunciation frames and non-vocal frames in the audio stream to be identified can be detected in real time. The recognition results for these frames form the basis for subsequently identifying the voice start frame and the voice end frame.
Step 402: after the voice end frame of the previous sound bite, or when no sound bite has yet been identified, once a first preset number of speech frames have been consecutively judged to be pronunciation frames, taking the first of those speech frames as the voice start frame of the current sound bite.
In an embodiment of the present invention, two flags, flag_start and flag_end, may first be set as detection-state variables for the voice start frame and the voice end frame respectively, with true and false representing occurrence and non-occurrence. When flag_end = true, the end frame of a sound bite has been determined, and detection of the start frame of the next sound bite begins. When the VAD decision parameters of 30 consecutive frame signals are all greater than or equal to a second preset threshold, those 30 frames have entered a sound bite; the first speech frame of the 30 is then taken as the voice start frame and flag_start = true; otherwise flag_start = false.
Step 403: after the voice start frame of the current sound bite, once a second preset number of speech frames have been consecutively judged to be non-vocal frames, those speech frames no longer belong to the sound bite; the first of them is then taken as the voice end frame of the current sound bite.
Specifically, continuing the example above, when flag_start = true, the voice start frame of a sound bite has been determined, and checking for the end frame of the current sound bite begins. When the VAD decision parameters of 30 consecutive frame signals are all less than a third preset threshold, the current sound bite is determined to have ended: flag_end = true, and the first frame of those 30 frames is the voice end frame; otherwise flag_end = false.
In an embodiment of the present invention, in order to further improve the accuracy of the voice start frame and voice end frame decisions and avoid false judgments, the second preset threshold and the third preset threshold may both be made greater than the first preset threshold used in the pronunciation-frame identification process; for example, the second preset threshold may be 40 and the third preset threshold may be 20.
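The start/end-frame logic of steps 402 and 403 amounts to a small state machine over the per-frame pronunciation decisions. A minimal sketch follows (the function name is hypothetical; the run length of 30 is taken from the text):

```python
def find_sound_bites(frame_is_speech, run_len=30):
    """Return (start, end) frame-index pairs for each sound bite.

    A bite starts at the first frame of `run_len` consecutive pronunciation
    frames (step 402) and ends at the first frame of `run_len` consecutive
    non-vocal frames (step 403)."""
    segments = []
    start = None              # voice start frame of the current bite, if any
    looking_for_speech = True # True: seeking a start frame; False: an end frame
    run_start, run = 0, 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech == looking_for_speech:
            if run == 0:
                run_start = i              # candidate start/end frame
            run += 1
            if run == run_len:             # run of 30 confirmed
                if looking_for_speech:
                    start = run_start      # voice start frame (flag_start = true)
                else:
                    segments.append((start, run_start))  # voice end frame (flag_end = true)
                    start = None
                looking_for_speech = not looking_for_speech
                run = 0
        else:
            run = 0                        # run broken, restart counting
    return segments
```

For example, a stream of 5 silent frames, 40 speech frames and 35 silent frames yields one sound bite starting at frame 5 and ending at frame 45.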
It can be seen that the method steps shown in Fig. 4 can determine the voice start frame and the voice end frame in the audio stream to be identified, so that the sound bite between them can be extracted for emotion recognition.
It should be appreciated that the process of determining the voice start frame and the voice end frame may be performed in real time or not; the present invention places no limitation on when the method steps shown in Fig. 4 are executed.
It should also be appreciated that, although the descriptions of the embodiments of Figs. 4 and 5 introduce certain calculation coefficients, initial parameter values and judgment thresholds, these coefficients, initial values and thresholds may all be adjusted according to the actual application scenario; the present invention places no limitation on their values.
An embodiment of the present invention further provides a computer device, including a memory, a processor and a computer program stored on the memory and executed by the processor, wherein the processor, when executing the computer program, implements the voice information processing method of any of the preceding embodiments.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice information processing method of any of the preceding embodiments. The computer storage medium may be any tangible medium, such as a floppy disk, CD-ROM, DVD or hard-disk drive, or even a network medium.
It should be appreciated that, although one form of implementation described above is a computer program product, the methods or apparatuses of the embodiments of the present invention may be realized in software, hardware, or a combination of software and hardware. The hardware portion may be realized using dedicated logic; the software portion may be stored in a memory and executed by an appropriate instruction-execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will understand that the above methods and devices may be realized using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a disk, CD or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. The methods and apparatuses of the present invention may be realized by hardware circuitry such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of the above hardware circuitry and software, such as firmware.
It should be appreciated that the determiners "first", "second" and "third" used in the description of the embodiments of the present invention serve only to state the technical solutions more clearly and cannot be used to limit the protection scope of the present invention.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (11)
1. A voice information processing method, characterized by comprising:
extracting an audio feature vector of a sound bite in an audio stream to be identified, wherein the sound bite corresponds to a segment of speech in the audio stream to be identified;
matching the audio feature vector of the sound bite against a plurality of emotional characteristics models, wherein the plurality of emotional characteristics models respectively correspond to one of a plurality of mood classifications, and are established through pre-learning from the respective audio feature vectors of a plurality of preset sound bites carrying the mood classification labels of the plurality of mood classifications; and
taking the mood classification corresponding to the matching emotional characteristics model as the mood classification of the sound bite;
wherein the audio feature vector includes one or more of the following audio features: an energy feature, a pronunciation frame number feature, a fundamental frequency feature, a formant feature, a harmonic-to-noise-ratio feature and a mel cepstrum coefficient feature; the sound bite includes a customer-service input voice segment in the audio stream to be identified; and the plurality of mood classifications include: a satisfied classification, a calm classification and an agitated classification.
2. The voice information processing method according to claim 1, wherein the pre-learning process includes:
clustering the respective audio feature vectors of the plurality of preset sound bites carrying the mood classification labels of the plurality of mood classifications, to obtain clustering results for the preset mood classifications; and
according to the clustering results, training the audio feature vectors of the preset sound bites in each cluster into one of the emotional characteristics models.
3. The voice information processing method according to claim 1, wherein, when the emotional characteristics models are Gaussian mixture models, the matching of the audio feature vector of the sound bite against the plurality of emotional characteristics models includes:
calculating the likelihood probability of the audio feature vector of the sound bite under each of the plurality of emotional characteristics models respectively;
and wherein the taking of the mood classification corresponding to the matching emotional characteristics model as the mood classification of the sound bite includes:
taking the mood classification corresponding to the emotional characteristics model whose likelihood probability is greatest and greater than a preset threshold as the mood classification of the sound bite.
4. The voice information processing method according to claim 1, further comprising, before extracting the audio feature vector of the sound bite in the audio stream to be identified:
determining a voice start frame and a voice end frame in the audio stream to be identified; and
extracting the audio stream portion between the voice start frame and the voice end frame as the sound bite.
5. The voice information processing method according to claim 4, wherein the determining of the voice start frame and the voice end frame in the audio stream to be identified includes:
judging whether each speech frame in the audio stream to be identified is a pronunciation frame or a non-vocal frame;
after the voice end frame of a previous sound bite, or when no sound bite has yet been identified, once a first preset number of speech frames have been consecutively judged to be pronunciation frames, taking the first speech frame of the first preset number of speech frames as the voice start frame of the current sound bite; and
after the voice start frame of the current sound bite, once a second preset number of speech frames have been consecutively judged to be non-vocal frames, taking the first speech frame of the second preset number of speech frames as the voice end frame of the current sound bite.
6. The voice information processing method according to claim 1, wherein the energy feature includes: a short-term energy first-order difference, and/or the energy below a preset frequency; and/or
the fundamental frequency feature includes: a fundamental frequency and/or a fundamental frequency first-order difference; and/or
the formant feature includes one or more of the following: a first formant, a second formant, a third formant, a first formant first-order difference, a second formant first-order difference and a third formant first-order difference; and/or
the mel cepstrum coefficient feature includes 1st- to 12th-order mel cepstrum coefficients and/or a first-order difference of the 1st- to 12th-order mel cepstrum coefficients.
7. The voice information processing method according to claim 1, wherein each audio feature is characterized by one or more of the following calculation manners: a ratio value, a mean value, a maximum value, a median value and a standard deviation.
8. The voice information processing method according to claim 1, wherein the energy feature includes: the mean value, maximum value, median value and standard deviation of a short-term energy first-order difference, and/or the ratio value of the energy below a preset frequency to the total energy; and/or
the pronunciation frame number feature includes: the ratio value of the number of pronunciation frames to the number of mute frames, and/or the ratio value of the number of pronunciation frames to the total number of frames; and/or
the fundamental frequency feature includes: the mean value, maximum value, median value and standard deviation of the fundamental frequency, and/or the mean value, maximum value, median value and standard deviation of a fundamental frequency first-order difference; and/or
the formant feature includes one or more of the following: the mean value, maximum value, median value and standard deviation of the first formant; the mean value, maximum value, median value and standard deviation of the second formant; the mean value, maximum value, median value and standard deviation of the third formant; the mean value, maximum value, median value and standard deviation of the first formant first-order difference; the mean value, maximum value, median value and standard deviation of the second formant first-order difference; and the mean value, maximum value, median value and standard deviation of the third formant first-order difference; and/or
the mel cepstrum coefficient feature includes the mean values, maximum values, median values and standard deviations of the 1st- to 12th-order mel cepstrum coefficients, and/or the mean values, maximum values, median values and standard deviations of the 1st- to 12th-order mel cepstrum coefficient first-order differences.
9. The voice information processing method according to claim 1, further comprising:
displaying the mood classification of the currently identified sound bite; and/or
collecting statistics on the mood classifications of the sound bites identified within a preset time period; and/or
sending a mood response message corresponding to the mood classification of the identified sound bite.
10. A computer device, including a memory, a processor and a computer program stored on the memory and executed by the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711363536.XA CN109935241A (en) | 2017-12-18 | 2017-12-18 | Voice information processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109935241A true CN109935241A (en) | 2019-06-25 |
Family
ID=66982427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711363536.XA Pending CN109935241A (en) | 2017-12-18 | 2017-12-18 | Voice information processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109935241A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06236195A (en) * | 1993-02-12 | 1994-08-23 | Sony Corp | Method for detecting sound section |
US5937375A (en) * | 1995-11-30 | 1999-08-10 | Denso Corporation | Voice-presence/absence discriminator having highly reliable lead portion detection |
US6151571A (en) * | 1999-08-31 | 2000-11-21 | Andersen Consulting | System, method and article of manufacture for detecting emotion in voice signals through analysis of a plurality of voice signal parameters |
US20100191524A1 (en) * | 2007-12-18 | 2010-07-29 | Fujitsu Limited | Non-speech section detecting method and non-speech section detecting device |
CN101930735A (en) * | 2009-06-23 | 2010-12-29 | 富士通株式会社 | Speech emotion recognition equipment and speech emotion recognition method |
CN103531198A (en) * | 2013-11-01 | 2014-01-22 | 东南大学 | Speech emotion feature normalization method based on pseudo speaker clustering |
US20160027452A1 (en) * | 2014-07-28 | 2016-01-28 | Sony Computer Entertainment Inc. | Emotional speech processing |
CN106570496A (en) * | 2016-11-22 | 2017-04-19 | 上海智臻智能网络科技股份有限公司 | Emotion recognition method and device and intelligent interaction method and device |
Non-Patent Citations (1)
Title |
---|
Li Jia et al.: "Dimensional Feature Extraction and Recognition of Speech Emotion", Journal of Data Acquisition and Processing *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334240A (en) * | 2019-07-08 | 2019-10-15 | 联想(北京)有限公司 | Information processing method, system and the first equipment, the second equipment |
CN110399837A (en) * | 2019-07-25 | 2019-11-01 | 深圳智慧林网络科技有限公司 | User emotion recognition methods, device and computer readable storage medium |
CN110399837B (en) * | 2019-07-25 | 2024-01-05 | 深圳智慧林网络科技有限公司 | User emotion recognition method, device and computer readable storage medium |
CN112435691A (en) * | 2020-10-12 | 2021-03-02 | 珠海亿智电子科技有限公司 | On-line voice endpoint detection post-processing method, device, equipment and storage medium |
CN112435691B (en) * | 2020-10-12 | 2024-03-12 | 珠海亿智电子科技有限公司 | Online voice endpoint detection post-processing method, device, equipment and storage medium |
CN113724735A (en) * | 2021-09-01 | 2021-11-30 | 广州博冠信息科技有限公司 | Voice stream processing method and device, computer readable storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108122552A (en) | Voice mood recognition methods and device | |
US11373641B2 (en) | Intelligent interactive method and apparatus, computer device and computer readable storage medium | |
CN109961803A (en) | Voice mood identifying system | |
CN109961776A (en) | Speech information processing apparatus | |
US10878823B2 (en) | Voiceprint recognition method, device, terminal apparatus and storage medium | |
US10896428B1 (en) | Dynamic speech to text analysis and contact processing using agent and customer sentiments | |
WO2021128741A1 (en) | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium | |
US9536547B2 (en) | Speaker change detection device and speaker change detection method | |
CN110085262A (en) | Voice mood exchange method, computer equipment and computer readable storage medium | |
US20200195779A1 (en) | System and method for performing agent behavioral analytics | |
CN110085221A (en) | Speech emotional exchange method, computer equipment and computer readable storage medium | |
CN109935241A (en) | Voice information processing method | |
CN110085211A (en) | Speech recognition exchange method, device, computer equipment and storage medium | |
EP2363852A1 (en) | Computer-based method and system of assessing intelligibility of speech represented by a speech signal | |
CN110085220A (en) | Intelligent interaction device | |
US20160267924A1 (en) | Speech detection device, speech detection method, and medium | |
Black et al. | Automatic classification of married couples' behavior using audio features | |
US11837236B2 (en) | Speaker recognition based on signal segments weighted by quality | |
CN112966082A (en) | Audio quality inspection method, device, equipment and storage medium | |
Jiao et al. | Convex weighting criteria for speaking rate estimation | |
Yücesoy et al. | A new approach with score-level fusion for the classification of a speaker age and gender | |
CN109935240A (en) | Pass through the method for speech recognition mood | |
Rumagit et al. | Model comparison in speech emotion recognition for Indonesian language | |
Shah et al. | Speech emotion recognition based on SVM using MATLAB | |
US9697825B2 (en) | Audio recording triage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190625 |