CN108010516A - Semantic independent speech emotion feature recognition method and device - Google Patents
- Publication number
- CN108010516A CN108010516A CN201711258175.2A CN201711258175A CN108010516A CN 108010516 A CN108010516 A CN 108010516A CN 201711258175 A CN201711258175 A CN 201711258175A CN 108010516 A CN108010516 A CN 108010516A
- Authority
- CN
- China
- Prior art keywords
- preset
- mood
- features
- sound spectrum
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention discloses a semantics-independent speech emotion feature recognition method and device. The method judges the speaker's emotion directly, without relying on semantics: it matches PCM data against the sound spectrum features, prosodic features, and voice quality features stored in an emotion database and determines the emotion category corresponding to the PCM data according to the matching degree. Extracting these physical features is simple and convenient, the processing is efficient and fast, and comprehensively matching several classes of speech features enables accurate recognition of emotional features. This solves the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
Description
Technical field
The present invention relates to the field of audio recognition, and more particularly to a semantics-independent speech emotion feature recognition method and device.
Background art
With the deep integration of computer technology into daily life, people are no longer content with computers that merely perform audio recognition to confirm the speaker's identity and transcribe speech; they expect computers to be more intelligent and to recognize higher-level information such as semantics and emotion.
Emotional information is a very important information resource in speech. Unlike speech recognition, an emotion recognition system is more concerned with how the speaker speaks: the deeper tone and attitude hidden beneath the surface of the words, which can be regarded as a layer of information concealed in the speech signal.
In fact, in human communication, the same speaker saying the same words with different moods can convey entirely different meanings.
In traditional intelligent speech data analysis, however, emotional information is treated as mere variation between individuals, so this very valuable information is lost.
At present, speech emotion recognition is mostly implemented by combining speech recognition with other recognition methods such as facial expression recognition and semantic recognition. But combining multiple recognition methods for emotion recognition not only makes the processing complex and hard to implement, it also requires processing methods such as image and video processing, and the processing time is long. This results in the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
Summary of the invention
The present invention provides a semantics-independent speech emotion feature recognition method and device, which solve the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
The present invention provides a semantics-independent speech emotion feature recognition method, including:
S1: obtaining the PCM data in an audio file in WAV format;
S2: performing speech feature extraction on the PCM data to obtain sound spectrum features, prosodic features, and voice quality features of the PCM data;
S3: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in an emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
Preferably, step S3 specifically includes:
S301: obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database;
S302: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database;
S303: computing, for each emotion category, the weighted average of the matching degrees between the features of the PCM data and the corresponding preset features, using the preset weights of the preset sound spectrum features, preset prosodic features, and preset voice quality features; taking the weighted average as the matching degree and outputting the emotion category with the highest matching degree.
Preferably, the sound spectrum features specifically include: MFCC features and GFCC features.
Preferably, the prosodic features specifically include: pitch features, short-term energy features, zero-crossing rate (ZCR) features, and speech rate features.
Preferably, the voice quality features specifically include: formant features.
The present invention also provides a semantics-independent speech emotion feature recognition device, including:
an audio acquisition module, for obtaining the PCM data in an audio file in WAV format;
a feature extraction module, for performing speech feature extraction on the PCM data to obtain the sound spectrum features, prosodic features, and voice quality features of the PCM data;
a matching output module, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in an emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
Preferably, the matching output module specifically includes:
a weights submodule, for obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database;
a matching submodule, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database;
an output submodule, for computing, for each emotion category, the weighted average of the matching degrees between the features of the PCM data and the corresponding preset features using the preset weights, taking the weighted average as the matching degree, and outputting the emotion category with the highest matching degree.
Preferably, the sound spectrum features specifically include: MFCC features and GFCC features.
Preferably, the prosodic features specifically include: pitch features, short-term energy features, zero-crossing rate (ZCR) features, and speech rate features.
Preferably, the voice quality features specifically include: formant features.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
The present invention provides a semantics-independent speech emotion feature recognition method, including: S1: obtaining the PCM data in an audio file in WAV format; S2: performing speech feature extraction on the PCM data to obtain sound spectrum features, prosodic features, and voice quality features of the PCM data; S3: pattern-matching these features against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in an emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
The present invention can judge the speaker's emotion directly, without relying on semantics: the PCM data are matched against the sound spectrum features, prosodic features, and voice quality features in the emotion database, and the emotion category corresponding to the PCM data is determined according to the matching degree. The method of extracting these physical features is simple and convenient, the processing is efficient and fast, and the comprehensive matching of several classes of speech features enables accurate recognition of emotional features. This solves the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of one embodiment of a semantics-independent speech emotion feature recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another embodiment of a semantics-independent speech emotion feature recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of one embodiment of a semantics-independent speech emotion feature recognition device provided by an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention provide a semantics-independent speech emotion feature recognition method and device, which solve the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
To make the objects, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, an embodiment of the present invention provides one embodiment of a semantics-independent speech emotion feature recognition method, including:
Step 101: obtaining the PCM data in an audio file in WAV format;
It should be noted that, in practical applications, it is necessary first to obtain the PCM data in the WAV-format audio file and load the PCM data directly into memory, so that the subsequent steps can proceed.
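A minimal sketch of this loading step, assuming Python with the standard-library wave module and NumPy; the file name and the 16-bit sample width are illustrative assumptions, and real code should check wf.getsampwidth():

```python
# Load the raw PCM samples of a WAV file directly into memory.
import wave
import numpy as np

with wave.open("input.wav", "rb") as wf:   # hypothetical input file
    sample_rate = wf.getframerate()
    n_channels = wf.getnchannels()
    pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

if n_channels > 1:                         # interleaved samples -> keep first channel
    pcm = pcm[::n_channels]
print(sample_rate, pcm.shape)
```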
Step 102: performing speech feature extraction on the PCM data to obtain the sound spectrum features, prosodic features, and voice quality features of the PCM data;
It should be noted that, after obtaining the PCM data in the WAV-format audio file, speech feature extraction must be performed on the PCM data to obtain its sound spectrum features, prosodic features, and voice quality features.
For accuracy, features can be extracted along each dimension of the various speech features to form a vector of more than 100 dimensions for the subsequent pattern matching, as sketched below.
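For illustration, such a vector could be assembled as follows, assuming the librosa library; the particular features, dimensions, and summary statistics are illustrative choices, the patent only stating that the vector exceeds 100 dimensions.

```python
# Assemble a high-dimensional feature vector from several speech features.
import librosa
import numpy as np

y, sr = librosa.load("input.wav", sr=None)           # hypothetical input file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # sound spectrum features
zcr = librosa.feature.zero_crossing_rate(y)          # prosodic: zero-crossing rate
rms = librosa.feature.rms(y=y)                       # prosodic: short-term energy

# Summarize each frame-level feature by its mean and standard deviation,
# giving a 40*2 + 2 + 2 = 84-dimensional base that grows past 100 once
# pitch, speech rate, GFCC, and formant statistics are appended.
vector = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [zcr.mean(), zcr.std()],
    [rms.mean(), rms.std()],
])
print(vector.shape)
```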
Step 103: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
It should be noted that this embodiment determines the emotion category corresponding to the PCM data according to the matching degree, by matching the PCM data against the sound spectrum features, prosodic features, and voice quality features in the emotion database. The method of extracting these physical features is simple and convenient, the processing is efficient and fast, and the comprehensive matching of several classes of speech features enables accurate recognition of emotional features. It improves the flexibility, convenience, rigor, and efficiency of emotion recognition, can better adapt to the future needs of intelligent hardware, and supports complete and rapid configuration as intelligent hardware grows in complexity. This solves the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, excessive dependence on semantics, and long processing time.
The above is one embodiment of the semantics-independent speech emotion feature recognition method provided by an embodiment of the present invention; another embodiment of the method follows.
Referring to Fig. 2, an embodiment of the present invention provides another embodiment of a semantics-independent speech emotion feature recognition method, including:
Step 201: obtaining the PCM data in an audio file in WAV format;
Step 202: performing speech feature extraction on the PCM data to obtain the sound spectrum features, prosodic features, and voice quality features of the PCM data;
Step 203: obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database;
Step 204: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database;
Step 205: computing, for each emotion category, the weighted average of the matching degrees between the sound spectrum features, prosodic features, and voice quality features of the PCM data and the corresponding preset features in the emotion database, using the preset weights of the preset sound spectrum features, preset prosodic features, and preset voice quality features; taking the weighted average as the matching degree and outputting the emotion category with the highest matching degree.
It should be noted that the matching degree can be computed by a weighted average, a neural network model, a clustering algorithm, or similar means; computing it by weighted average is only one embodiment.
The weighted average of the matching degree is computed as follows:
P = A*a + B*b + C*c
where P is the overall matching degree; A is the matching degree between the sound spectrum features of the PCM data and the preset sound spectrum features; B is the matching degree between the prosodic features of the PCM data and the preset prosodic features; C is the matching degree between the voice quality features of the PCM data and the preset voice quality features; a is the preset weight corresponding to the preset sound spectrum features; b is the preset weight corresponding to the preset prosodic features; and c is the preset weight corresponding to the preset voice quality features.
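A direct implementation of this formula might look as follows; the example weights and matching degrees are illustrative values only.

```python
# Weighted average P = A*a + B*b + C*c of the per-feature-group matching degrees.
def weighted_matching_degree(A, B, C, a, b, c):
    """A: sound spectrum match, B: prosody match, C: voice quality match;
    a, b, c: the corresponding preset weights."""
    return A * a + B * b + C * c

# Example: spectrum matches strongly, prosody moderately, voice quality weakly.
P = weighted_matching_degree(A=0.9, B=0.6, C=0.4, a=0.5, b=0.3, c=0.2)
print(P)  # 0.9*0.5 + 0.6*0.3 + 0.4*0.2 = 0.71
```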
Further, the sound spectrum features specifically include: MFCC features and GFCC features.
It should be noted that MFCC is the abbreviation of Mel-frequency cepstral coefficients.
The Mel frequency scale is derived from characteristics of human hearing and has a nonlinear correspondence with frequency in Hz; Mel-frequency cepstral coefficients (MFCC) are spectral features computed from the Hz spectrum using this relation.
GFCC features are auditory features based on a Gammatone filter bank.
Further, the prosodic features specifically include: pitch features, short-term energy features, ZCR features, and speech rate features.
It should be noted that pitch features are related to the fundamental frequency of the sound and reflect pitch information;
short-term energy features measure the energy of the signal over short frames;
ZCR (zero-crossing rate) features measure the rate at which the sign of a signal changes, for example from positive to negative or back, and are a key feature for classifying percussive sounds;
speed features describe the speaking rate.
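As an illustration, short-term energy and ZCR can be computed per frame with plain NumPy as sketched below; the frame length and hop size are illustrative values.

```python
# Frame-level short-term energy and zero-crossing rate.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_term_energy(x):
    frames = frame_signal(np.asarray(x, dtype=float))
    return (frames ** 2).sum(axis=1)            # energy of each frame

def zero_crossing_rate(x):
    frames = frame_signal(np.asarray(x, dtype=float))
    signs = np.sign(frames)
    # fraction of adjacent sample pairs whose sign changes within each frame
    return (np.abs(np.diff(signs, axis=1)) > 0).mean(axis=1)

pcm = np.sin(np.linspace(0, 100 * np.pi, 16000))  # 1 s synthetic test tone
print(short_term_energy(pcm)[:3], zero_crossing_rate(pcm)[:3])
```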
Further, the voice quality features specifically include: formant features.
It should be noted that "Formants features" means formant features. Formants are regions of the sound spectrum where energy is relatively concentrated; they are not only a determinant of voice quality but also reflect the physical characteristics of the vocal tract (the resonant cavity).
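For illustration, formants are commonly estimated from a voiced frame via linear predictive coding (LPC), taking LPC polynomial roots near the unit circle as resonances; the sketch below assumes librosa for the LPC fit, and the order and thresholds are common illustrative choices rather than values from the patent.

```python
# Estimate formant frequencies of a voiced frame from LPC polynomial roots.
import librosa
import numpy as np

def estimate_formants(frame, sr, order=12):
    a = librosa.lpc(frame.astype(float), order=order)   # LPC coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # keep upper half-plane
    formants = []
    for r in roots:
        freq = np.angle(r) * sr / (2 * np.pi)
        bw = -(sr / np.pi) * np.log(np.abs(r))          # resonance bandwidth in Hz
        if freq > 90 and bw < 400:                      # keep sharp, non-DC resonances
            formants.append(freq)
    return sorted(formants)

# Synthetic vowel-like frame: two damped resonances near 700 Hz and 1200 Hz.
sr = 16000
t = np.arange(0, 0.03, 1 / sr)
frame = np.exp(-60 * t) * (np.sin(2 * np.pi * 700 * t) + np.sin(2 * np.pi * 1200 * t))
print(estimate_formants(frame, sr)[:2])  # roughly [700, 1200]
```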
This embodiment matches the PCM data against the sound spectrum features, prosodic features, and voice quality features in the emotion database and determines the emotion category corresponding to the PCM data according to the matching degree. The method of extracting these physical features is simple and convenient, and the processing is efficient and fast;
by using the comprehensive matching of several classes of speech features at the same time, accurate recognition of emotional features can be achieved;
the present invention improves the flexibility, convenience, rigor, and efficiency of emotion recognition, can better adapt to the future needs of intelligent hardware, and supports complete and rapid configuration as intelligent hardware grows in complexity;
it solves the technical problems of current speech emotion recognition: complex processing, high implementation difficulty, and long processing time.
The above is another embodiment of the semantics-independent speech emotion feature recognition method provided by an embodiment of the present invention; an embodiment of the semantics-independent speech emotion feature recognition device follows.
Referring to Fig. 3, an embodiment of the present invention provides one embodiment of a semantics-independent speech emotion feature recognition device, including:
an audio acquisition module 301, for obtaining the PCM data in an audio file in WAV format;
a feature extraction module 302, for performing speech feature extraction on the PCM data to obtain the sound spectrum features, prosodic features, and voice quality features of the PCM data;
a matching output module 303, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
Further, the matching output module 303 specifically includes:
a weights submodule 3031, for obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database;
a matching submodule 3032, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database;
an output submodule 3033, for computing, for each emotion category, the weighted average of the matching degrees between the sound spectrum features, prosodic features, and voice quality features of the PCM data and the corresponding preset features in the emotion database, using the preset weights; taking the weighted average as the matching degree and outputting the emotion category with the highest matching degree.
Further, the sound spectrum features specifically include: MFCC features and GFCC features.
Further, the prosodic features specifically include: pitch features, short-term energy features, ZCR features, and speech rate features.
Further, the voice quality features specifically include: formant features.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the device and modules described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely schematic: the division into modules is only a division of logical functions, and other divisions are possible in actual implementation; multiple modules or components may be combined or integrated into another system, and some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist physically on its own, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
- 1. A semantics-independent speech emotion feature recognition method, characterized by comprising: S1: obtaining the PCM data in an audio file in WAV format; S2: performing speech feature extraction on the PCM data to obtain sound spectrum features, prosodic features, and voice quality features of the PCM data; S3: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in an emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
- 2. The semantics-independent speech emotion feature recognition method according to claim 1, characterized in that step S3 specifically includes: S301: obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database; S302: pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database; S303: computing, for each emotion category, the weighted average of the matching degrees between the features of the PCM data and the corresponding preset features, using the preset weights of the preset sound spectrum features, preset prosodic features, and preset voice quality features; taking the weighted average as the matching degree and outputting the emotion category with the highest matching degree.
- 3. The semantics-independent speech emotion feature recognition method according to claim 1, characterized in that the sound spectrum features specifically include: MFCC features and GFCC features.
- 4. The semantics-independent speech emotion feature recognition method according to claim 1, characterized in that the prosodic features specifically include: pitch features, short-term energy features, ZCR features, and speech rate features.
- 5. The semantics-independent speech emotion feature recognition method according to claim 1, characterized in that the voice quality features specifically include: formant features.
- 6. A semantics-independent speech emotion feature recognition device, characterized by comprising: an audio acquisition module, for obtaining the PCM data in an audio file in WAV format; a feature extraction module, for performing speech feature extraction on the PCM data to obtain sound spectrum features, prosodic features, and voice quality features of the PCM data; a matching output module, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in an emotion database, and outputting the emotion category with the highest matching degree according to the result of the pattern matching.
- 7. The semantics-independent speech emotion feature recognition device according to claim 6, characterized in that the matching output module specifically includes: a weights submodule, for obtaining the preset weights corresponding to the preset sound spectrum features, preset prosodic features, and preset voice quality features in the emotion database; a matching submodule, for pattern-matching the sound spectrum features, prosodic features, and voice quality features of the PCM data against the preset sound spectrum features, preset prosodic features, and preset voice quality features corresponding to each emotion category in the emotion database; an output submodule, for computing, for each emotion category, the weighted average of the matching degrees between the features of the PCM data and the corresponding preset features using the preset weights, taking the weighted average as the matching degree, and outputting the emotion category with the highest matching degree.
- 8. The semantics-independent speech emotion feature recognition device according to claim 6, characterized in that the sound spectrum features specifically include: MFCC features and GFCC features.
- 9. The semantics-independent speech emotion feature recognition device according to claim 6, characterized in that the prosodic features specifically include: pitch features, short-term energy features, ZCR features, and speech rate features.
- 10. The semantics-independent speech emotion feature recognition device according to claim 6, characterized in that the voice quality features specifically include: formant features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711258175.2A CN108010516A (en) | 2017-12-04 | 2017-12-04 | Semantic independent speech emotion feature recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711258175.2A CN108010516A (en) | 2017-12-04 | 2017-12-04 | Semantic independent speech emotion feature recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108010516A true CN108010516A (en) | 2018-05-08 |
Family
ID=62056007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711258175.2A Pending CN108010516A (en) | 2017-12-04 | 2017-12-04 | Semantic independent speech emotion feature recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108010516A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101261832A (en) * | 2008-04-21 | 2008-09-10 | 北京航空航天大学 | Extraction and modeling method for Chinese speech sensibility information |
CN102737629A (en) * | 2011-11-11 | 2012-10-17 | 东南大学 | Embedded type speech emotion recognition method and device |
CN103854645A (en) * | 2014-03-05 | 2014-06-11 | 东南大学 | Speech emotion recognition method based on punishment of speaker and independent of speaker |
KR20150045967A (en) * | 2015-04-09 | 2015-04-29 | 이상민 | Algorithm that converts the voice data into emotion data |
CN105159979A (en) * | 2015-08-27 | 2015-12-16 | 广东小天才科技有限公司 | friend recommendation method and device |
CN107305773A (en) * | 2016-04-15 | 2017-10-31 | 美特科技(苏州)有限公司 | Voice mood discrimination method |
CN106297826A (en) * | 2016-08-18 | 2017-01-04 | 竹间智能科技(上海)有限公司 | Speech emotional identification system and method |
CN106448652A (en) * | 2016-09-12 | 2017-02-22 | 珠海格力电器股份有限公司 | Control method and device of air conditioner |
CN107221318A (en) * | 2017-05-12 | 2017-09-29 | 广东外语外贸大学 | Oral English Practice pronunciation methods of marking and system |
Non-Patent Citations (3)
Title |
---|
Zhang Hailong: "Research on Emotion Recognition Technology Based on Speech Signals", Journal of Yan'an University (Natural Science Edition) *
Cao Peng: "Research and Implementation of Speech Emotion Recognition Technology", China Master's Theses Full-text Database, Information Science and Technology *
Han Wenjing: "Research on Key Technologies of Speech Emotion Recognition", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108806667A (en) * | 2018-05-29 | 2018-11-13 | 重庆大学 | The method for synchronously recognizing of voice and mood based on neural network |
CN109087670A (en) * | 2018-08-30 | 2018-12-25 | 西安闻泰电子科技有限公司 | Mood analysis method, system, server and storage medium |
CN109087670B (en) * | 2018-08-30 | 2021-04-20 | 西安闻泰电子科技有限公司 | Emotion analysis method, system, server and storage medium |
CN110970113A (en) * | 2018-09-30 | 2020-04-07 | 宁波方太厨具有限公司 | Intelligent menu recommendation method based on user emotion |
CN110970113B (en) * | 2018-09-30 | 2023-04-14 | 宁波方太厨具有限公司 | Intelligent menu recommendation method based on user emotion |
CN110110135A (en) * | 2019-04-17 | 2019-08-09 | 西安极蜂天下信息科技有限公司 | Voice characteristics data library update method and device |
CN111182409A (en) * | 2019-11-26 | 2020-05-19 | 广东小天才科技有限公司 | Screen control method based on intelligent sound box, intelligent sound box and storage medium |
CN111182409B (en) * | 2019-11-26 | 2022-03-25 | 广东小天才科技有限公司 | Screen control method based on intelligent sound box, intelligent sound box and storage medium |
CN111583968A (en) * | 2020-05-25 | 2020-08-25 | 桂林电子科技大学 | Speech emotion recognition method and system |
CN112002304A (en) * | 2020-08-27 | 2020-11-27 | 上海添力网络科技有限公司 | Speech synthesis method and device |
CN112002304B (en) * | 2020-08-27 | 2024-03-29 | 上海添力网络科技有限公司 | Speech synthesis method and device |
CN113408503A (en) * | 2021-08-19 | 2021-09-17 | 明品云(北京)数据科技有限公司 | Emotion recognition method and device, computer readable storage medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108010516A (en) | Semantic independent speech emotion feature recognition method and device | |
CN108564942B (en) | Voice emotion recognition method and system based on adjustable sensitivity | |
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
Koolagudi et al. | IITKGP-SEHSC: Hindi speech corpus for emotion analysis | |
Iliev et al. | Spoken emotion recognition through optimum-path forest classification using glottal features | |
Demircan et al. | Feature extraction from speech data for emotion recognition | |
Sinith et al. | Emotion recognition from audio signals using Support Vector Machine | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
Meyer et al. | Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition | |
Bhat et al. | Automatic assessment of sentence-level dysarthria intelligibility using BLSTM | |
CN108597496A (en) | Voice generation method and device based on generation type countermeasure network | |
Yeh et al. | Segment-based emotion recognition from continuous Mandarin Chinese speech | |
CN107610707A (en) | A kind of method for recognizing sound-groove and device | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN104867489B (en) | A kind of simulation true man read aloud the method and system of pronunciation | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
Deshmukh et al. | Speech based emotion recognition using machine learning | |
Samantaray et al. | A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages | |
Casale et al. | Multistyle classification of speech under stress using feature subset selection based on genetic algorithms | |
Hasrul et al. | Human affective (emotion) behaviour analysis using speech signals: a review | |
Javidi et al. | Speech emotion recognition by using combinations of C5. 0, neural network (NN), and support vector machines (SVM) classification methods | |
Patni et al. | Speech emotion recognition using MFCC, GFCC, chromagram and RMSE features | |
CN109065073A (en) | Speech-emotion recognition method based on depth S VM network model | |
Besbes et al. | Multi-class SVM for stressed speech recognition | |
Gallardo-Antolín et al. | On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180508 |