CN110534091A - Human-vehicle interaction method based on microservices and intelligent speech recognition - Google Patents
Human-vehicle interaction method based on microservices and intelligent speech recognition — Download PDF — Info
- Publication number: CN110534091A
- Application number: CN201910758860.4A
- Authority
- CN
- China
- Prior art keywords
- audio data
- feature
- people
- voice
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
(All within G — Physics; G10 — Musical instruments; Acoustics; G10L — Speech analysis, synthesis, recognition, processing, and coding techniques.)
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08 — Speech classification or search
- G10L15/26 — Speech to text systems
- G10L17/22 — Speaker identification or verification; Interactive procedures; Man-machine interfaces
- G10L21/0208 — Speech enhancement; Noise filtering
- G10L25/24 — Speech or voice analysis characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/63 — Speech or voice analysis specially adapted for estimating an emotional state
Abstract
The invention discloses a human-vehicle interaction method based on microservices and intelligent speech recognition. The method obtains the voice data input by the user and samples it to generate audio data; preprocesses the audio data to remove background noise; at the same time extracts features from the audio data to generate speech-recognition features and emotion-recognition features; performs feature identification on those features to generate the voice content and emotion information; and finally queries a preset rule database according to the voice content and emotion information, generates the result with the highest matching score, and executes that result to carry out the human-vehicle interaction. Compared with traditional human-vehicle interaction methods, the embodiments of the present invention are more intelligent and also provide emotional-tendency analysis.
Description
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a human-vehicle interaction method based on microservices and intelligent speech recognition.
Background art
Existing human-vehicle interaction methods are based primarily on the semantic analysis of recognized speech: voice commands issued by the driver are parsed into instructions, and corresponding feedback actions are taken. The recognition facility is a traditional keyword-model-matching speech recognition system, the storage layer is a traditional relational database, and the back end is a traditional monolithic MVC service framework.
However, because existing human-vehicle interaction methods ignore the driver's emotional tendency at the moment of speaking, they are not intelligent or humanized enough. The single-machine storage space of a traditional relational database is limited, making mass data storage impractical; recognition accuracy is low, recognition speed is slow, and dictionary maintenance is troublesome; and the back-end system is huge, hard to maintain, and poorly scalable.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a human-vehicle interaction method based on microservices and intelligent speech recognition that is more intelligent than traditional human-vehicle interaction methods and additionally provides emotional-tendency analysis.
To achieve the above object, an embodiment of the invention provides a human-vehicle interaction method based on microservices and intelligent speech recognition, comprising the following steps:
Obtain the voice data input by the user, sample the voice data, and generate audio data;
Preprocess the audio data to remove its background noise, and at the same time extract features from the audio data to generate speech-recognition features and emotion-recognition features;
Perform feature identification on the speech-recognition features and the emotion-recognition features to generate the voice content and emotion information;
Query a preset rule database according to the voice content and the emotion information, generate the result with the highest matching score, and execute that result to carry out the human-vehicle interaction.
Further, the preprocessing includes: denoising, pre-emphasis, short-time analysis, framing, windowing, and endpoint detection.
Further, extracting features from the audio data to generate speech-recognition features and emotion-recognition features specifically comprises:
Performing Mel-frequency cepstral coefficient (MFCC) extraction on the audio data to generate the MFCCs of the audio data, and taking the MFCCs of the audio data as the speech-recognition features;
Performing emotional feature extraction on the audio data with the GeMAPS feature set to generate the GeMAPS features of the audio data, and taking the GeMAPS features of the audio data as the emotion-recognition features.
Further, the GeMAPS feature set includes 62 features; all 62 are HSF (high-level statistical function) features, computed from 18 LLD (low-level descriptor) features.
Further, performing MFCC extraction on the audio data specifically comprises:
Framing and windowing the audio data, applying an FFT to each frame, and obtaining a linear spectrogram;
Applying a Mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-Mel spectrogram;
Applying a discrete cosine transform (DCT) to the log-Mel spectrogram and retaining the 2nd through 13th coefficients of the result; these 12 coefficients are the MFCCs of the audio data and serve as the speech-recognition features.
Further, performing feature identification on the speech-recognition features and the emotion-recognition features to generate the voice content and emotion information specifically comprises:
Matching the characteristic parameters of the speech-recognition features against an acoustic model to generate the voice content of the speech;
Performing classification on the emotion-recognition features with a preset SVM multi-classification algorithm to obtain the emotion information of the speech.
Further, the emotion information has k classes, including happy, angry, fearful, sad, surprised, and neutral.
Further, performing classification on the emotion-recognition features with the preset SVM multi-classification algorithm specifically comprises:
Designing k(k-1)/2 SVMs, each classifying between the samples of one pair of classes, and taking the class that receives the most votes as the final class.
Further, when classifying the emotion-recognition features, the Spark in-memory big-data computing platform is used to obtain the result quickly.
Compared with the prior art, the invention has the following beneficial effects:
The human-vehicle interaction method based on microservices and intelligent speech recognition provided by the embodiments of the present invention obtains the voice data input by the user, samples it to generate audio data, preprocesses the audio data to remove background noise, extracts features to generate speech-recognition and emotion-recognition features, performs feature identification on those features to generate the voice content and emotion information, and finally queries a preset rule database according to the voice content and emotion information, generates the result with the highest matching score, and executes it to carry out the human-vehicle interaction. Compared with traditional human-vehicle interaction methods, the embodiments of the present invention are more intelligent and also provide emotional-tendency analysis.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the human-vehicle interaction method based on microservices and intelligent speech recognition provided by the invention;
Fig. 2 is an architecture diagram of the human-vehicle interaction system provided by one embodiment of the method;
Fig. 3 is a working-principle diagram of one embodiment of the method.
Specific embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, which is a flow diagram of one embodiment of the human-vehicle interaction method based on microservices and intelligent speech recognition provided by the invention: the embodiment of the present invention provides a human-vehicle interaction method based on microservices and intelligent speech recognition, including steps S1-S4.
S1: Obtain the voice data input by the user, sample the voice data, and generate audio data.
The HDFS distributed storage system can store PB-scale data and has the advantages of high availability, high fault tolerance, and scalability; in this embodiment, all of the raw voice data are stored in the HDFS distributed file system.
S2: Preprocess the audio data to remove its background noise, and at the same time extract features from the audio data to generate speech-recognition features and emotion-recognition features.
In this embodiment, the preprocessing includes denoising, pre-emphasis, short-time analysis, framing, windowing, and endpoint detection.
Specifically:
Denoising: after the voice input is complete, the noise is preprocessed; an automatic segmentation program cuts off the extraneous non-speech noise in the recording, such as overly long silent segments and electrical hum.
Pre-emphasis: the purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flat, which facilitates spectral analysis and channel-parameter analysis. Pre-emphasis can be applied before the anti-aliasing filter when the speech signal is digitized, but is usually applied after digitization.
Short-time analysis: a speech signal changes over time as a whole and is a non-stationary process, so it cannot be analyzed with digital signal processing techniques intended for stationary signals. However, since each speech sound is the response produced by the oral muscles shaping the vocal tract, and this movement is slow relative to the speech frequencies, the characteristics of the signal remain essentially stable over short ranges (generally taken to be 10-30 ms); that is, speech is short-time stationary. Any analysis and processing of a speech signal must therefore be built on a "short-time" basis, i.e., short-time analysis.
Framing: for short-time analysis, the speech signal is divided into segments, each called a frame, generally 10-30 ms long. To make the transition between frames smooth and maintain continuity, the method of overlapping segmentation is used: think of a pointer p starting at the beginning, cutting a segment whose head is at p and whose length is the frame length; the pointer then moves by a step called the frame shift, cutting one segment per move, thus producing many frames.
Windowing: windowing multiplies the signal s(n) by a window function w(n) to form the windowed speech signal s_w(n) = s(n)·w(n). The commonly used window functions are the rectangular window and the Hamming window (using a rectangular window is effectively no windowing). The window length N (in sample points) corresponds to one frame; at an 8 kHz sampling frequency, N is usually chosen as a compromise of 80-160 samples (i.e., a duration of 10-20 ms).
Endpoint detection: accurately find the start and end points of the speech within a segment of signal, so as to separate the effective speech signal from the useless noise signal.
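To make the pre-emphasis, framing, and windowing steps concrete, here is a minimal NumPy sketch; the pre-emphasis coefficient, frame length, frame shift, and test tone are illustrative assumptions, not values prescribed by the patent.

```python
import numpy as np

def preprocess(signal, fs=8000, frame_ms=20, shift_ms=10, alpha=0.97):
    """Pre-emphasize, then split a speech signal into overlapping
    Hamming-windowed frames (parameter values are illustrative)."""
    # Pre-emphasis: boost high frequencies to flatten the spectrum
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)   # 160 samples: 20 ms at 8 kHz
    shift = int(fs * shift_ms / 1000)       # frame shift for overlapping frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)

    window = np.hamming(frame_len)          # w(n); s_w(n) = s(n) * w(n)
    frames = np.stack([
        emphasized[i * shift: i * shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

signal = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of a 440 Hz tone
frames = preprocess(signal)
print(frames.shape)  # (99, 160) for this one-second input
```

With a 10 ms shift and a 20 ms window, adjacent frames overlap by half, which is the smooth-transition property the overlapping-segmentation method is after.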
In this embodiment, extracting features from the audio data to generate speech-recognition and emotion-recognition features specifically comprises: performing MFCC extraction on the audio data to generate its MFCCs, which serve as the speech-recognition features; and performing emotional feature extraction on the audio data with the GeMAPS feature set to generate its GeMAPS features, which serve as the emotion-recognition features.
The GeMAPS feature set includes 62 features; all 62 are HSF features, computed from 18 LLD features.
It should be noted that MFCC extraction specifically comprises: framing and windowing the audio data and applying an FFT to each frame to obtain a linear spectrogram; applying a Mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-Mel spectrogram; and applying a DCT to the log-Mel spectrogram and retaining the 2nd through 13th coefficients of the result, these 12 coefficients being the MFCCs of the audio data and serving as the speech-recognition features.
In the method of the invention, the feature-extraction link is carried out in two parts: speech feature extraction and emotional feature extraction.
The speech feature extraction service uses Mel-frequency cepstral coefficients (MFCC): MFCC extraction produces key characteristic parameters that reflect the character of the speech signal, forming a feature sequence. The steps of extracting MFCCs are: first frame and window the signal, then apply an FFT to each frame to obtain the per-frame linear spectrogram; apply a Mel filter bank to the linear spectrogram and take the logarithm to obtain the log-Mel spectrogram; then apply a DCT to the log filter-bank energies (the log-Mel spectrogram) and retain the 2nd through 13th coefficients; these 12 coefficients are the MFCCs.
The emotional feature extraction service uses the GeMAPS feature set: GeMAPS contains 62 features in total, all of them HSF features computed from 18 LLD features. The 18 LLD features comprise 6 frequency-related features, 3 energy/amplitude-related features, and 9 spectral features.
The concept of the fundamental tone F0: the fundamental tone is usually denoted F0 (F0 generally also refers to the fundamental frequency). A general sound is composed of a series of vibrations of different frequencies and amplitudes emitted by the sounding body. The vibration with the lowest frequency among them produces the fundamental tone; the rest are overtones.
The 6 frequency-related features are: pitch (logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz); jitter (the deviation between consecutive pitch periods — deviation measures the difference between an observed variable and a reference value, usually the variable's mean when no other reference is indicated); the center frequencies of the first three formants; and the bandwidth of the first formant.
The 3 energy/amplitude features are: shimmer (the difference in peak amplitude between adjacent pitch periods); loudness (an estimate of sound intensity obtained from the spectrum, computable from the energy); and HNR (harmonics-to-noise ratio).
The 9 spectral features are: Alpha Ratio (the summed energy of 50-1000 Hz divided by the summed energy of 1-5 kHz); Hammarberg Index (the strongest energy peak in 0-2 kHz divided by the strongest energy peak in 2-5 kHz); Spectral Slope 0-500 Hz and 500-1500 Hz (two slopes obtained by linear regression over the 0-500 Hz and 500-1500 Hz regions of the linear power spectrum); Formant 1, 2, and 3 relative energy (the energy at the center frequency of each of the first three formants divided by the energy of the fundamental peak); Harmonic difference H1-H2 (the energy of the first fundamental harmonic H1 divided by the energy of the second fundamental harmonic H2); and Harmonic difference H1-A3 (the energy of the first fundamental harmonic H1 divided by the highest harmonic energy within the range of the third formant).
Statistics are then computed over the 18 LLDs; the calculation applies a symmetric moving average over 3 frames of speech. First, the arithmetic mean and the coefficient of variation (the standard deviation normalized by the arithmetic mean) are computed, yielding 36 statistical features. Then 8 functionals are applied to loudness and pitch: the 20th, 50th, and 80th percentiles, the range between the 20th and 80th percentiles, and the mean and standard deviation of the slope of rising/falling signal segments, yielding 16 more statistical features; these functionals are computed over voiced regions (non-zero F0). Arithmetic means of Alpha Ratio, Hammarberg Index, and Spectral Slope 0-500 Hz and 500-1500 Hz yield 4 statistical features. In addition there are 6 temporal features: the number of loudness peaks per second, the mean length and standard deviation of continuous voiced regions (F0 > 0), the mean length and standard deviation of unvoiced regions (F0 = 0), and the number of voiced regions per second. Altogether, 36 + 16 + 4 + 6 = 62 features are obtained.
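To illustrate how such HSF functionals are derived from one LLD contour, here is a minimal NumPy sketch; the function name, the returned keys, and the toy F0 contour are assumptions for illustration, not the GeMAPS reference implementation.

```python
import numpy as np

def functionals(lld):
    """A few illustrative HSF functionals over one LLD contour (1-D array):
    arithmetic mean, coefficient of variation (std normalized by the mean),
    and the 20th/50th/80th percentiles plus the 20-80 percentile range."""
    mean = np.mean(lld)
    coef_var = np.std(lld) / mean            # std normalized by arithmetic mean
    p20, p50, p80 = np.percentile(lld, [20, 50, 80])
    return {"mean": mean, "coef_var": coef_var,
            "p20": p20, "p50": p50, "p80": p80,
            "p20_80_range": p80 - p20}

pitch = np.array([110.0, 115.0, 120.0, 130.0, 125.0])  # toy F0 contour in Hz
stats = functionals(pitch)
print(stats["mean"], stats["p50"])
```

Applying mean and coefficient of variation to each of the 18 LLDs gives the 36-feature block the text describes; the percentile-style functionals applied to loudness and pitch account for the additional 16.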
S3: Perform feature identification on the speech-recognition features and the emotion-recognition features, generating the voice content and emotion information.
In this embodiment, step S3 specifically comprises: matching the characteristic parameters of the speech-recognition features against an acoustic model to generate the voice content of the speech; and performing classification on the emotion-recognition features with a preset SVM multi-classification algorithm to obtain the emotion information of the speech.
As a preferred embodiment of the present invention, the emotion information has k classes, including happy, angry, fearful, sad, surprised, and neutral.
In this embodiment, classification with the preset SVM multi-classification algorithm specifically comprises: designing k(k-1)/2 SVMs, each classifying between the samples of one pair of classes, and taking the class that receives the most votes as the final class.
When classifying the emotion-recognition features, the Spark in-memory big-data computing platform is used to obtain the result quickly.
It should be noted that in the method of the invention the identification link includes two parts: speech recognition and emotion recognition.
Speech classification and recognition are carried out on the extracted MFCC speech-recognition features. The principle of speech recognition is that the characteristic parameters of a training speech corpus are used to train the acoustic model; the characteristic parameters of the speech to be identified are then matched against the acoustic model to identify the content of the speech.
In this embodiment, speech recognition uses an improved softmax multi-classification algorithm. The principle of softmax multi-classification is as follows: the outputs of multiple neurons are mapped into the interval (0, 1), where they can be understood as probabilities on which the multi-class decision is made.
The specific softmax classification process is as follows: at the start, the speech a is input and softmax computes a result for each candidate content — play music, probability 85%; park, probability 10%; watch a movie, probability 5%. The content of the speech is thus identified as: play music.
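The softmax mapping itself can be sketched as below; the label names and logit values are hypothetical stand-ins for the output of an acoustic/classification model, chosen so the probabilities come out near the 85/10/5 split in the example above.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical final-layer scores for three candidate voice contents
labels = ["play music", "park", "watch a movie"]
probs = softmax(np.array([2.5, 0.36, -0.33]))
best = labels[int(np.argmax(probs))]
print(best, probs.round(2))
```

The max-subtraction leaves the result unchanged mathematically but avoids overflow for large scores, which is the standard way softmax is computed in practice.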
Emotion recognition is carried out on the extracted GeMAPS emotion-recognition features. Emotion recognition in this embodiment uses an improved SVM multi-classification algorithm: the one-versus-one method (one-versus-one, abbreviated OVO SVMs or pairwise).
The principle of SVM multi-classification is as follows: one SVM is designed between each pair of class samples, so the samples of k classes require k(k-1)/2 SVMs. When classifying an unknown sample, the class that finally receives the most votes is the class of that unknown sample.
Specifically, the SVM classification process is as follows:
In this embodiment of the present invention there are 6 emotion classes — happy, angry, fearful, sad, surprised, and neutral — denoted A, B, C, D, E, and F respectively.
During training, the vectors corresponding to (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D), (C,E), (C,F), (D,E), (D,F), and (E,F) are used as training sets, yielding 15 trained classifiers. During testing, the corresponding vector is tested against all 15 classifiers, and the results are then put to a vote to obtain the final result.
The voting works like this:
Start: A = B = C = D = E = F = 0;
(A, B) classifier: if A wins, then A = A + 1; otherwise B = B + 1;
(A, C) classifier: if A wins, then A = A + 1; otherwise C = C + 1;
...
(E, F) classifier: if E wins, then E = E + 1; otherwise F = F + 1;
The decision is max(A, B, C, D, E, F).
The emotion information of the speech can thus be identified.
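The one-versus-one voting scheme above can be sketched in pure Python; `pairwise_winner` is a hypothetical stand-in for a trained binary SVM deciding between two classes.

```python
from itertools import combinations

EMOTIONS = ["happy", "angry", "fearful", "sad", "surprised", "neutral"]

def ovo_vote(pairwise_winner):
    """One-versus-one voting over k(k-1)/2 pairwise classifiers: each
    binary classifier votes for one of its two classes, and the class
    with the most votes is the final prediction."""
    votes = {c: 0 for c in EMOTIONS}
    pairs = list(combinations(EMOTIONS, 2))      # k = 6 -> 15 classifiers
    for a, b in pairs:
        votes[pairwise_winner(a, b)] += 1
    return max(votes, key=votes.get), len(pairs)

# Toy stand-in: pretend every classifier involving "happy" picks "happy",
# and the rest pick the alphabetically-first of their two classes.
winner, n_classifiers = ovo_vote(
    lambda a, b: "happy" if "happy" in (a, b) else min(a, b))
print(winner, n_classifiers)
```

With k = 6 there are exactly 15 pairwise classifiers, matching the (A,B) through (E,F) enumeration in the text; a class can collect at most k-1 = 5 votes.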
Since the classification involves relatively many pairwise sub-classifiers, the amount of computation can be large, so the Spark in-memory big-data computing platform is used for parallel computation; based on the Spark distributed computing platform's advanced in-memory computation model, the system has mass-data computing capability and can obtain the result quickly.
It should be noted that the emotion database used for training and testing is the CASIA Chinese emotion corpus.
S4: Query the preset rule database according to the voice content and the emotion information, generate the result with the highest matching score, and execute that result to carry out the human-vehicle interaction.
The recommendation-rule database of content and emotion, prepared in advance, is queried for the result with the highest matching score, and intelligent feedback is then made to complete the human-vehicle interaction; this content-and-emotion recommendation-rule database stores content highly relevant to travel.
Referring to Fig. 2, which is the architecture diagram of the human-vehicle interaction system provided by one embodiment of the method: the human-vehicle interaction system can execute the method provided by the invention through a microservices platform, with the different steps developed into individual small services that each carry a single business function; each service has its own processing and lightweight communication mechanism and can be deployed on one or more servers.
Continuing with referring to fig. 2, it can be seen that each service unit in people-car interaction system is all independent, wherein language
The function of sound input service: voice data is received, voice data is sampled;
The function of data prediction service: background noise is filtered out;
The function of feature extraction service: the relevant feature of voice, including phonetic feature, affective characteristics are extracted;
The function of speech-recognition services: it identifies the content of voice, is speaker has said anything;
The function of emotion recognition service: by the feature extracted, the emotion information identification of speaker is carried out;
The function of Intelligence Feedback service: by the emotion information identified, Intelligence Feedback service is provided.
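One such independent service unit could be exposed over HTTP roughly as below. This is a sketch of the emotion recognition service using only the Python standard library; the route, port, payload shape and stub classifier are assumptions, not specified by the patent.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def recognize_emotion(features):
    """Stub standing in for the SVM-based emotion classifier."""
    return "neutral"

class EmotionHandler(BaseHTTPRequestHandler):
    """Accepts a POSTed JSON body {"features": [...]} and returns {"emotion": ...}."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))["features"]
        body = json.dumps({"emotion": recognize_emotion(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Blocking entry point; each such service can run on its own server(s)."""
    HTTPServer(("0.0.0.0", port), EmotionHandler).serve_forever()
```

Because each unit only speaks a lightweight protocol like this, the services can be scaled and deployed independently, as the architecture above describes.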
Please refer to Fig. 2 and Fig. 3. To better illustrate the working principle of the method of the present invention, the following is the working principle of the people-car interaction method based on a microserver and intelligent voice recognition provided by the present invention: first, the user speaks a voice instruction to the people-car interaction system, e.g. "I want to listen to music"; the voice input service in the people-car interaction system performs data acquisition on this segment of voice and generates audio data; then the data preprocessing service in the people-car interaction system preprocesses this audio data to remove background noise; next, the speech feature extraction service in the people-car interaction system extracts the feature list relevant to speech recognition, and at the same time extracts the feature list relevant to emotion recognition; then the speech recognition service in the people-car interaction system performs speech recognition according to the speech feature list and identifies the content of the voice, e.g. "listen to music", while the emotion recognition service performs emotion recognition according to the emotion feature list and identifies the speaker's emotion information, e.g. "anger"; the intelligent feedback service of the people-car interaction system then processes the identified voice content and emotion information and produces an intelligent feedback result, e.g. opening the music player and playing a light, cheerful piece of music for the user to soothe the mood; finally, the user receives the feedback, and this voice interaction process ends.
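The end-to-end flow just described can be sketched as a chain of service stubs; all function names and return values here are illustrative assumptions standing in for the real services:

```python
def acquire(voice: str) -> dict:
    """Voice input service: sample the utterance into audio data (stubbed)."""
    return {"audio": voice}

def preprocess(data: dict) -> dict:
    """Data preprocessing service: remove background noise (stubbed)."""
    data["audio"] = data["audio"].strip()
    return data

def extract_features(data: dict) -> dict:
    """Feature extraction service: speech (MFCC) and emotion (GeMAPS) features."""
    return {**data, "speech_features": [], "emotion_features": []}

def recognize_speech(data: dict) -> str:
    return "listen to music"   # stub for the acoustic-model matching

def recognize_emotion(data: dict) -> str:
    return "angry"             # stub for the SVM multi-classification

def feedback(content: str, emotion: str) -> str:
    """Intelligent feedback service: rule lookup on (content, emotion)."""
    if content == "listen to music" and emotion == "angry":
        return "open the music player and play a light, cheerful song"
    return "no action"

def interact(voice: str) -> str:
    data = extract_features(preprocess(acquire(voice)))
    return feedback(recognize_speech(data), recognize_emotion(data))
```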
In summary, the people-car interaction method based on a microserver and intelligent voice recognition provided by the embodiments of the present invention obtains the voice data input by the user, performs data acquisition on the voice data to generate audio data, preprocesses the audio data to remove its background noise, performs feature extraction on the audio data to generate speech recognition features and emotion recognition features, then performs feature recognition on the speech recognition features and the emotion recognition features to generate voice content and emotion information, and finally queries a preset rule database according to the voice content and the emotion information to generate the result with the highest matching score and executes that result to carry out people-car interaction. Compared with traditional people-car interaction methods, the embodiments of the present invention are more intelligent and also provide emotion-tendency analysis.
The embodiments provided by the present invention have the following beneficial effects:
1. The machine-learning-based speech recognition has higher accuracy;
2. The machine-learning-based speech recognition has faster recognition speed;
3. The emotion-tendency analysis function makes the interaction more intelligent;
4. Based on a distributed computing platform, massive data can be processed in real time;
5. Based on a distributed storage platform, robustness is high;
6. When applied as a system, it is suitable for various travel scenarios;
7. The microservices used when applied as a system make operation and maintenance simple;
8. When applied as a system, it has a data mining function;
9. When applied as a system, multilingual input is supported;
10. When applied as a system, multilingual output is supported.
The above are preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications are also considered to fall within the protection scope of the present invention.
Claims (9)
1. A people-car interaction method based on a microserver and intelligent voice recognition, characterized by comprising the following steps:
obtaining the voice data input by a user, performing data acquisition on the voice data, and generating audio data;
preprocessing the audio data to remove the background noise in the audio data, while performing feature extraction on the audio data to generate speech recognition features and emotion recognition features;
performing feature recognition on the speech recognition features and the emotion recognition features to generate voice content and emotion information;
querying a preset rule database according to the voice content and the emotion information, generating the result with the highest matching score, and executing the result to carry out people-car interaction.
2. The people-car interaction method based on a microserver and intelligent voice recognition of claim 1, wherein the preprocessing comprises: denoising, pre-emphasis, short-time analysis, framing, windowing and endpoint detection.
3. The people-car interaction method based on a microserver and intelligent voice recognition of claim 1, wherein the performing feature extraction on the audio data to generate speech recognition features and emotion recognition features specifically comprises:
performing mel-frequency cepstral coefficient (MFCC) extraction on the audio data, generating the mel-frequency cepstral coefficients (MFCC) of the audio data, and using the mel-frequency cepstral coefficients (MFCC) of the audio data as the speech recognition features;
performing emotion feature extraction on the audio data through the GeMAPS feature set, generating the GeMAPS feature set of the audio data, and using the GeMAPS feature set of the audio data as the emotion recognition features.
4. The people-car interaction method based on a microserver and intelligent voice recognition of claim 3, wherein the GeMAPS feature set comprises 62 features, the 62 features are HSF features, and the 62 features are calculated from 18 LLD features.
5. The people-car interaction method based on a microserver and intelligent voice recognition of claim 3, wherein the performing mel-frequency cepstral coefficient (MFCC) extraction on the audio data, generating the mel-frequency cepstral coefficients (MFCC) of the audio data, and using the mel-frequency cepstral coefficients (MFCC) of the audio data as the speech recognition features specifically comprises:
framing and windowing the audio data, and applying an FFT to each frame to obtain a linear spectrogram;
applying a mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-mel spectrogram;
applying a DCT (discrete cosine transform) to the log-mel spectrogram, retaining the 2nd to 13th coefficients of the result, using the resulting 12 coefficients as the mel-frequency cepstral coefficients (MFCC) of the audio data, and using the mel-frequency cepstral coefficients (MFCC) of the audio data as the speech recognition features.
6. The people-car interaction method based on a microserver and intelligent voice recognition of claim 5, wherein the performing feature recognition on the speech recognition features and the emotion recognition features to generate voice content and emotion information specifically comprises:
matching the characteristic parameters of the speech recognition features against an acoustic model, the matching generating the voice content of the voice;
performing classification calculation on the emotion recognition features through a preset SVM multi-classification algorithm to obtain the emotion information of the voice.
7. The people-car interaction method based on a microserver and intelligent voice recognition of claim 6, wherein the emotion information comprises k classes, including happy, angry, fearful, sad, surprised and neutral.
8. The people-car interaction method based on a microserver and intelligent voice recognition of claim 7, wherein the performing classification calculation on the emotion recognition features through a preset SVM multi-classification algorithm specifically comprises:
designing k(k-1)/2 SVMs, using one SVM to classify between the samples of any two classes, and taking the class with the most votes as the final class.
9. The people-car interaction method based on a microserver and intelligent voice recognition of claim 8, wherein, when performing classification calculation on the emotion recognition features, the big-data Spark in-memory computing platform is used so as to obtain the calculation results quickly.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910758860.4A CN110534091A (en) | 2019-08-16 | 2019-08-16 | A kind of people-car interaction method identified based on microserver and intelligent sound |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110534091A true CN110534091A (en) | 2019-12-03 |
Family
ID=68663448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910758860.4A Pending CN110534091A (en) | 2019-08-16 | 2019-08-16 | A kind of people-car interaction method identified based on microserver and intelligent sound |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534091A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544963A (en) * | 2013-11-07 | 2014-01-29 | 东南大学 | Voice emotion recognition method based on core semi-supervised discrimination and analysis |
CN106601231A (en) * | 2016-12-22 | 2017-04-26 | 深圳市元征科技股份有限公司 | Vehicle control method and apparatus |
CN106803423A (en) * | 2016-12-27 | 2017-06-06 | 智车优行科技(北京)有限公司 | Man-machine interaction sound control method, device and vehicle based on user emotion state |
CN106874016A (en) * | 2017-03-07 | 2017-06-20 | 长江大学 | A kind of new customizable big data platform architecture method |
CN109712681A (en) * | 2018-12-21 | 2019-05-03 | 河海大学常州校区 | A kind of vehicle-mounted analysis system based on sign big data |
Non-Patent Citations (1)
Title |
---|
FLORIAN EYBEN等: ""The geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing"", 《IEEE TRANSACTIONS ON AFFECTIVE COMPUTING》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111128178A (en) * | 2019-12-31 | 2020-05-08 | 上海赫千电子科技有限公司 | Voice recognition method based on facial expression analysis |
CN111785294A (en) * | 2020-06-12 | 2020-10-16 | Oppo广东移动通信有限公司 | Audio detection method and device, terminal and storage medium |
CN111785294B (en) * | 2020-06-12 | 2024-04-02 | Oppo广东移动通信有限公司 | Audio detection method and device, terminal and storage medium |
CN111968622A (en) * | 2020-08-18 | 2020-11-20 | 广州市优普科技有限公司 | Attention mechanism-based voice recognition method, system and device |
CN114141239A (en) * | 2021-11-29 | 2022-03-04 | 江南大学 | Voice short instruction identification method and system based on lightweight deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191203 |