CN110534091A - Human-vehicle interaction method based on a micro-server and intelligent speech recognition - Google Patents

Human-vehicle interaction method based on a micro-server and intelligent speech recognition

Info

Publication number
CN110534091A
CN110534091A
Authority
CN
China
Prior art keywords
audio data
feature
people
voice
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910758860.4A
Other languages
Chinese (zh)
Inventor
邱华礼
孙一帅
陈晶
曹刚
梁维新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wilson Information Technology Co Ltd
Original Assignee
Guangzhou Wilson Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wilson Information Technology Co Ltd filed Critical Guangzhou Wilson Information Technology Co Ltd
Priority to CN201910758860.4A
Publication of CN110534091A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/26: Speech to text systems
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a human-vehicle interaction method based on a micro-server and intelligent speech recognition. The method obtains the speech input by a user and samples it to generate audio data; the audio data is pre-processed to remove background noise while speech-recognition features and emotion-recognition features are extracted from it; feature identification is then performed on these features to generate the voice content and the emotion information; finally, a preset rule database is queried according to the voice content and the emotion information, the result with the highest matching score is generated, and that result is executed to carry out the human-vehicle interaction. Compared with traditional human-vehicle interaction methods, the embodiments of the invention are more intelligent and also provide emotion-tendency analysis.

Description

Human-vehicle interaction method based on a micro-server and intelligent speech recognition
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a human-vehicle interaction method based on a micro-server and intelligent speech recognition.
Background art
Existing human-vehicle interaction methods are mainly based on the semantic analysis of speech recognition: the voice command issued by the driver is analysed and a corresponding feedback action is made. The recognition facility used is a speech-recognition system based on traditional keyword-model matching, the storage tool is a traditional relational database, and the back end is a traditional MVC monolithic service architecture.
However, because existing human-vehicle interaction methods ignore the driver's emotional tendency at the time, they are not intelligent or humanized enough. The single-machine storage space of a traditional relational database is limited, which makes mass data storage too difficult; recognition accuracy is low, recognition speed is slow, dictionary maintenance is troublesome, and the back-end system used is bulky, hard to maintain and poorly extensible.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a human-vehicle interaction method based on a micro-server and intelligent speech recognition which, compared with traditional human-vehicle interaction methods, is more intelligent and also provides emotion-tendency analysis.
To achieve the above object, an embodiment of the present invention provides a human-vehicle interaction method based on a micro-server and intelligent speech recognition, comprising the following steps:
obtaining the speech data input by a user, and sampling the speech data to generate audio data;
pre-processing the audio data to remove background noise in the audio data, and at the same time performing feature extraction on the audio data to generate speech-recognition features and emotion-recognition features;
performing feature identification on the speech-recognition features and the emotion-recognition features to generate the voice content and the emotion information;
querying a preset rule database according to the voice content and the emotion information, generating the result with the highest matching score, and executing the result to carry out the human-vehicle interaction.
Further, the pre-processing includes denoising, pre-emphasis, short-time analysis, framing, windowing and endpoint detection.
Further, performing feature extraction on the audio data to generate speech-recognition features and emotion-recognition features specifically comprises:
performing Mel-frequency cepstral coefficient (MFCC) extraction on the audio data to generate the MFCCs of the audio data, and using the MFCCs of the audio data as the speech-recognition features;
performing emotion feature extraction on the audio data with the GeMAPS feature set to generate the GeMAPS features of the audio data, and using the GeMAPS features of the audio data as the emotion-recognition features.
Further, the GeMAPS feature set includes 62 features; the 62 features are HSF (high-level statistical function) features and are computed from 18 LLD (low-level descriptor) features.
Further, performing MFCC extraction on the audio data, generating the MFCCs of the audio data and using the MFCCs of the audio data as the speech-recognition features specifically comprises:
framing and windowing the audio data and applying an FFT to each frame to obtain a linear spectrogram;
applying a Mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-Mel spectrogram;
applying a DCT (discrete cosine transform) to the log-Mel spectrogram, retaining the 2nd to 13th coefficients of the result, and using these 12 coefficients as the MFCCs of the audio data, i.e. as the speech-recognition features.
Further, performing feature identification on the speech-recognition features and the emotion-recognition features to generate the voice content and the emotion information specifically comprises:
matching the feature parameters of the speech-recognition features against an acoustic model to generate the voice content of the speech;
classifying the emotion-recognition features with a preset SVM multi-class algorithm to obtain the emotion information of the speech.
Further, the emotion information has k classes, including happy, angry, fearful, sad, surprised and neutral.
Further, classifying the emotion-recognition features with the preset SVM multi-class algorithm specifically comprises:
designing k(k-1)/2 SVMs, using one SVM to classify between each pair of classes, and taking the class that receives the most votes as the final class.
Further, when classifying the emotion-recognition features, the big-data Spark in-memory computing platform is used so that the result is obtained quickly.
Compared with the prior art, the invention has the following beneficial effects:
The human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the embodiments of the present invention obtains the speech input by a user, samples it to generate audio data, pre-processes the audio data to remove background noise, and extracts speech-recognition features and emotion-recognition features from it. Feature identification is then performed on these features to generate the voice content and the emotion information. Finally, a preset rule database is queried according to the voice content and the emotion information, the result with the highest matching score is generated, and that result is executed to carry out the human-vehicle interaction. Compared with traditional human-vehicle interaction methods, the embodiments of the present invention are more intelligent and also provide emotion-tendency analysis.
Brief description of the drawings
Fig. 1 is a flow diagram of an embodiment of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention;
Fig. 2 is an architecture diagram of the human-vehicle interaction system provided by an embodiment of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention;
Fig. 3 is a working-principle diagram of an embodiment of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments. It is evident that the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, which is a flow diagram of an embodiment of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention, the embodiment of the present invention provides a human-vehicle interaction method based on a micro-server and intelligent speech recognition, including steps S1-S4.
S1: obtain the speech data input by the user, sample the speech data and generate audio data.
An HDFS distributed storage system can store mass data at the PB level and has the advantages of high availability, high fault tolerance and scalability. In this embodiment, all of the raw speech data is stored in the HDFS distributed file system.
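The patent only names HDFS as the storage layer; as a non-authoritative sketch, raw recordings could be archived with the Python hdfs (WebHDFS) client roughly as follows. The namenode address, user name and directories are assumed placeholders, not values from the patent.

    # Sketch (assumption, not from the patent): writing raw audio into HDFS with the
    # `hdfs` (HdfsCLI) package. Namenode URL, user and paths are illustrative.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="voice")  # assumed WebHDFS endpoint

    def archive_raw_audio(local_wav_path, hdfs_dir="/voice/raw"):
        """Copy one recording into HDFS and return its HDFS path."""
        hdfs_path = hdfs_dir + "/" + local_wav_path.split("/")[-1]
        with open(local_wav_path, "rb") as f:
            client.write(hdfs_path, f, overwrite=True)
        return hdfs_path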
S2: pre-process the audio data to remove background noise in the audio data, and at the same time perform feature extraction on the audio data to generate speech-recognition features and emotion-recognition features.
In this embodiment, the pre-processing includes denoising, pre-emphasis, short-time analysis, framing, windowing and endpoint detection.
Specifically:
Denoising: after the speech has been captured, the noise is pre-processed; an automatic segmentation program cuts off the superfluous non-speech noise in the signal, such as overly long silent segments and electrical current noise.
Pre-emphasis: the purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter, which facilitates spectral analysis or vocal-tract parameter analysis. Pre-emphasis can be applied before the anti-aliasing filter when the speech signal is digitized, but it is usually applied after digitization.
Short-time analysis: as a whole, a speech signal changes over time and is a non-stationary process, so it cannot be analysed with the digital signal processing techniques used for stationary signals. However, because different sounds are responses produced by the mouth muscles shaping the vocal tract, and this articulation is slow relative to the speech frequencies, the characteristics of the signal remain essentially stable within a short time range (generally taken to be 10-30 ms); that is, speech is short-time stationary. Any analysis and processing of a speech signal must therefore be built on a "short-time" basis, i.e. short-time analysis.
Framing: for short-time analysis the speech signal is divided into segments, each of which is called a frame and generally covers 10-30 ms. To make the transition between frames smooth and preserve continuity, overlapping segmentation is used: a pointer p starts at the beginning of the signal and a segment of one frame length starting at p is cut out; the pointer then moves forward by a step called the frame shift, and a new segment is cut out at every move, which yields the sequence of frames.
Windowing: windowing multiplies the signal s(n) by a window function w(n) to form the windowed speech signal sw(n) = s(n) * w(n). The commonly used window functions are the rectangular window and the Hamming window (using a rectangular window is effectively no windowing at all). The window length N (the number of sample points) corresponds to one frame; at an 8 kHz sampling rate, N is typically chosen as a compromise of 80-160 samples (a duration of 10-20 ms).
Endpoint detection: the starting point and end point of the speech are located accurately within a segment of signal, so that the useful speech signal is separated from the useless noise signal.
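As an illustrative sketch only (the patent gives no code), the pre-emphasis, framing, Hamming-window and a simple energy-based endpoint-detection step described above could be written with NumPy as follows; the coefficient, frame sizes and threshold are assumptions, not values from the patent.

    # Sketch of the described pre-processing chain: pre-emphasis, overlapping framing,
    # Hamming windowing and a crude energy-based endpoint detection. Parameters are illustrative.
    import numpy as np

    def preprocess(signal, sr=8000, frame_ms=20, shift_ms=10, alpha=0.97, energy_thresh=1e-4):
        # Pre-emphasis: boost high frequencies to flatten the spectrum.
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

        # Framing: frame length 10-30 ms, pointer p advances by the frame shift each time.
        frame_len = int(sr * frame_ms / 1000)     # e.g. 160 samples at 8 kHz
        frame_shift = int(sr * shift_ms / 1000)
        n_frames = 1 + (len(emphasized) - frame_len) // frame_shift   # assumes >= one frame of audio
        frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                           for i in range(n_frames)])

        # Windowing: multiply each frame by a Hamming window w(n).
        frames = frames * np.hamming(frame_len)

        # Endpoint detection (very crude): keep frames whose short-time energy exceeds a
        # threshold, i.e. drop long silent segments.
        energy = (frames ** 2).mean(axis=1)
        return frames[energy > energy_thresh]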
In this embodiment, performing feature extraction on the audio data to generate speech-recognition features and emotion-recognition features specifically comprises: performing MFCC extraction on the audio data to generate the MFCCs of the audio data, and using the MFCCs of the audio data as the speech-recognition features; and performing emotion feature extraction on the audio data with the GeMAPS feature set to generate the GeMAPS features of the audio data, and using the GeMAPS features of the audio data as the emotion-recognition features.
The GeMAPS feature set includes 62 features; the 62 features are HSF features and are computed from 18 LLD features.
It should be noted that performing MFCC extraction on the audio data, generating the MFCCs of the audio data and using them as the speech-recognition features specifically comprises: framing and windowing the audio data and applying an FFT to each frame to obtain a linear spectrogram; applying a Mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-Mel spectrogram; and applying a DCT to the log-Mel spectrogram, retaining the 2nd to 13th coefficients of the result and using these 12 coefficients as the MFCCs of the audio data, i.e. as the speech-recognition features.
In the method of the invention, feature extraction is carried out in two parts: speech feature extraction and emotion feature extraction.
The speech feature extraction service uses Mel-frequency cepstral coefficients (MFCC): key characteristic parameters that reflect the properties of the speech signal are extracted and form a feature sequence. The steps for extracting the MFCCs are: first frame and window the signal and apply an FFT to each frame to obtain the (single-frame) linear spectrogram; then apply a Mel filter bank to the linear spectrogram and take the logarithm to obtain the log-Mel spectrogram; then apply a DCT (discrete cosine transform) to the log filter-bank energies (the log-Mel spectrogram) and retain the 2nd to 13th coefficients; these 12 coefficients are the MFCCs.
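The MFCC chain just described (FFT per frame, Mel filter bank, log, DCT, keep coefficients 2-13) could be sketched as below. This is an assumed illustration using NumPy, SciPy and librosa's Mel filter bank, not code from the patent; the FFT and filter-bank sizes are placeholders.

    # Sketch of the MFCC pipeline described above: frames -> FFT -> Mel filter bank
    # -> log -> DCT -> keep the 2nd to 13th coefficients (12 MFCCs per frame).
    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def mfcc_features(frames, sr=8000, n_fft=256, n_mels=26):
        # Linear spectrogram: magnitude of the FFT of each windowed frame.
        spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))          # (n_frames, n_fft//2 + 1)

        # Mel filter bank applied to the linear spectrogram, then log.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2 + 1)
        log_mel = np.log(spectrum @ mel_fb.T + 1e-10)                    # log-Mel spectrogram

        # DCT of the log-Mel energies; retain coefficients 2..13 as the MFCCs.
        cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
        return cepstra[:, 1:13]                                          # 12 coefficients per frame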
The emotion feature extraction service uses the GeMAPS feature set: the GeMAPS feature set contains 62 features in total, all of which are HSF features computed from 18 LLD features. The 18 LLD features comprise 6 frequency-related features, 3 energy/amplitude-related features and 9 spectral features.
The concept of the fundamental F0: the fundamental is usually denoted F0 (F0 generally also refers to the fundamental frequency). A sound is normally the combination of a series of vibrations of different frequencies and amplitudes emitted by the sounding body. Among these vibrations there is one with the lowest frequency; the sound it produces is the fundamental, and the rest are overtones.
The 6 frequency-related features are: pitch (log F0 calculated on a semitone frequency scale, starting from 27.5 Hz); jitter (the deviation between consecutive pitch periods; a deviation measures the difference between an observed variable and a reference value, which is usually the mean of the variable if not stated otherwise); the centre frequencies of the first three formants; and the bandwidth of the first formant.
The 3 energy/amplitude features are: shimmer (the difference in peak amplitude between adjacent pitch periods); loudness (an estimate of the intensity of the sound obtained from the spectrum, which can be computed from the energy); and HNR (harmonics-to-noise ratio).
The 9 spectral features are: Alpha Ratio (the summed energy in 50-1000 Hz divided by the summed energy in 1-5 kHz); Hammarberg Index (the strongest energy peak in 0-2 kHz divided by the strongest energy peak in 2-5 kHz); Spectral Slope 0-500 Hz and 500-1500 Hz (two slopes obtained by linear regression over the 0-500 Hz and 500-1500 Hz regions of the linear power spectrum); Formant 1, 2 and 3 relative energy (the energy at the centre frequency of each of the first three formants divided by the energy of the F0 peak); Harmonic difference H1-H2 (the energy of the first F0 harmonic H1 divided by the energy of the second F0 harmonic H2); and Harmonic difference H1-A3 (the energy of the first F0 harmonic H1 divided by the energy of the highest harmonic in the range of the third formant).
Statistical functionals are then applied to the 18 LLDs, computed over a symmetric moving average of 3 frames. First the arithmetic mean and the coefficient of variation (the standard deviation normalised by the arithmetic mean) are calculated, giving 36 statistical features. Then 8 functionals are applied to loudness and pitch: the 20th, 50th and 80th percentiles, the range between the 20th and 80th percentiles, and the mean and standard deviation of the slope of rising/falling signal parts, giving 16 statistical features. The functionals above are applied to voiced regions (non-zero F0). The arithmetic mean of Alpha Ratio, Hammarberg Index and Spectral Slope 0-500 Hz and 500-1500 Hz gives 4 more statistical features. In addition there are 6 temporal features: the number of loudness peaks per second, the mean length and standard deviation of continuously voiced regions (F0 > 0), the mean length and standard deviation of unvoiced regions (F0 = 0), and the number of voiced regions per second. In total, 36 + 16 + 4 + 6 = 62 features are obtained.
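The patent does not say which toolkit computes the 62 GeMAPS functionals; one assumed way to obtain them in Python is the openSMILE wrapper from audEERING, sketched below. The feature-set variant and the file path are illustrative choices, not part of the patent.

    # Sketch (assumption, not from the patent): extracting GeMAPS functionals with the
    # `opensmile` Python package. It returns one row of HSF features per utterance,
    # computed from the low-level descriptors.
    import opensmile

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.GeMAPSv01b,        # Geneva minimalistic parameter set
        feature_level=opensmile.FeatureLevel.Functionals,   # statistical functionals (HSFs)
    )
    emotion_features = smile.process_file("utterance.wav")  # pandas DataFrame, one row
    print(emotion_features.shape)                           # (1, number of GeMAPS functionals)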
S3: perform feature identification on the speech-recognition features and the emotion-recognition features to generate the voice content and the emotion information.
In this embodiment, step S3 specifically comprises: matching the feature parameters of the speech-recognition features against an acoustic model to generate the voice content of the speech; and classifying the emotion-recognition features with a preset SVM multi-class algorithm to obtain the emotion information of the speech.
As a preferred embodiment of the present invention, the emotion information has k classes, including happy, angry, fearful, sad, surprised and neutral.
In this embodiment, classifying the emotion-recognition features with the preset SVM multi-class algorithm specifically comprises: designing k(k-1)/2 SVMs, using one SVM to classify between each pair of classes, and taking the class that receives the most votes as the final class.
When classifying the emotion-recognition features, the big-data Spark in-memory computing platform is used so that the result is obtained quickly.
It should be noted that in the method of the invention the identification stage consists of two parts: speech recognition and emotion recognition.
Speech classification and recognition is carried out on the extracted MFCC speech-recognition features. The principle of speech recognition is that an acoustic model is trained with the feature parameters of a training speech corpus, and the feature parameters of the speech to be recognised are then matched against the acoustic model to identify the content of the speech.
In this embodiment, speech recognition uses an improved softmax multi-class algorithm. The principle of softmax multi-class classification is as follows: the outputs of multiple neurons are mapped into the interval (0, 1), where they can be interpreted as probabilities and used for multi-class classification.
The specific softmax classification process is as follows:
At the start, speech a is input and the softmax computation produces the result for the voice content: play music, probability 85%; park the car, probability 10%; watch a movie, probability 5%. The content of the speech is thus identified as: play music.
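As a small numeric illustration (not from the patent text), the softmax mapping and the resulting decision can be reproduced as follows; the logits are made up so that the probabilities come out near the 85%/10%/5% example above.

    # Sketch: softmax maps raw scores into (0, 1) probabilities; the class with the
    # highest probability is taken as the recognised content. Logits are illustrative.
    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    labels = ["play music", "park the car", "watch a movie"]
    logits = np.array([3.0, 0.86, 0.17])        # assumed network outputs for utterance a
    probs = softmax(logits)
    print(dict(zip(labels, probs.round(2))))    # roughly 0.85 / 0.10 / 0.05
    print("recognised content:", labels[int(np.argmax(probs))])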
Emotion recognition is carried out on the extracted GeMAPS emotion-recognition features. In this embodiment, emotion recognition uses an improved SVM multi-class algorithm: the one-versus-one method (one-versus-one, abbreviated OVO SVMs or pairwise).
The principle of multi-class SVM is as follows: one SVM is designed between every pair of classes, so k classes of samples require k(k-1)/2 SVMs. When an unknown sample is classified, the class that receives the most votes is taken as the class of that unknown sample.
Specifically, the SVM classification process is as follows:
In this embodiment of the invention there are 6 emotion classes: happy, angry, fearful, sad, surprised and neutral, denoted A, B, C, D, E and F respectively.
During training, the vectors corresponding to the pairs (A, B), (A, C), (A, D), (A, E), (A, F), (B, C), (B, D), (B, E), (B, F), (C, D), (C, E), (C, F), (D, E), (D, F) and (E, F) are used as training sets, giving 15 trained classifiers. During testing, the corresponding vector is tested against each of the 15 classifiers, voting is then applied, and a final result is obtained.
The voting works as follows:
Start: A = B = C = D = E = F = 0;
if the (A, B) classifier decides A wins, then A = A + 1; otherwise B = B + 1;
if the (A, C) classifier decides A wins, then A = A + 1; otherwise C = C + 1;
...
if the (E, F) classifier decides E wins, then E = E + 1; otherwise F = F + 1;
the decision is max(A, B, C, D, E, F).
The emotion information of the speech can thus be identified.
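A non-authoritative sketch of this one-versus-one scheme with scikit-learn is given below; SVC builds exactly the k(k-1)/2 pairwise classifiers described above (15 for 6 emotions) and resolves the decision by voting. The training data shapes and labels are placeholders.

    # Sketch: one-versus-one multi-class SVM over the 6 emotion classes. scikit-learn's
    # SVC trains k(k-1)/2 = 15 pairwise classifiers internally and predicts by majority
    # vote, matching the voting procedure described above.
    import numpy as np
    from sklearn.svm import SVC

    emotions = ["happy", "angry", "fearful", "sad", "surprised", "neutral"]

    # Placeholder training data: one 62-dimensional GeMAPS vector per utterance.
    X_train = np.random.rand(120, 62)
    y_train = np.random.choice(emotions, size=120)

    clf = SVC(kernel="rbf", decision_function_shape="ovo")  # explicit one-versus-one
    clf.fit(X_train, y_train)

    x_test = np.random.rand(1, 62)                # feature vector of the utterance to classify
    print("predicted emotion:", clf.predict(x_test)[0])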
Because there are relatively many pairwise classifiers, the amount of computation is large, so the big-data Spark in-memory computing platform is used for parallel computation. The Spark distributed computing platform uses an in-memory computation model and has the capacity to compute over mass data, so the result is obtained quickly.
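The patent only states that Spark's in-memory computation is used to speed up the classification; one assumed way to parallelise the scoring of many utterances with PySpark is sketched below. The session name, partition count, and the clf and X_train objects from the previous sketch are illustrative.

    # Sketch: classifying a batch of utterances in parallel on Spark. The trained OVO
    # classifier is broadcast to the executors and each partition scores its feature
    # vectors in memory. All names and sizes are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emotion-recognition").getOrCreate()
    sc = spark.sparkContext

    broadcast_clf = sc.broadcast(clf)             # clf: the fitted SVC from the sketch above
    feature_rows = [x.tolist() for x in X_train]  # placeholder batch of 62-dim GeMAPS vectors

    predictions = (
        sc.parallelize(feature_rows, numSlices=8)
          .map(lambda row: broadcast_clf.value.predict([row])[0])
          .collect()
    )
    spark.stop()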
It should be noted that the CASIA Chinese emotion corpus is used as the emotion database for training and testing.
S4: query the preset rule database according to the voice content and the emotion information, generate the result with the highest matching score, and execute the result to carry out the human-vehicle interaction.
A recommendation-rule database of content and emotion is built in advance; the result with the highest matching score is queried from it, the intelligent feedback is produced and the human-vehicle interaction is completed. The recommendation-rule database of content and emotion stores content that is highly relevant to travel.
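The patent does not spell out the structure of the recommendation-rule database; the toy lookup below shows one assumed way to score the preset rules against the recognised content and emotion and return the best match. The table contents and scores are invented for illustration.

    # Sketch: scoring preset rules against the recognised (content, emotion) pair and
    # picking the highest-scoring rule. The rule table and scores are illustrative only.
    RULES = [
        {"content": "play music", "emotion": "angry",   "action": "play calm, cheerful music", "score": 0.95},
        {"content": "play music", "emotion": "neutral", "action": "play the user's playlist",  "score": 0.90},
        {"content": "navigate",   "emotion": "any",     "action": "open navigation",           "score": 0.80},
    ]

    def best_rule(content, emotion):
        def match(rule):
            hits = (rule["content"] == content) + (rule["emotion"] in (emotion, "any"))
            return hits * rule["score"]            # matching score: number of hits weighted by rule score
        return max(RULES, key=match)

    print(best_rule("play music", "angry")["action"])   # -> "play calm, cheerful music"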
Referring to Fig. 2, which is an architecture diagram of the human-vehicle interaction system provided by an embodiment of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention: the human-vehicle interaction system can execute the method of the present invention on a microservices platform, with the different steps developed into individual small services, each with a single business function. Each service has its own processing and lightweight communication mechanism and can be deployed on one or more servers.
Continuing to refer to Fig. 2, it can be seen that every service unit in the human-vehicle interaction system is independent. The function of the voice input service is to receive the speech data and sample it;
the function of the data pre-processing service is to filter out background noise;
the function of the feature extraction service is to extract the speech-related features, including the speech features and the emotion features;
the function of the speech recognition service is to identify the content of the speech, i.e. what the speaker has said;
the function of the emotion recognition service is to identify the speaker's emotion information from the extracted features;
the function of the intelligent feedback service is to provide intelligent feedback based on the identified emotion information.
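To make the microservice decomposition concrete, a minimal Flask stub for one of these services (the emotion recognition service) is sketched below; the route, payload format and port are assumptions, not part of the patent.

    # Sketch: one independently deployable service of the system, here the emotion
    # recognition service. Route, payload and port are illustrative assumptions.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def classify_emotion(gemaps_vector):
        """Placeholder for the trained one-versus-one SVM classifier (see the sketch above)."""
        return "neutral"

    @app.route("/recognize-emotion", methods=["POST"])
    def recognize_emotion():
        gemaps_vector = request.get_json()["features"]   # the 62 GeMAPS functionals
        return jsonify({"emotion": classify_emotion(gemaps_vector)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5003)               # each service runs on its own server/port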
Referring to Fig. 2 and Fig. 3, to better illustrate the method of the invention, the working principle of the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the present invention is as follows. First, the user speaks a voice command to the human-vehicle interaction system: "I want to listen to music". The voice input server of the human-vehicle interaction system samples this speech and generates audio data. The data pre-processing service in the system then pre-processes the audio data to remove the background noise. The speech feature extraction service extracts the feature list relevant to speech recognition, and the feature list relevant to emotion recognition is extracted at the same time. The speech recognition service then performs speech recognition on the speech feature list and identifies the content of the speech, for example: play music; meanwhile the emotion recognition service performs emotion recognition on the emotion feature list and identifies the speaker's emotion information, for example: angry. The intelligent feedback service of the human-vehicle interaction system then produces an intelligent feedback result from the identified voice content and emotion information, for example opening the music player and playing a light, cheerful piece of music to soothe the user's mood. Finally the user receives the feedback, and this voice interaction ends.
In summary, the human-vehicle interaction method based on a micro-server and intelligent speech recognition provided by the embodiments of the present invention obtains the speech input by a user, samples it to generate audio data, pre-processes the audio data to remove background noise, extracts speech-recognition and emotion-recognition features from it, performs feature identification on these features to generate the voice content and the emotion information, queries a preset rule database according to the voice content and the emotion information, generates the result with the highest matching score, and executes that result to carry out the human-vehicle interaction. Compared with traditional human-vehicle interaction methods, the embodiments of the present invention are more intelligent and also provide emotion-tendency analysis.
The embodiments provided by the present invention have the following beneficial effects:
1. The machine-learning-based speech recognition is more accurate;
2. The machine-learning-based speech recognition is faster;
3. The emotion-tendency analysis makes the interaction more intelligent;
4. The distributed computing platform allows mass data to be processed in real time;
5. The distributed storage platform gives high robustness;
6. Applied to a system, the method is suitable for a variety of travel scenarios;
7. The microservices used in the system make operation and maintenance simple;
8. The system has data-mining capability;
9. The system supports input in multiple languages;
10. The system supports output in multiple languages.
The above is a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements are also regarded as falling within the protection scope of the present invention.

Claims (9)

1. A human-vehicle interaction method based on a micro-server and intelligent speech recognition, characterised by comprising the following steps:
obtaining the speech data input by a user, and sampling the speech data to generate audio data;
pre-processing the audio data to remove background noise in the audio data, and at the same time performing feature extraction on the audio data to generate speech-recognition features and emotion-recognition features;
performing feature identification on the speech-recognition features and the emotion-recognition features to generate voice content and emotion information;
querying a preset rule database according to the voice content and the emotion information, generating the result with the highest matching score, and executing the result to carry out the human-vehicle interaction.
2. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 1, characterised in that the pre-processing includes denoising, pre-emphasis, short-time analysis, framing, windowing and endpoint detection.
3. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 1, characterised in that performing feature extraction on the audio data to generate speech-recognition features and emotion-recognition features specifically comprises:
performing Mel-frequency cepstral coefficient (MFCC) extraction on the audio data to generate the MFCCs of the audio data, and using the MFCCs of the audio data as the speech-recognition features;
performing emotion feature extraction on the audio data with the GeMAPS feature set to generate the GeMAPS features of the audio data, and using the GeMAPS features of the audio data as the emotion-recognition features.
4. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 3, characterised in that the GeMAPS feature set includes 62 features, the 62 features are HSF features, and the 62 features are computed from 18 LLD features.
5. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 3, characterised in that performing MFCC extraction on the audio data, generating the MFCCs of the audio data and using the MFCCs of the audio data as the speech-recognition features specifically comprises:
framing and windowing the audio data and applying an FFT to each frame to obtain a linear spectrogram;
applying a Mel filter bank to the linear spectrogram and taking the logarithm to obtain a log-Mel spectrogram;
applying a DCT (discrete cosine transform) to the log-Mel spectrogram, retaining the 2nd to 13th coefficients of the result, and using these 12 coefficients as the MFCCs of the audio data, i.e. as the speech-recognition features.
6. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 5, characterised in that performing feature identification on the speech-recognition features and the emotion-recognition features to generate voice content and emotion information specifically comprises:
matching the feature parameters of the speech-recognition features against an acoustic model to generate the voice content of the speech;
classifying the emotion-recognition features with a preset SVM multi-class algorithm to obtain the emotion information of the speech.
7. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 6, characterised in that the emotion information has k classes, including happy, angry, fearful, sad, surprised and neutral.
8. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 7, characterised in that classifying the emotion-recognition features with the preset SVM multi-class algorithm specifically comprises:
designing k(k-1)/2 SVMs, using one SVM to classify between each pair of classes, and taking the class that receives the most votes as the final class.
9. The human-vehicle interaction method based on a micro-server and intelligent speech recognition according to claim 8, characterised in that when classifying the emotion-recognition features, the big-data Spark in-memory computing platform is used so that the result is obtained quickly.
CN201910758860.4A 2019-08-16 2019-08-16 Human-vehicle interaction method based on a micro-server and intelligent speech recognition Pending CN110534091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910758860.4A CN110534091A (en) 2019-08-16 2019-08-16 Human-vehicle interaction method based on a micro-server and intelligent speech recognition


Publications (1)

Publication Number Publication Date
CN110534091A true CN110534091A (en) 2019-12-03

Family

ID=68663448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910758860.4A Pending CN110534091A (en) Human-vehicle interaction method based on a micro-server and intelligent speech recognition

Country Status (1)

Country Link
CN (1) CN110534091A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544963A (en) * 2013-11-07 2014-01-29 东南大学 Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN106601231A (en) * 2016-12-22 2017-04-26 深圳市元征科技股份有限公司 Vehicle control method and apparatus
CN106803423A (en) * 2016-12-27 2017-06-06 智车优行科技(北京)有限公司 Man-machine interaction sound control method, device and vehicle based on user emotion state
CN106874016A (en) * 2017-03-07 2017-06-20 长江大学 A kind of new customizable big data platform architecture method
CN109712681A (en) * 2018-12-21 2019-05-03 河海大学常州校区 A kind of vehicle-mounted analysis system based on sign big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Florian Eyben et al.: "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing", IEEE Transactions on Affective Computing *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128178A (en) * 2019-12-31 2020-05-08 上海赫千电子科技有限公司 Voice recognition method based on facial expression analysis
CN111785294A (en) * 2020-06-12 2020-10-16 Oppo广东移动通信有限公司 Audio detection method and device, terminal and storage medium
CN111785294B (en) * 2020-06-12 2024-04-02 Oppo广东移动通信有限公司 Audio detection method and device, terminal and storage medium
CN111968622A (en) * 2020-08-18 2020-11-20 广州市优普科技有限公司 Attention mechanism-based voice recognition method, system and device
CN114141239A (en) * 2021-11-29 2022-03-04 江南大学 Voice short instruction identification method and system based on lightweight deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191203)