CN113035203A - Control method for dynamically changing voice response style - Google Patents

Info

Publication number
CN113035203A
CN113035203A (application CN202110326327.8A)
Authority
CN
China
Prior art keywords
audio
voice
voiceprint
user
frame
Prior art date
Legal status
Pending
Application number
CN202110326327.8A
Other languages
Chinese (zh)
Inventor
焦其意
陆涛
郭杰
Current Assignee
Hefei Meiling Union Technology Co Ltd
Original Assignee
Hefei Meiling Union Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hefei Meiling Union Technology Co Ltd filed Critical Hefei Meiling Union Technology Co Ltd
Priority to CN202110326327.8A priority Critical patent/CN113035203A/en
Publication of CN113035203A publication Critical patent/CN113035203A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/18 — Artificial neural networks; connectionist approaches
    • G10L13/00 — Speech synthesis; text-to-speech systems
    • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 — Execution procedure of a spoken command
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/03 — characterised by the type of extracted parameters
    • G10L25/18 — the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a control method for dynamically changing a voice response style, and relates to the field of intelligent household appliances. The method comprises the following steps: collecting voice data of different age groups and establishing a sample database; processing audio files in the sample database to obtain an audio frame sequence; performing a Fourier transform on each frame in the audio frame sequence to obtain the frame's spectrogram information, and extracting features from it to obtain a voiceprint feature vector; training on the voiceprint feature vectors to obtain a voiceprint feature model; detecting the user's voice with a microphone (MIC), collecting the voiceprint, and inputting it into the voiceprint feature model; and dynamically changing the voice response style according to the user type. The invention collects the voice of the refrigerator user, judges the user's age group from the voiceprint, and enables the voice response style corresponding to that age group, so that the intelligent appliance is more convenient for the elderly and children to use and the device's degree of intelligence is improved.

Description

Control method for dynamically changing voice response style
Technical Field
The invention belongs to the technical field of intelligent household appliances, and particularly relates to a control method for dynamically changing a voice response style.
Background
With the development of artificial intelligence, more and more devices can realize the function of voice interaction with users, for example, an intelligent robot can perform dialogue communication with users.
In the prior art, various devices can recognize the voice of a user through a voice recognition technology, determine the dialogue content with the user according to a pre-trained voice dialogue model, and finally play the audio of the dialogue content through a terminal, thereby completing the voice interaction with the user.
As devices continue to develop and intelligent equipment is constantly updated, voice response content becomes increasingly complex. For the elderly or for children it is difficult to understand and inconvenient to use, and this is a problem that needs to be solved.
Disclosure of Invention
The invention aims to provide a control method for dynamically changing a voice response style. By collecting the voice of a refrigerator user, judging the user's age group from the voiceprint, and enabling the voice response style corresponding to that age group, the method solves the problems that existing intelligent devices are difficult for the elderly and children to understand, inconvenient to use, and insufficiently engaging.
In order to solve the technical problems, the invention is realized by the following technical scheme:
the invention relates to a control method for dynamically changing a voice response style, which comprises the following steps:
step S1: collecting voice data of different age groups and establishing a sample database;
step S2: processing audio files in the sample database to obtain an audio frame sequence;
step S3: fourier transformation is carried out on each frame in the audio frame sequence to obtain spectrogram information of the frame;
step S4: extracting the features of the spectrogram information to obtain a voiceprint feature vector;
step S5: inputting the voiceprint feature vector into a convolutional neural network model for training to obtain a voiceprint feature model;
step S6: the microphone (MIC) detects the user's voice, collects the voiceprint, and inputs it into the voiceprint feature model;
step S7: the intelligent refrigerator determines the type of the user according to the voiceprint recognition result;
step S8: and dynamically changing the voice response style according to the user type.
Preferably, in step S1, users are divided into three age groups: children, young adults and the elderly, whose corresponding voice response styles are lively, humorous and traditional, respectively.
Preferably, in step S2, the audio frame sequence obtaining step includes:
step S21: sampling and quantizing the audio file;
step S22: converting the audio into a digital signal with a fixed number of bits at a fixed sampling rate;
step S23: performing pre-emphasis processing on the audio digital signal;
step S24: performing framing and windowing on the voice signal;
step S25: obtaining the speech frame sequence.
Preferably, in step S3, a Fourier transform is performed on each frame in the audio frame sequence to obtain the frequency spectrum of each frame; the squared modulus of each frame's spectrum is taken to obtain the power spectrum of the audio sequence; the power spectrum is filtered through a preset filter to obtain the logarithmic energy of the audio sequence; and a discrete cosine transform is applied to the logarithmic energy to obtain the audio feature vector.
Preferably, in step S4, the time domain information and frequency domain information of the spectrogram are input into a two-dimensional convolutional neural network to obtain the time domain features and frequency domain features of the sound data; after feature aggregation, the aggregated features are input into a fully connected layer to obtain a voiceprint feature vector.
Preferably, in step S5, inputting the voiceprint feature vector into a convolutional neural network model for training to obtain a voiceprint model for recognizing voiceprints includes:
extracting local voiceprint information from the voiceprint feature vector through the convolutional layers of the convolutional neural network model;
connecting the extracted local voiceprint information through a fully connected layer of the model to obtain multi-dimensional local voiceprint information;
and performing dimensionality reduction on the multi-dimensional local voiceprint information through a pooling layer of the model to obtain the voiceprint feature model.
Preferably, in step S7, after the user type is determined, matching is performed against a pre-trained interaction style model and a response text library to determine the target interactive text; according to the adjustment parameters corresponding to the current user type, the target text is converted into response audio consistent with the voice response style, and the audio is played.
The invention has the following beneficial effects:
An audio file in the sample database is processed to obtain an audio frame sequence; each frame of the sequence is Fourier-transformed and a voiceprint feature vector is extracted; the voiceprint feature vectors are input into a convolutional neural network model for training to obtain a voiceprint feature model. The user's voice data is then input into the voiceprint feature model, the user's age group is judged from the voiceprint, and the voice response style corresponding to that age group is enabled, so that the elderly and children can use the intelligent device more conveniently and the device's degree of intelligence is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a step diagram of a control method for dynamically changing the voice response style according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a control method for dynamically changing the voice response style, comprising the following steps:
step S1: collecting voice data of different age groups and establishing a sample database;
step S2: processing audio files in the sample database to obtain an audio frame sequence;
step S3: fourier transformation is carried out on each frame in the audio frame sequence to obtain spectrogram information of the frame;
step S4: extracting the features of the spectrogram information to obtain a voiceprint feature vector;
step S5: inputting the voiceprint feature vector into a convolutional neural network model for training to obtain a voiceprint feature model;
step S6: the microphone (MIC) detects the user's voice, collects the voiceprint, and inputs it into the voiceprint feature model;
step S7: the intelligent refrigerator determines the type of the user according to the voiceprint recognition result;
step S8: and dynamically changing the voice response style according to the user type.
In step S1, users are divided into three age groups: children, young adults and the elderly, whose corresponding voice response styles are lively, humorous and traditional, respectively. Here children are defined as under 14 years old, young adults as 14 to 60, and the elderly as over 60. When a child turns on the refrigerator and gives a voice instruction, the refrigerator converses with the child in a lively tone and guides the child in using it, which both aids understanding and steers the child toward correct operation, for example: "Children should eat less ice cream, and remember to close the refrigerator door." Similarly, when an elderly person uses the refrigerator, it communicates in steady, temperate wording and offers reminders and care, for example: "Food just taken out of the refrigerator should not be eaten directly; it is best to heat it before eating."
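For illustration only, this age-band-to-style mapping can be sketched in a few lines of Python; the band boundaries (under 14, 14 to 60, over 60) are those given in the text, while the function and style names are assumptions of this sketch, not part of the disclosure:

```python
# Minimal sketch of the age-band-to-style mapping described in step S1.
# Band boundaries follow the text; identifier names are illustrative.
def response_style_for_age(age: int) -> str:
    """Map an estimated speaker age to a voice-response style."""
    if age < 14:
        return "lively"       # children: lively, guiding tone
    if age <= 60:
        return "humorous"     # young adults: humorous tone
    return "traditional"      # elderly: steady, temperate tone
```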
A voiceprint captures physiological or behavioral characteristics of a speaker extracted from the speech waveform, which are then used for feature matching. To implement voiceprint recognition, speakers of different age groups first input multiple voice samples into the system, and personal features are extracted with voiceprint feature extraction techniques. These data are stored in a database through voiceprint modeling; recognition then compares the models stored in the database against the voiceprint features to be verified, finally identifying the age group of the speaker.
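As a rough sketch of this enrollment-and-matching idea (not the patent's exact procedure), voiceprint vectors collected per age group could be averaged into reference models and a query compared by cosine similarity; all names below are hypothetical:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(samples_by_group: dict) -> dict:
    """Average each age group's voiceprint vectors into one reference model."""
    return {g: np.mean(np.stack(v), axis=0) for g, v in samples_by_group.items()}

def identify_group(query: np.ndarray, models: dict) -> str:
    """Return the age group whose reference vector best matches the query."""
    return max(models, key=lambda g: cosine(query, models[g]))
```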
In step S2, the audio frame sequence obtaining step includes:
step S21: sampling and quantizing the audio file;
step S22: converting the audio into a digital signal with a fixed number of bits at a fixed sampling rate;
step S23: performing pre-emphasis processing on the audio digital signal;
step S24: performing framing and windowing on the voice signal;
step S25: obtaining the speech frame sequence.
The fundamental frequency of speech is about 100 Hz for men and about 200 Hz for women, corresponding to periods of 10 ms and 5 ms respectively. An audio frame should contain several such periods, generally at least 20 ms, and the speaker's gender can be judged from the audio frame.
In step S23, pre-emphasis boosts the high-frequency components of the audio signal so that its spectrum becomes relatively flat from low frequency to high frequency. A high-pass filter is used to boost the high-frequency components; its response characteristic is

H(z) = 1 − u·z⁻¹

where the pre-emphasis coefficient u takes values in the range [0.9, 1].
pre-emphasis (Pre-emphasis) is a method of compensating for high frequency components of a transmission signal in advance at a transmitting end. Pre-emphasis is performed because the signal energy distribution is not uniform, and the signal-to-noise ratio (SNR) at the high frequency end of the speech signal may drop to the threshold range. The power spectrum of the voice signal is in inverse proportion to the frequency, the energy of the low-frequency region is high, the energy of the high-frequency region is low, and the reason of uneven distribution is considered, so that the signal amplitude generating the maximum frequency deviation can be speculatively judged to be mostly in the low frequency. And the noise power spectrum is pre-emphasized by changing the expression mode. This is an undesirable result for both people and therefore counter-balancing pre-emphasis and de-emphasis occurs. The pre-emphasis is to improve the high-frequency signal, remove the influence of glottis and lips, and facilitate the research on the influence of sound channels. However, in order to restore the original signal power distribution as much as possible, it is necessary to perform a reverse process, that is, a de-emphasis technique for de-emphasizing a high-frequency signal. In the process of the step, the high-frequency component of the noise is reduced, and it is unexpected that pre-emphasis has no influence on the noise, so that the output signal-to-noise ratio (SNR) is effectively improved.
In step S3, a Fourier transform is performed on each frame in the audio frame sequence to obtain the frequency spectrum of each frame; the squared modulus of each frame's spectrum is taken to obtain the power spectrum of the audio sequence; the power spectrum is filtered through a preset filter to obtain the logarithmic energy of the audio sequence; and a discrete cosine transform is applied to the logarithmic energy to obtain the audio feature vector.
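The per-frame chain in step S3 (FFT, squared modulus, filter bank, logarithm, discrete cosine transform) is essentially an MFCC-style computation. A minimal NumPy/SciPy sketch follows, where `mel_filters` is assumed to be a precomputed filter-bank matrix, since the text says only "a preset filter":

```python
import numpy as np
from scipy.fftpack import dct

def frame_features(frame: np.ndarray, mel_filters: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """FFT -> power spectrum -> filter-bank log energy -> DCT feature vector."""
    spectrum = np.fft.rfft(frame, n=n_fft)        # frequency spectrum of the frame
    power = (np.abs(spectrum) ** 2) / n_fft       # power spectrum (squared modulus)
    energy = mel_filters @ power                  # preset filter bank
    log_energy = np.log(energy + 1e-10)           # logarithmic energy
    return dct(log_energy, type=2, norm="ortho")  # discrete cosine transform
```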
In step S24, the sampled and normalized audio signal x(n) is framed and windowed: a window function w(n) of a certain length is multiplied with the audio signal x(n) to obtain the windowed signal of each frame, x_i(n). Commonly used window functions are the Hamming, Hanning and rectangular windows. The formula is as follows:

x_i(n) = w(n) · x(n)
Hamming window:

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1

Hanning window:

w(n) = 0.5·[1 − cos(2πn/(N−1))], 0 ≤ n ≤ N−1

Rectangular window:

w(n) = 1 for 0 ≤ n ≤ N−1, and w(n) = 0 otherwise

where N is the window (frame) length.
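A short sketch of the framing-and-windowing step with a Hamming window is given below; the 25 ms frame length and 10 ms shift are common values consistent with the "at least 20 ms" guideline above, not figures mandated by the text:

```python
import numpy as np

def frame_and_window(x: np.ndarray, fs: int, frame_ms: int = 25, shift_ms: int = 10) -> np.ndarray:
    """Split signal x into overlapping frames and apply x_i(n) = w(n) * x(n)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    assert len(x) >= frame_len, "signal shorter than one frame"
    w = np.hamming(frame_len)  # Hamming window w(n)
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift:i * shift + frame_len] * w for i in range(n_frames)])
```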
In step S4, the time domain information and frequency domain information of the spectrogram are input into a two-dimensional convolutional neural network to obtain the time domain features and frequency domain features of the sound data; after feature aggregation, the aggregated features are input into a fully connected layer to obtain a voiceprint feature vector.
In step S5, inputting the voiceprint feature vector into the convolutional neural network model for training to obtain a voiceprint model for recognizing voiceprints includes:
extracting local voiceprint information from the voiceprint feature vector through the convolutional layers of the convolutional neural network model;
connecting the extracted local voiceprint information through a fully connected layer of the model to obtain multi-dimensional local voiceprint information;
and performing dimensionality reduction on the multi-dimensional local voiceprint information through a pooling layer of the model to obtain the voiceprint feature model.
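Under the assumption of a PyTorch implementation (the patent does not name a framework), the convolution / pooling / fully connected structure described in steps S4 and S5 might look like the following sketch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class VoiceprintNet(nn.Module):
    """Sketch: convolutions extract local voiceprint information, pooling
    reduces dimensionality, and fully connected layers yield an embedding
    plus an age-group prediction (child / young adult / elderly)."""
    def __init__(self, n_groups: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                # dimensionality reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # feature aggregation
        )
        self.embed = nn.Linear(32 * 4 * 4, 128)   # voiceprint feature vector
        self.classify = nn.Linear(128, n_groups)  # age-group logits

    def forward(self, spec: torch.Tensor):
        z = self.features(spec).flatten(1)
        e = self.embed(z)
        return e, self.classify(e)
```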
In step S7, after the user type is determined, matching is performed against the pre-trained interaction style model and the response text library to determine the target interactive text; according to the adjustment parameters corresponding to the current user type, the target text is converted into response audio consistent with the voice response style, and the audio is played.
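A hypothetical sketch of this last stage is given below; the text states only that adjustment parameters for the current user type convert the target text into style-consistent response audio, so the parameter names (speaking rate, pitch shift) and values here are illustrative assumptions:

```python
# Illustrative style-adjustment parameters per user type (values are assumptions).
STYLE_PARAMS = {
    "lively":      {"rate": 1.1, "pitch_shift": +2.0},  # children
    "humorous":    {"rate": 1.0, "pitch_shift":  0.0},  # young adults
    "traditional": {"rate": 0.9, "pitch_shift": -1.0},  # elderly
}

def render_response(text: str, style: str) -> dict:
    """Bundle the matched response text with the style's TTS adjustment parameters."""
    return {"text": text, **STYLE_PARAMS[style]}
```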
It should be noted that, in the above embodiment, the units included are divided only according to functional logic; the division is not limited thereto as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
In addition, it is understood by those skilled in the art that all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (7)

1. A control method for dynamically changing a voice response style is characterized by comprising the following steps:
step S1: collecting voice data of different age groups and establishing a sample database;
step S2: processing audio files in the sample database to obtain an audio frame sequence;
step S3: fourier transformation is carried out on each frame in the audio frame sequence to obtain spectrogram information of the frame;
step S4: extracting the features of the spectrogram information to obtain a voiceprint feature vector;
step S5: inputting the voiceprint feature vector into a convolutional neural network model for training to obtain a voiceprint feature model;
step S6: the microphone (MIC) detects the user's voice, collects the voiceprint, and inputs it into the voiceprint feature model;
step S7: the intelligent refrigerator determines the type of the user according to the voiceprint recognition result;
step S8: and dynamically changing the voice response style according to the user type.
2. The control method according to claim 1, wherein in step S1, users are divided into three age groups: children, young adults and the elderly, whose corresponding voice response styles are lively, humorous and traditional, respectively.
3. The control method for dynamically changing the voice response style according to claim 1, wherein in step S2, the audio frame sequence obtaining step comprises:
step S21: sampling and quantizing the audio file;
step S22: converting the audio into a digital signal with a fixed number of bits at a fixed sampling rate;
step S23: performing pre-emphasis processing on the audio digital signal;
step S24: performing framing and windowing on the voice signal;
step S25: obtaining the speech frame sequence.
4. The method as claimed in claim 1, wherein in step S3, a Fourier transform is performed on each frame in the audio frame sequence to obtain the frequency spectrum of each frame; the squared modulus of each frame's spectrum is taken to obtain the power spectrum of the audio sequence; the power spectrum is filtered through a preset filter to obtain the logarithmic energy of the audio sequence; and a discrete cosine transform is applied to the logarithmic energy to obtain the audio feature vector.
5. The method as claimed in claim 1, wherein in step S4, the time domain information and frequency domain information of the spectrogram are input into a two-dimensional convolutional neural network to obtain the time domain features and frequency domain features of the sound data; after feature aggregation, the aggregated features are input into a fully connected layer to obtain a voiceprint feature vector.
6. The method as claimed in claim 1, wherein in step S5, inputting the voiceprint feature vector into the convolutional neural network model for training to obtain a voiceprint model for recognizing voiceprints comprises:
extracting local voiceprint information from the voiceprint feature vector through the convolutional layers of the convolutional neural network model;
connecting the extracted local voiceprint information through a fully connected layer of the model to obtain multi-dimensional local voiceprint information;
and performing dimensionality reduction on the multi-dimensional local voiceprint information through a pooling layer of the model to obtain the voiceprint feature model.
7. The method as claimed in claim 1, wherein in step S7, after the user type is determined, matching is performed against a pre-trained interaction style model and a response text library to determine the target interactive text; according to the adjustment parameters corresponding to the current user type, the target text is converted into response audio consistent with the voice response style, and the audio is played.
Application CN202110326327.8A, filed 2021-03-26 (priority date 2021-03-26) — Control method for dynamically changing voice response style — publication CN113035203A (pending)

Priority Applications (1)

CN202110326327.8A — priority date 2021-03-26, filing date 2021-03-26 — Control method for dynamically changing voice response style (CN113035203A)


Publications (1)

CN113035203A — published 2021-06-25

Family

ID=76474338

Family Applications (1)

CN202110326327.8A — Control method for dynamically changing voice response style; priority date 2021-03-26; filing date 2021-03-26; status: pending

Country Status (1)

CN — CN113035203A


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
CN110398897A (en) * 2018-04-25 2019-11-01 北京快乐智慧科技有限责任公司 A kind of Multi-mode switching method and system of intellectual product
WO2020006935A1 (en) * 2018-07-05 2020-01-09 平安科技(深圳)有限公司 Method and device for extracting animal voiceprint features and computer readable storage medium
CN111352348A (en) * 2018-12-24 2020-06-30 北京三星通信技术研究有限公司 Device control method, device, electronic device and computer-readable storage medium
CN112331193A (en) * 2019-07-17 2021-02-05 华为技术有限公司 Voice interaction method and related device
CN110336723A (en) * 2019-07-23 2019-10-15 珠海格力电器股份有限公司 Control method and device of intelligent household appliance and intelligent household appliance
CN110544472A (en) * 2019-09-29 2019-12-06 上海依图信息技术有限公司 Method for improving performance of voice task using CNN network structure
CN110648669A (en) * 2019-09-30 2020-01-03 上海依图信息技术有限公司 Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium
CN111061953A (en) * 2019-12-18 2020-04-24 深圳市优必选科技股份有限公司 Intelligent terminal interaction method and device, terminal equipment and storage medium
CN111354364A (en) * 2020-04-23 2020-06-30 上海依图网络科技有限公司 Voiceprint recognition method and system based on RNN aggregation mode
CN111833884A (en) * 2020-05-27 2020-10-27 北京三快在线科技有限公司 Voiceprint feature extraction method and device, electronic equipment and storage medium
CN111951791A (en) * 2020-08-26 2020-11-17 上海依图网络科技有限公司 Voiceprint recognition model training method, recognition method, electronic device and storage medium
CN112489636A (en) * 2020-10-15 2021-03-12 南京创维信息技术研究院有限公司 Intelligent voice broadcast assistant selection method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707154A (en) * 2021-09-03 2021-11-26 上海瑾盛通信科技有限公司 Model training method and device, electronic equipment and readable storage medium
CN113707154B (en) * 2021-09-03 2023-11-10 上海瑾盛通信科技有限公司 Model training method, device, electronic equipment and readable storage medium
CN115101048A (en) * 2022-08-24 2022-09-23 深圳市人马互动科技有限公司 Science popularization information interaction method, device, system, interaction equipment and storage medium
CN115101048B (en) * 2022-08-24 2022-11-11 深圳市人马互动科技有限公司 Science popularization information interaction method, device, system, interaction equipment and storage medium
CN117975971A (en) * 2024-04-02 2024-05-03 暨南大学 Voiceprint age group estimation method and system based on privacy protection

Similar Documents

Publication Publication Date Title
CN113035203A (en) Control method for dynamically changing voice response style
CN103280220B (en) A kind of real-time recognition method for baby cry
CN109147763B (en) Audio and video keyword identification method and device based on neural network and inverse entropy weighting
CN109999314A (en) One kind is based on brain wave monitoring Intelligent sleep-assisting system and its sleep earphone
CN106847281A (en) Intelligent household voice control system and method based on voice fuzzy identification technology
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN103021405A (en) Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter
CN111583936A (en) Intelligent voice elevator control method and device
CN115602165B (en) Digital employee intelligent system based on financial system
WO2023184942A1 (en) Voice interaction method and apparatus and electric appliance
CN110867192A (en) Speech enhancement method based on gated cyclic coding and decoding network
CN112102846A (en) Audio processing method and device, electronic equipment and storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
CN112151071A (en) Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN113571078A (en) Noise suppression method, device, medium, and electronic apparatus
CN112017658A (en) Operation control system based on intelligent human-computer interaction
CN111081249A (en) Mode selection method, device and computer readable storage medium
WO2017177629A1 (en) Far-talking voice recognition method and device
CN107393539A (en) A kind of sound cipher control method
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN116434758A (en) Voiceprint recognition model training method and device, electronic equipment and storage medium
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210625)