CN112201277B - Voice response method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN112201277B
Authority
CN
China
Prior art keywords
voice
intonation
user
type
response
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011052933.7A
Other languages
Chinese (zh)
Other versions
CN112201277A (en)
Inventor
申亚坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Application filed by Bank of China Ltd
Priority to CN202011052933.7A
Publication of CN112201277A
Application granted
Publication of CN112201277B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides a voice response method, device, equipment, and computer readable storage medium. The method comprises: acquiring a user voice; determining the intonation type corresponding to the user voice according to the voice features and voice content of the user voice; generating a response voice corresponding to the user voice based on the intonation type and the voice content; and finally broadcasting the response voice. Because the broadcast response voice is derived from both the intonation type and the voice content of the user voice, the response differs whenever the intonation type of the user voice differs. This realizes personalized responses to the user voice and thereby improves the user experience. In addition, because the intonation type corresponding to the user voice is determined from two dimensions, the voice features and the voice content, it has higher accuracy, which in turn improves the accuracy of the broadcast response voice.

Description

Voice response method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular to a voice response method and apparatus, an electronic device, and a computer readable storage medium.
Background
In many service scenarios, intelligent voice response devices are provided for voice interaction with users. At present, however, the response mode of many intelligent voice response devices is relatively uniform: for example, they respond with a single, fixed intonation, cannot respond in a personalized way to different user voices, and therefore cannot improve the user's service experience.
Disclosure of Invention
The application provides a voice response method and device, an electronic device, and a computer readable storage medium, aiming to solve the problem of how a voice response device can respond in a personalized way according to the user voice.
In order to achieve the above object, the present application provides the following technical solutions:
a method of voice response, comprising:
acquiring user voice;
determining the intonation type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and broadcasting the response voice.
In the above method, optionally, the intonation types include at least two designated intonation types, each of which is preset according to the voice features and voice content of historical user voices;
the voice features include at least pitch features and amplitude features.
In the above method, optionally, determining the intonation type corresponding to the user voice according to the voice features and voice content of the user voice includes:
inputting the user voice into a pre-trained Bayesian classification model, so that the Bayesian classification model determines the intonation type corresponding to the user voice according to the voice features of the user voice;
recognizing the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model, so that the voice classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types output for the user voice by the Bayesian classification model and the voice classification model;
and if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are the same, taking that intonation type as the intonation type corresponding to the user voice.
The above method, optionally, further comprises:
if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are different, determining the intonation type corresponding to the user voice to be a preset default intonation type.
In the above method, optionally, the Bayesian classification model is trained on voice training samples that carry the voice features;
the Bayesian classification model determines the intonation type corresponding to the user voice as follows: according to the voice features of the user voice, it calculates the probability that the user voice belongs to each intonation type, and takes the intonation type with the largest probability value as the intonation type corresponding to the user voice.
In the above method, optionally, the voice classification model is a GA-BP neural network model, obtained by optimizing an initial BP neural network model;
the number of input-layer nodes of the initial BP neural network model is determined by the voice content length of the voice training samples, the number of output-layer nodes is determined by the number of intonation types, and the number of hidden-layer nodes is determined by trial and error;
the initial BP neural network model is optimized as follows: according to preset sample data and a genetic algorithm, the initial weights and thresholds of the input layer, hidden layer, and output layer of the initial BP neural network model are trained, and the optimal initial weights and thresholds of each layer are determined, yielding the optimized BP neural network model.
In the above method, optionally, generating the response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content includes:
determining the response voice content based on the voice content;
generating a response voice whose voice content is the response voice content and whose intonation type is the intonation type corresponding to the user voice.
An apparatus for voice response, comprising:
an acquisition unit, configured to acquire the user voice;
a determining unit, configured to determine the intonation type corresponding to the user voice according to the voice features and voice content of the user voice;
a generating unit, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and a broadcasting unit, configured to broadcast the response voice.
A voice response apparatus comprising: a processor and a memory for storing a program; the processor is configured to run the program to implement the method of voice response described above.
A computer readable storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform the method of voice response described above.
The method and device disclosed by the application acquire a user voice, determine the intonation type corresponding to the user voice according to the voice features and voice content of the user voice, generate a response voice corresponding to the user voice based on the intonation type and the voice content, and finally broadcast the response voice. Because the broadcast response voice is derived from both the intonation type and the voice content of the user voice, the response differs whenever the intonation type of the user voice differs. This realizes personalized responses to the user voice and thereby improves the user experience.
In addition, because the intonation type corresponding to the user voice is determined from two dimensions, the voice features and the voice content, it has higher accuracy, which in turn improves the accuracy of the broadcast response voice.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a voice response method provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method for determining the intonation type corresponding to a user voice provided by an example of the present application;
FIG. 3 is a schematic structural diagram of a voice response device provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a voice response apparatus provided by an embodiment of the present application.
Detailed Description
On many occasions, intelligent voice broadcasting devices are used for voice interaction with users. At present, however, many intelligent voice response devices attend only to the content of the user voice and not to its intonation, so they generally respond with a single, uniform intonation; they cannot respond in a personalized way to different user voices and therefore cannot improve the user's service experience.
Therefore, an embodiment of the present application provides a voice response method that responds to the user by combining the intonation of the user voice with its voice content, so as to realize personalized responses to different user voices.
In this application, the voice content of a user voice refers to the text content corresponding to that voice.
In order to make the above objects, features, and advantages of the present application more apparent, specific embodiments are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the present disclosure without inventive effort fall within the scope of the present disclosure.
The execution subject of this embodiment is an intelligent voice broadcasting device with a voice processing function, such as an intelligent voice robot.
Fig. 1 shows a voice response method provided by an embodiment of the present application, which may include the following steps:
s101, acquiring user voice.
The user voice is the voice uttered by a user; in the running state, the intelligent voice broadcasting device acquires any user voice within its voice acquisition range.
S102, determining the intonation type corresponding to the user voice according to the voice characteristics and the voice content of the user voice.
In this embodiment, the voice features are information that describes the mood and emotional attitude of the user voice, and include pitch features, amplitude features, timbre features, and the like.
The intonation types include at least two designated intonation types, each preset according to the voice features and voice content of historical user voices; that is, a designated intonation type is defined by the voice information conveying the mood and emotional attitude of historical user voices together with their voice content. The designated intonation types may include, for example, a fun-like interactive intonation type and a gentle formal interactive intonation type. The fun-like interactive intonation type may be one in which the pitch or amplitude of the voice varies greatly and the voice content is only weakly correlated with service inquiry questions; the gentle formal interactive intonation type may be one in which the pitch or amplitude of the voice varies little and the voice content is strongly correlated with service inquiry questions.
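To make the two designated intonation types concrete, the following is a minimal Python sketch of how pitch and amplitude variability might be measured from a raw mono waveform. The threshold values, the function names, and the use of the zero-crossing rate as a crude pitch proxy are all illustrative assumptions, not the patent's method:

```python
import numpy as np

# Hypothetical labels for the two designated intonation types described above.
FUN_TYPE = "fun_interactive"
FORMAL_TYPE = "gentle_formal"

def frame_signal(samples: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Split a mono waveform (values in [-1, 1]) into non-overlapping frames."""
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)

def variability_features(samples: np.ndarray) -> tuple[float, float]:
    """Return (amplitude variability, pitch-proxy variability) for an utterance."""
    frames = frame_signal(samples)
    rms = np.sqrt((frames ** 2).mean(axis=1))  # per-frame amplitude
    # Zero-crossing rate as a crude, assumption-laden stand-in for pitch.
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return float(np.std(rms)), float(np.std(zcr))

def rough_intonation_guess(samples: np.ndarray,
                           amp_thresh: float = 0.05,
                           pitch_thresh: float = 0.02) -> str:
    """Large variation suggests the fun-like type; small, the gentle formal type."""
    amp_var, pitch_var = variability_features(samples)
    if amp_var > amp_thresh or pitch_var > pitch_thresh:
        return FUN_TYPE
    return FORMAL_TYPE
```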
For a specific embodiment of this step, reference may be made to the flowchart shown in fig. 2.
S103, generating response voice corresponding to the user voice based on the intonation type and the voice content corresponding to the user voice.
A specific implementation of this step comprises steps A1 and A2:
A1, determining the response voice content based on the voice content of the user voice.
The response voice content corresponding to the voice content of the user voice is determined from that voice content; for example, it may be determined from keywords included in the voice content.
Of course, in this step the response voice content may also be determined based on both the voice content and the intonation type of the user voice. That is, the content of the response voice is related not only to the voice content of the user voice but also to its intonation type: for the same voice content, the response content may differ under different intonation types, which gives the response better personalization.
A2, generating a response voice whose voice content is the response voice content and whose intonation type is the intonation type corresponding to the user voice.
Making the intonation type of the response voice the same as that of the user voice enhances the personalized effect of the response voice.
S104, broadcasting the response voice.
For example, the intelligent voice broadcasting device invokes a preset voice broadcasting module to broadcast the response voice.
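The following is a minimal sketch of the overall S101-S104 flow. The speech recognizer, intonation classifier, synthesizer, and player are injected as placeholder callables, and the keyword-to-response table is a hypothetical example of step A1; none of this is the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class ResponsePlan:
    text: str
    intonation_type: str

# Hypothetical keyword -> response-content table (step A1).
RESPONSE_TABLE = {
    "balance": "Your account balance inquiry is being processed.",
    "hello": "Hello! How can I help you today?",
}

def determine_response_content(user_text: str) -> str:
    """Pick response content from keywords in the recognized voice content."""
    for keyword, reply in RESPONSE_TABLE.items():
        if keyword in user_text.lower():
            return reply
    return "Could you please repeat that?"

def voice_response(audio, recognize, classify_intonation, synthesize, play):
    """S101-S104: acquire, classify, generate, broadcast."""
    user_text = recognize(audio)                         # S102: voice content
    intonation = classify_intonation(audio, user_text)   # S102: intonation type
    plan = ResponsePlan(determine_response_content(user_text), intonation)  # S103
    play(synthesize(plan.text, intonation=plan.intonation_type))            # S104
```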
The method provided by this embodiment acquires a user voice, determines the intonation type corresponding to the user voice according to the voice features and voice content of the user voice, generates a response voice corresponding to the user voice based on the intonation type and the voice content, and finally broadcasts the response voice. Because the broadcast response voice is derived from both the intonation type and the voice content of the user voice, the response differs whenever the intonation type of the user voice differs. This realizes personalized responses to the user voice and thereby improves the user experience.
In addition, because the intonation type corresponding to the user voice is determined from two dimensions, the voice features and the voice content, it has higher accuracy, which in turn improves the accuracy of the broadcast response voice.
Fig. 2 shows a specific implementation of S102 in the foregoing embodiment, namely determining the intonation type corresponding to the user voice according to the voice features and voice content of the user voice; it may include the following steps:
s201, inputting the user voice into a pre-trained Bayesian classification model, so that the Bayesian classification model determines the intonation type corresponding to the user voice according to the voice characteristics of the user voice.
In this step, the Bayesian classification model is trained on voice training samples, each of which carries a plurality of voice features; for the method of training the model on these samples, reference may be made to the prior art.
The pre-trained Bayesian classification model extracts the voice features of the user voice and determines the intonation type corresponding to the user voice based on those features.
Specifically, the Bayesian classification model calculates, according to the voice features of the user voice, the probability that the user voice belongs to each designated intonation type, and takes the designated intonation type with the largest probability value as the intonation type corresponding to the user voice.
For example, let X denote the feature set of all voice features of the user voice, and let Y1 denote the first intonation type. The probability that the user voice belongs to the first intonation type is obtained by substituting the voice features of the user voice into the following probability formula:

$$P(Y_1 \mid X) = \frac{P(Y_1)\,\prod_{i=1}^{n} P(A_i \mid Y_1)}{P(X)}$$

where $P(Y_1 \mid X)$ is the probability that the user voice belongs to the first intonation type $Y_1$ given its feature set $X$; $A_i$ is the i-th feature in the feature set $X$ corresponding to the user voice; $n$ is the number of features in $X$; $P(Y_1)$ is the probability that any voice belongs to the first intonation type $Y_1$; $P(A_i \mid Y_1)$ is the probability that a voice of intonation type $Y_1$ exhibits feature $A_i$; and $P(X)$ is the probability of the feature set $X$ occurring over all designated intonation types, with $P(A_i)$ the probability that any voice exhibits feature $A_i$.

Here $P(Y_1)$, $P(A_i \mid Y_1)$, and $P(A_i)$ are estimated in advance from a number of feature sets $X$ whose intonation types have been determined. The larger the number of such feature sets and the more accurate their intonation labels, the more accurate the estimated $P(Y_1)$, $P(A_i \mid Y_1)$, and $P(A_i)$.
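As an illustration of the formula above, the following is a minimal naive Bayes sketch over discretized voice features. Since P(X) is the same for every intonation type, the sketch compares only the numerators (in log space); the training-data format, the add-one smoothing, and the discretization of features are assumptions:

```python
import math
from collections import defaultdict

def train(samples):
    """samples: list of (feature_tuple, intonation_type) pairs."""
    type_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))  # type -> (i, value) -> count
    for features, y in samples:
        type_counts[y] += 1
        for i, a in enumerate(features):
            feat_counts[y][(i, a)] += 1
    return type_counts, feat_counts

def classify(features, type_counts, feat_counts):
    """Return the intonation type with the largest posterior probability."""
    total = sum(type_counts.values())
    best_type, best_score = None, -math.inf
    for y, count in type_counts.items():
        # log P(Y) + sum_i log P(A_i | Y); P(X) is constant across types.
        score = math.log(count / total)
        for i, a in enumerate(features):
            # Add-one smoothing so unseen feature values don't zero the product.
            score += math.log((feat_counts[y][(i, a)] + 1) / (count + 2))
        if score > best_score:
            best_type, best_score = y, score
    return best_type
```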
S202, recognizing the voice content corresponding to the user voice.
In this step, an existing voice recognition method may be used to obtain the voice content of the user voice.
S203, inputting the voice content of the user voice into a pre-trained voice classification model, so that the voice classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice.
The Bayesian classification model determines the intonation type corresponding to the user voice based on the voice features of the voice, whereas the voice classification model determines it according to the voice content of the user voice.
Optionally, the voice classification model is a GA-BP neural network model, obtained by optimizing an initial BP neural network model. Once trained, the voice classification model outputs the intonation type corresponding to the input voice content.
The number of input-layer nodes of the initial BP neural network model is determined by the voice content length of the voice training samples, the number of output-layer nodes is determined by the number of intonation types, and the number of hidden-layer nodes is determined by trial and error. The voice training samples are the voice contents of historical user voices labeled with intonation types.
The initial BP neural network model is optimized as follows: according to preset sample data and a genetic algorithm, the initial weights and thresholds of the input layer, hidden layer, and output layer are trained, and the optimal initial weights and thresholds of each layer are determined, yielding the optimized BP neural network model. For the specific optimization procedure, reference may be made to the prior art.
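The following is a minimal sketch of the genetic-algorithm step: a population of flattened weight-and-threshold vectors for a small three-layer network is evolved by selection, one-point crossover, and mutation, and the fittest vector is returned as the initial weights for ordinary BP training. The network shape, fitness function, and GA hyperparameters are assumptions, not the patent's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def unpack(vec, n_in, n_hid, n_out):
    """Split a flat chromosome into (W1, b1, W2, b2): weights and thresholds."""
    i = 0
    W1 = vec[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = vec[i:i + n_hid]; i += n_hid
    W2 = vec[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = vec[i:i + n_out]
    return W1, b1, W2, b2

def fitness(vec, X, Y, n_in, n_hid, n_out):
    """Negative mean squared error of the untrained forward pass."""
    W1, b1, W2, b2 = unpack(vec, n_in, n_hid, n_out)
    H = np.tanh(X @ W1 + b1)
    P = np.tanh(H @ W2 + b2)
    return -np.mean((P - Y) ** 2)

def ga_init_weights(X, Y, n_in, n_hid, n_out, pop=30, gens=50):
    dim = n_in * n_hid + n_hid + n_hid * n_out + n_out
    population = rng.normal(0, 1, (pop, dim))
    for _ in range(gens):
        scores = np.array([fitness(v, X, Y, n_in, n_hid, n_out) for v in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]  # selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, dim)
            child = np.concatenate([a[:cut], b[cut:]])               # crossover
            child += rng.normal(0, 0.1, dim) * (rng.random(dim) < 0.05)  # mutation
            children.append(child)
        population = np.vstack([parents, children])
    scores = [fitness(v, X, Y, n_in, n_hid, n_out) for v in population]
    best = population[int(np.argmax(scores))]
    return unpack(best, n_in, n_hid, n_out)  # feed into ordinary BP training
```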
S204, respectively acquiring the intonation types output for the user voice by the Bayesian classification model and the voice classification model.
S205, judging whether the two intonation types are the same. If they are the same, execute S206; if not, execute S207.
S206, taking the shared intonation type as the intonation type corresponding to the user voice.
If the intonation types output by the Bayesian classification model and the voice classification model are the same, the probability that this shared intonation type is the correct intonation type of the user voice is high.
S207, determining the intonation type corresponding to the user voice to be a preset default intonation type.
For example, the default intonation type may be preset to the gentle formal interactive intonation type, so that whenever the two models output different intonation types for the user voice, the intonation type corresponding to the user voice is determined to be the gentle formal interactive intonation type.
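S204-S207 amount to a simple decision-fusion rule, sketched below; the type names and the function signature are assumptions:

```python
# Hypothetical default, per the example above.
DEFAULT_TYPE = "gentle_formal"

def fuse_intonation(bayes_label: str, content_label: str,
                    default: str = DEFAULT_TYPE) -> str:
    """Return the agreed intonation type, or the preset default on disagreement."""
    return bayes_label if bayes_label == content_label else default
```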
In the method provided by this embodiment, the Bayesian classification model determines the intonation type corresponding to the user voice based on the voice features of the voice, while the voice classification model determines it according to the voice content of the user voice; this amounts to determining the intonation type from different dimensions. Jointly determining the intonation type of the user voice with the trained Bayesian classification model and voice classification model therefore improves the accuracy of the resulting intonation type.
Fig. 3 is a schematic structural diagram of a voice response device provided by an embodiment of the present application, comprising a processor 301 and a memory 302; the memory is used to store a program, and the processor is used to run the program to implement the voice response method provided herein.
Intelligent voice response devices may be placed at various service points to provide automatic voice response services to users. For example, an intelligent voice response device may be used at a service outlet where business is transacted, improving the user's service experience by offering both fun-like cheerful interaction and business-transaction interaction.
For example, when the user voice is of a cheerful intonation type, the user likely wishes to have an informal, fun-like interaction with the intelligent voice device; when the user voice is of a flat intonation type, the user likely wishes to conduct a formal business interaction with it.
Correspondingly, the intonation type of the user voice is designated in advance as either a cheerful intonation type or a gentle intonation type, and the intelligent voice response device is preconfigured to respond with a cheerful, fun-like intonation when the user voice is determined to be of the cheerful intonation type, and with a gentle, formal intonation when it is determined to be of the gentle intonation type. By providing these two different interaction modes, the intelligent voice response device improves the user's service experience.
Fig. 4 is a schematic structural diagram of a voice response apparatus provided by an embodiment of the present application, comprising:
an acquisition unit 401 for acquiring a user voice;
a determining unit 402, configured to determine a intonation type corresponding to the user voice according to the voice feature and the voice content of the user voice;
a generating unit 403, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and the broadcasting unit 404 is used for broadcasting the response voice.
The intonation types include at least two designated intonation types, each of which is preset according to the voice features and voice content of historical user voices; the voice features include at least pitch features and amplitude features.
The determining unit 402 determines the intonation type corresponding to the user voice according to the voice features and voice content of the user voice as follows:
inputting the user voice into a pre-trained Bayesian classification model, so that the Bayesian classification model determines the intonation type corresponding to the user voice according to the voice features of the user voice;
recognizing the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model, so that the voice classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types output for the user voice by the Bayesian classification model and the voice classification model;
if the two models output the same intonation type, taking that intonation type as the intonation type corresponding to the user voice;
and if the two models output different intonation types, determining the intonation type corresponding to the user voice to be a preset default intonation type.
Optionally, the Bayesian classification model is trained on voice training samples that carry the voice features. The Bayesian classification model determines the intonation type corresponding to the user voice as follows: according to the voice features of the user voice, it calculates the probability that the user voice belongs to each intonation type, and takes the intonation type with the largest probability value as the intonation type corresponding to the user voice.
Optionally, the voice classification model is a GA-BP neural network model, obtained by optimizing an initial BP neural network model;
the number of input-layer nodes of the initial BP neural network model is determined by the voice content length of the voice training samples, the number of output-layer nodes is determined by the number of intonation types, and the number of hidden-layer nodes is determined by trial and error;
the initial BP neural network model is optimized as follows: according to preset sample data and a genetic algorithm, the initial weights and thresholds of the input layer, hidden layer, and output layer are trained, and the optimal initial weights and thresholds of each layer are determined, yielding the optimized BP neural network model.
Optionally, the generating unit 403 generates the response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content as follows:
determining the response voice content based on the voice content;
generating a response voice whose voice content is the response voice content and whose intonation type is the intonation type corresponding to the user voice.
The device provided by the embodiment of the application acquires a user voice, determines the intonation type corresponding to the user voice according to the voice features and voice content of the user voice, generates a response voice corresponding to the user voice based on the intonation type and the voice content, and finally broadcasts the response voice. Because the broadcast response voice is derived from both the intonation type and the voice content of the user voice, the response differs whenever the intonation type of the user voice differs. This realizes personalized responses to the user voice and thereby improves the user experience.
In addition, because the intonation type corresponding to the user voice is determined from two dimensions, the voice features and the voice content, it has higher accuracy, which in turn improves the accuracy of the broadcast response voice.
The present application also provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of voice response of the present application, namely to perform the steps of:
acquiring user voice;
determining the intonation type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
and broadcasting the response voice.
The functions described in the methods of the present application, if implemented in the form of software functional units and sold or used as a standalone product, may be stored in a computing-device-readable storage medium. Based on this understanding, the portion of the embodiments of the present application that contributes over the prior art, or a portion of the technical solution, may be embodied as a software product stored in a storage medium and comprising several instructions that cause a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A method of voice response comprising:
acquiring user voice;
determining the intonation type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
generating response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
broadcasting the response voice;
the intonation types include at least two designated intonation types, each of which is preset according to the voice features and voice content of historical user voices, that is, according to the voice information conveying the mood and emotional attitude of the historical user voices together with their voice content; a designated intonation type is a fun-like interactive intonation type or a gentle formal interactive intonation type; the fun-like interactive intonation type is a type in which the pitch or amplitude of the voice varies greatly and the voice content is weakly correlated with service inquiry questions; the gentle formal interactive intonation type is a type in which the pitch or amplitude of the voice varies little and the voice content is strongly correlated with service inquiry questions;
the voice features include at least pitch features and amplitude features;
wherein, the determining the intonation type corresponding to the user voice according to the voice characteristics of the user voice and the voice content includes:
inputting the user voice into a pre-trained Bayesian classification model, and enabling the Bayesian classification model to determine the intonation type corresponding to the user voice according to the voice characteristics of the user voice;
recognizing and obtaining the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model; the voice classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types corresponding to the user voice output by the Bayesian classification model and the voice classification model;
if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are the same intonation type, the same intonation type is used as the intonation type corresponding to the user voice;
wherein the method further comprises:
if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are different intonation types, determining the intonation type corresponding to the user voice as a preset default intonation type;
the Bayesian classification model is obtained by training according to a voice training sample, wherein the voice training sample carries the voice characteristics;
the Bayesian classification model determines the intonation type corresponding to the user voice as follows: according to the voice features of the user voice, the Bayesian classification model calculates the probability value that the user voice belongs to each intonation type, and takes the intonation type with the largest probability value as the intonation type corresponding to the user voice;
the voice classification model is a GA-BP neural network model, and the GA-BP neural network model is obtained by optimizing an initial BP neural network model;
the number of input layer nodes of the initial BP neural network model is determined according to the voice content length of a voice training sample, the number of output layer nodes is determined according to the intonation type, and the number of hidden layer nodes is determined based on a trial-and-error method;
the optimizing of the initial BP neural network model is as follows: training and learning the initial weight and the threshold value of each layer in the input layer, the hidden layer and the output layer of the initial BP neural network model according to preset sample data and a genetic algorithm, and determining the optimal initial weight and the threshold value of each layer to obtain an optimized BP neural network model;
wherein the generating, based on the intonation type corresponding to the user voice and the voice content, a response voice corresponding to the user voice includes:
determining responsive voice content based on the voice content;
and generating a response voice whose voice content is the response voice content and whose intonation type is the intonation type corresponding to the user voice, wherein the intonation type of the response voice is the same as the intonation type of the user voice so as to enhance the personalized effect of the response voice.
2. A voice response apparatus, comprising:
the acquisition unit is used for acquiring the voice of the user;
the determining unit is used for determining the intonation type corresponding to the user voice according to the voice characteristics and the voice content of the user voice;
a generating unit, configured to generate a response voice corresponding to the user voice based on the intonation type corresponding to the user voice and the voice content;
the broadcasting unit is used for broadcasting the response voice;
the intonation types include at least two designated intonation types, each of which is preset according to the voice features and voice content of historical user voices, that is, according to the voice information conveying the mood and emotional attitude of the historical user voices together with their voice content; a designated intonation type is a fun-like interactive intonation type or a gentle formal interactive intonation type; the fun-like interactive intonation type is a type in which the pitch or amplitude of the voice varies greatly and the voice content is weakly correlated with service inquiry questions; the gentle formal interactive intonation type is a type in which the pitch or amplitude of the voice varies little and the voice content is strongly correlated with service inquiry questions;
the voice features include at least pitch features and amplitude features;
wherein, the determining the intonation type corresponding to the user voice according to the voice characteristics of the user voice and the voice content includes:
inputting the user voice into a pre-trained Bayesian classification model, and enabling the Bayesian classification model to determine the intonation type corresponding to the user voice according to the voice characteristics of the user voice;
recognizing and obtaining the voice content corresponding to the user voice;
inputting the voice content of the user voice into a pre-trained voice classification model; the voice classification model determines the intonation type corresponding to the user voice according to the voice content of the user voice;
respectively acquiring the intonation types corresponding to the user voice output by the Bayesian classification model and the voice classification model;
if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are the same intonation type, the same intonation type is used as the intonation type corresponding to the user voice;
wherein the determining further comprises:
if the intonation type output by the Bayesian classification model and the intonation type output by the voice classification model are different intonation types, determining the intonation type corresponding to the user voice as a preset default intonation type;
the Bayesian classification model is obtained by training according to a voice training sample, wherein the voice training sample carries the voice characteristics;
the Bayesian classification model determines the intonation type corresponding to the user voice as follows: according to the voice features of the user voice, the Bayesian classification model calculates the probability value that the user voice belongs to each intonation type, and takes the intonation type with the largest probability value as the intonation type corresponding to the user voice;
the voice classification model is a GA-BP neural network model, and the GA-BP neural network model is obtained by optimizing an initial BP neural network model;
the number of input layer nodes of the initial BP neural network model is determined according to the voice content length of a voice training sample, the number of output layer nodes is determined according to the intonation type, and the number of hidden layer nodes is determined based on a trial-and-error method;
the optimizing of the initial BP neural network model is as follows: training and learning the initial weight and the threshold value of each layer in the input layer, the hidden layer and the output layer of the initial BP neural network model according to preset sample data and a genetic algorithm, and determining the optimal initial weight and the threshold value of each layer to obtain an optimized BP neural network model;
wherein the generating, based on the intonation type corresponding to the user voice and the voice content, a response voice corresponding to the user voice includes:
determining responsive voice content based on the voice content;
and generating a response voice whose voice content is the response voice content and whose intonation type is the intonation type corresponding to the user voice, wherein the intonation type of the response voice is the same as the intonation type of the user voice so as to enhance the personalized effect of the response voice.
3. A voice response apparatus, comprising: a processor and a memory for storing a program; the processor is configured to run the program to implement the method of voice response of claim 1.
4. A computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of voice response of claim 1.
CN202011052933.7A 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium Active CN112201277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011052933.7A CN112201277B (en) 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011052933.7A CN112201277B (en) 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112201277A CN112201277A (en) 2021-01-08
CN112201277B true CN112201277B (en) 2024-03-22

Family

ID=74008030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011052933.7A Active CN112201277B (en) 2020-09-29 2020-09-29 Voice response method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112201277B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334583A (en) * 2018-01-26 2018-07-27 上海智臻智能网络科技股份有限公司 Affective interaction method and device, computer readable storage medium, computer equipment
CN109447354A (en) * 2018-10-31 2019-03-08 中国银行股份有限公司 A kind of intelligent bank note distribution method and device based on GA-BP neural network
KR20190088126A (en) * 2018-01-05 2019-07-26 서울대학교산학협력단 Artificial intelligence speech synthesis method and apparatus in foreign language
CN110110169A (en) * 2018-01-26 2019-08-09 上海智臻智能网络科技股份有限公司 Man-machine interaction method and human-computer interaction device
CN110379445A (en) * 2019-06-20 2019-10-25 深圳壹账通智能科技有限公司 Method for processing business, device, equipment and storage medium based on mood analysis
CN111368538A (en) * 2020-02-29 2020-07-03 平安科技(深圳)有限公司 Voice interaction method, system, terminal and computer readable storage medium
CN111414754A (en) * 2020-03-19 2020-07-14 中国建设银行股份有限公司 Emotion analysis method and device of event, server and storage medium


Also Published As

Publication number Publication date
CN112201277A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
US9190055B1 (en) Named entity recognition with personalized models
CN111428010B (en) Man-machine intelligent question-answering method and device
US9582757B1 (en) Scalable curation system
WO2018033030A1 (en) Natural language library generation method and device
EP3779972A1 (en) Voice wake-up method and apparatus
CN111081220B (en) Vehicle-mounted voice interaction method, full-duplex dialogue system, server and storage medium
CN105632486A (en) Voice wake-up method and device of intelligent hardware
CN110019742B (en) Method and device for processing information
CN111931513A (en) Text intention identification method and device
CN103365833A (en) Context scene based candidate word input prompt method and system for implementing same
CN111191450A (en) Corpus cleaning method, corpus entry device and computer-readable storage medium
CN113392640B (en) Title determination method, device, equipment and storage medium
CN111709223B (en) Sentence vector generation method and device based on bert and electronic equipment
CN110717027B (en) Multi-round intelligent question-answering method, system, controller and medium
CN111357051B (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN111312222A (en) Awakening and voice recognition model training method and device
CN110457454A (en) A kind of dialogue method, server, conversational system and storage medium
CN111858854A (en) Question-answer matching method based on historical dialogue information and related device
CN116401354A (en) Text processing method, device, storage medium and equipment
CN115640398A (en) Comment generation model training method, comment generation device and storage medium
CN116150306A (en) Training method of question-answering robot, question-answering method and device
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN116913278B (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant