CN112992138A - TTS-based voice interaction method and system - Google Patents
- Publication number
- CN112992138A (application CN202110156600.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- expert
- tts
- merchant
- questions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention belongs to the technical field of voice interaction, and particularly relates to a TTS-based voice interaction method and system. An expert database and a speech model are established; when replying to a client, expert speech is synthesized from the speech model and the expert database. During voice interaction between a merchant and a client, the synthesized expert speech first interacts with the client automatically, and when a specific keyword is recognized in the client's speech, the call automatically jumps to a designated customer-service agent of the company for the reply. This design optimizes the speech synthesis model and ensures both accurate answer content and an accurate answering voice.
Description
Technical Field
The invention belongs to the technical field of voice interaction, and particularly relates to a method and a system for voice interaction based on TTS.
Background
In IVR (interactive voice response), ASR (automatic speech recognition), NLP (natural language processing), and TTS (text-to-speech synthesis) together constitute an intelligent customer-service system. A client's conversation with the system is a repeated alternation of ASR → NLP → TTS in one direction and TTS ← NLP ← ASR in the other.
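For illustration only, one turn of that alternating loop can be sketched as follows. The function bodies are stand-in placeholders (a real system would call actual ASR, NLP, and TTS engines), and every name here is an assumption, not part of the patent:

```python
def asr(audio: bytes) -> str:
    """Stand-in speech recognition: decode incoming audio to text."""
    return audio.decode("utf-8")

def nlp(question: str, answers: dict) -> str:
    """Stand-in language understanding: map a recognized question to a reply."""
    return answers.get(question, "Please hold while I transfer you.")

def tts(text: str) -> bytes:
    """Stand-in synthesis: render the reply text as audio."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, answers: dict) -> bytes:
    """One turn of the IVR loop: ASR -> NLP -> TTS."""
    return tts(nlp(asr(audio_in), answers))
```

The reverse direction (TTS ← NLP ← ASR) is the same pipeline viewed from the system's side of the call.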
However, the TTS speech a client hears is machine speech. That is acceptable for procedural consultations, but far less so for expert consultations. A phoneme base can be built from an expert's recordings, yet some service scenarios involve more than one expert, and the "expert" is essentially virtual; no single real person is necessarily responsible for the role. One could model a recording for each expert, but for businesses with high staff turnover this is unaffordable. For example, a client asks "How is the waterproof performance of this outdoor AP?" and hears "This outdoor AP meets the IP68 waterproof standard; you can use it with confidence." Yet the voice is comparatively stiff, and even with good simulation the client will not feel they are talking to a real expert.
Disclosure of Invention
To address these problems, the invention provides a new TTS-based voice interaction method and system.
The specific technical scheme of the invention is as follows:
The invention provides a TTS-based voice interaction method, comprising the following steps:
S1: a speech recognition step of recognizing the incoming consultation speech through a speech recognition module;
S2: a natural language processing step of performing natural language processing on the recognized speech through a natural language processing module;
S3: a speech synthesis step of performing speech synthesis on the reply through a speech synthesis module;
S4: a voice interaction step of establishing, through a voice interaction module, voice channels between a merchant terminal and the several client terminals belonging to each merchant; a customer poses questions to the corresponding merchant terminal through a voice channel, the questions are processed through steps S1-S3, and the replies are sent back to the corresponding customer, realizing voice interaction between the merchant terminal and the client terminals.
Step S3 specifically includes the following step:
S31: an expert database establishing step of acquiring an expert's voice through an expert database establishing unit to build a speech model and an expert database; when speech synthesis is performed, synthesis is based on the expert database and the speech model.
Step S4 specifically includes the following step:
S41: a voice interaction sub-step of, after the natural language processing of step S2, performing speech synthesis of the reply to a customer's question through the voice interaction unit based on the expert database and the speech model, sending the synthesized speech to the corresponding client terminal, and, when a corresponding keyword is recognized in the customer's question, establishing a voice channel between the client terminal and a designated customer-service agent.
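As a sketch of the routing decision in step S41 (synthesize an expert reply unless an escalation keyword appears in the recognized customer speech), the following minimal illustration may help; the keyword set and function name are assumptions, since the patent only gives "human" and "customer service" as example keywords:

```python
# Example escalation keywords; the patent names these two as examples
# but leaves the full keyword set open.
ESCALATION_KEYWORDS = {"human", "customer service"}

def route(recognized_text: str) -> str:
    """Return 'handoff' when an escalation keyword occurs in the recognized
    customer speech, otherwise 'synthesize' (reply with expert TTS)."""
    text = recognized_text.lower()
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return "handoff"
    return "synthesize"
```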
The invention has the following beneficial effects:
the invention provides a new method and a system for voice interaction based on TTS.A speech database and a speech model are established, when a client replies, the speech model and the speech model are used for carrying out speech synthesis of an expert, when a merchant and the client carry out speech interaction, the synthesized expert speech is firstly automatically interacted with the client, and when a certain specific keyword in the client speech is identified, the client automatically jumps to a specified customer service belonging to a company for replying; the design optimizes a speech synthesis model and ensures the compatibility of the accuracy of the answer content and the accuracy of the answer sound.
Drawings
FIG. 1 is a block diagram of a TTS-based voice interaction system in some embodiments;
FIG. 2 is a flow diagram of a TTS-based voice interaction method in some embodiments;
FIG. 3 is a block diagram of a TTS-based voice interaction system in further embodiments;
FIG. 4 is a flowchart of step S31 in further embodiments.
Detailed Description
The present invention will be described in further detail with reference to the following examples and drawings.
In some embodiments, to ensure both accurate answer content and an accurate answering voice, the system first replies with a universal voice robot (ID: 001). When a certain utterance (voice ID: 12389) of the client (UID: 987) is recognized and processed by NLP (natural language processing), the system automatically jumps to the company's designated customer-service agent (ID: 019) for the reply, and that agent then serves as the designated consultant for client UID: 987.
In the universal voice robot (ID: 001), a speech model and an expert database are constructed as the basis of speech synthesis, and speech synthesized from them is presented under the identity ID: 001. As shown in fig. 1 and fig. 2, the specific steps are, for example, as follows:
S1: a speech recognition step of recognizing the incoming consultation speech through a speech recognition module. The speech recognition may use any conventional algorithm; the invention is not particularly limited. For example, algorithms based on dynamic time warping, hidden Markov models (parametric), and vector quantization (non-parametric) are all within the selection range.
S2: a natural language processing step of performing natural language processing on the recognized speech through a natural language processing module. Conventional algorithms may be used; the invention is not particularly limited. For example, techniques based on classical machine learning, such as SVM (support vector machine), Markov models, and CRF (conditional random field), and techniques based on deep learning, such as convolutional and recurrent neural networks, are all within the selection range.
S3: a speech synthesis step of performing speech synthesis on the reply through a speech synthesis module. Conventional algorithms may be used; the invention is not particularly limited. For example, Tacotron, an end-to-end deep-learning TTS model that, once trained, generates the corresponding audio from a given input, as well as LPC synthesis, PSOLA synthesis, and synthesis based on the LMA vocal-tract model, are all within the selection range. Moreover, the speech recognition, natural language processing, and speech synthesis described above may be provided by a third party or integrated in the server of the invention, all within the inventive concept.
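Since the synthesis algorithm is deliberately left open, the synthesis module can be written against a pluggable backend. A hedged sketch follows; the backends here are labeled stubs, not bindings to real engines:

```python
def synthesize(text: str, backend: str = "tacotron") -> bytes:
    """Dispatch to one of several interchangeable TTS backends.

    Each entry below is a stub standing in for a real engine
    (end-to-end Tacotron-style, LPC, PSOLA, ...).
    """
    backends = {
        "tacotron": lambda t: b"[tacotron]" + t.encode("utf-8"),
        "lpc": lambda t: b"[lpc]" + t.encode("utf-8"),
        "psola": lambda t: b"[psola]" + t.encode("utf-8"),
    }
    if backend not in backends:
        raise ValueError(f"unknown TTS backend: {backend!r}")
    return backends[backend](text)
```

Swapping engines then only changes the `backend` argument, matching the "all within the selection range" framing above.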
S4: a voice interaction step of establishing, through a voice interaction module, voice channels between a merchant terminal and the several client terminals belonging to each merchant; a customer poses questions to the corresponding merchant terminal through a voice channel, the questions are processed through steps S1-S3, and the replies are sent back to the corresponding customer, realizing voice interaction between the merchant terminal and the client terminals.
wherein, step S3 specifically includes the following steps:
S31: an expert database establishing step of acquiring an expert's voice through an expert database establishing unit to build a speech model and an expert database; when speech synthesis is performed, synthesis is based on the expert database and the speech model. The expert database is a material database of expert speech after the speech has been aligned with the semantics in the natural language processing module; the speech model is a model for speech synthesis built from the expert database by a machine-learning algorithm (e.g., deep learning, such as convolutional or recurrent neural networks).
Step S4 specifically includes the following steps:
S41: a voice interaction sub-step of, after the natural language processing of step S2, performing speech synthesis of the reply to a customer's question through the voice interaction unit based on the expert database and the speech model, sending the synthesized speech to the corresponding client terminal, and, when a corresponding keyword is recognized in the customer's question, establishing a voice channel between the client terminal and a designated customer-service agent. In this embodiment, the keyword is recognized after processing by the speech recognition step S1.
The corresponding keywords in step S41 include, but are not limited to, "human" and "customer service".
The invention thus provides a new TTS-based voice interaction method: an expert database and a speech model are established; when replying to a customer, expert speech is synthesized from them; when a merchant interacts with a customer by voice, the synthesized expert speech first interacts with the customer automatically; and when a specific keyword is recognized in the customer's speech, the call automatically jumps to a designated customer-service agent of the company for the reply. This design optimizes the speech synthesis model and ensures both accurate answer content and an accurate answering voice.
In other embodiments, establishing the speech model and the expert database specifically includes initial voice acquisition and multi-voice acquisition. The initial voice acquisition proceeds as follows:
The merchant selects any one of its experts (30 in total) and models that expert's voice as the standard voice, the basis for speech synthesis. The collected material covers the character and word levels as well as the sentence and paragraph levels. The sentence- and paragraph-level content materials are standard answers formulated by the merchant's business department, including standard answers to 100 frequently asked questions; speech synthesized from these answers is presented under the identity ID: 001.
The multi-voice acquisition proceeds as follows:
First, the responses of the remaining 29 experts to the 100 frequently asked questions are collected and the expert database is built, which satisfies the requirements of the first embodiment. The specific establishing steps are as follows:
as shown in fig. 3 and 4, step S31 specifically includes the following steps:
S311: an initial voice acquisition step of acquiring, through an initial voice acquisition sub-module, the voice of any one expert belonging to the merchant as the standard voice and building the speech model as the basis of speech synthesis;
S312: a multi-voice acquisition step of acquiring the voices of the remaining experts through a multi-voice acquisition sub-module, building a corresponding expert database from each expert's voice, and training the speech model on the expert database.
The expert voices collected in step S311 in this embodiment cover, but are not limited to, the character, word, sentence, and paragraph levels, where the sentence- and paragraph-level content materials are the standard answers to the frequently asked questions; the expert voices collected in step S312 are each expert's responses to those questions.
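The level tags above (word vs. sentence vs. paragraph) could be assigned to collected material with a rough heuristic like the following; the splitting rule is purely an illustrative assumption, as the patent does not say how material is classified:

```python
def classify_material(text: str) -> str:
    """Tag a collected utterance as word-, sentence-, or paragraph-level
    using a crude punctuation and whitespace heuristic."""
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    if len(sentences) > 1:
        return "paragraph"
    if len(text.split()) > 1:
        return "sentence"
    return "word"
```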
In other embodiments, a new expert may be trained and their voice collected:
In the learning stage, the trainee (TID: 009) is required to listen repeatedly to the 100 frequently asked questions and to answer them by voice. For example, the trainee may simulate a customer asking the system the 100 questions and listen repeatedly to the answers given under ID: 001. In the test stage, the test questions may include some of the 100 frequently asked questions or newly added ones (to examine adaptability). By the time the test is passed, the system has also collected enough of TID: 009's phonemes, and the trainee can be converted directly into an expert identity, ID: 111. The specific steps are as follows:
as shown in fig. 3 and 4, step S31 further includes the following steps:
S313: a new expert training step of sending frequently asked questions, through a new expert training sub-module, to a trainee at the merchant end so as to train a new expert, collecting the trainee's voice, and, once the collected voice passes comparison, storing it in the expert database as expert voice.
In this embodiment, when training a new expert in step S313, the method specifically includes the following steps:
a learning step of sending frequently asked questions, together with voice answers to them, to the trainee's training terminal;
and a testing step of collecting the trainee's responses to the frequently asked questions, comparing and judging the responses, and, once the comparison passes, notifying the trainee, converting the trainee into an expert identity, and storing the collected responses in the corresponding expert database. In the testing step, when a response is compared and judged, the trainee's answering voice first undergoes speech recognition and natural-language semantic processing, and the answer is then matched against the standard answers stored in the natural language processing module.
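The comparison-and-judgment of the testing step (match the recognized trainee answer against the stored standard answer) might use a simple token-overlap score, sketched below. The metric and the threshold are assumptions; the patent only says the answer is matched against the standard answers:

```python
def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap between the trainee's recognized answer and the
    standard answer held by the NLP module."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def answer_passes(candidate: str, reference: str, threshold: float = 0.6) -> bool:
    """Accept the trainee's answer when it is close enough to the standard."""
    return token_overlap(candidate, reference) >= threshold
```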
In other embodiments, a manager, Supervisor Li (LID: 005), can also regularly listen to the answers to the frequently asked questions and update his own voice library, so that the answers pushed to trainees are preferably Supervisor Li's; for example:
as shown in fig. 3 and 4, step S31 further includes the following steps:
S314: an expert database establishing sub-step of acquiring, through an expert database establishing sub-module, the manager's voice responses to the frequently asked questions of the merchant and establishing a corresponding expert database;
S315: an expert database updating step of collecting, through an expert database updating sub-module, the responses to the frequently asked questions sent by the manager, updating the manager's expert database, and updating the answer database in the natural language processing module.
In step S315 of this embodiment, based on a monitoring instruction sent by a manager at the merchant end, a monitoring channel is established on the voice channel between the client and the merchant, or between the client and the designated customer service, and the manager monitors the voice call. When a marking instruction from the merchant end is received during monitoring, the corresponding position in the voice is marked; when the end of the call is recognized, the monitoring channel is disconnected and the frequently asked questions corresponding to the marked positions are sent to the manager at the merchant end. A mark may be a voice mark, a text mark entered by the manager, a time mark, and so on. Marks are made as labels; once marking is complete they form a label list that every leader above the supervisor can see. The frequently asked question corresponding to a marked position may be sent to the manager as voice, or as text produced by speech recognition and natural language processing of that voice.
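The label list built during monitoring can be modeled as a time-ordered list of marks, as in the sketch below; the record fields (`position_s`, `note`) are illustrative assumptions, since the patent only says marks form a label list:

```python
def add_mark(labels: list, position_s: float, note: str) -> None:
    """Append one mark (time offset into the call plus a note) made while
    the manager is monitoring."""
    labels.append({"position_s": position_s, "note": note})

def label_list(labels: list) -> list:
    """Return the marks sorted by time: the list visible to higher-level
    leaders once marking is complete."""
    return sorted(labels, key=lambda m: m["position_s"])
```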
In step S315 of this embodiment, after the expert database is updated, the manager's responses to the frequently asked questions are sent to the experts at the merchant end. On receiving the updated voice, each expert can hear or see the latest answer to each frequently asked question and can update his or her own answers in real time during voice interaction with clients.
In this embodiment, the voice answers to the frequently asked questions sent to the trainee in step S313 are preferably the manager's responses. In the training step, the trainee answers frequently asked questions that include both the questions sent during the learning stage and newly added ones; the newly added questions are frequently asked questions downloaded at random from the Internet or new ones received from the manager.
The invention also provides a TTS-based voice interaction system, comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the above method.
The above embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made to the technical solution of the invention by those skilled in the art, without departing from the spirit of the invention, shall fall within the protection scope defined by the claims.
Claims (10)
1. A TTS-based voice interaction method, comprising the following steps:
S1: a speech recognition step of recognizing the incoming consultation speech through a speech recognition module;
S2: a natural language processing step of performing natural language processing on the recognized speech through a natural language processing module;
S3: a speech synthesis step of performing speech synthesis on the reply through a speech synthesis module;
S4: a voice interaction step of establishing, through a voice interaction module, voice channels between a merchant terminal and the several client terminals belonging to each merchant; a customer poses questions to the corresponding merchant terminal through a voice channel, the questions are processed through steps S1-S3, and the replies are sent back to the corresponding customer, realizing voice interaction between the merchant terminal and the client terminals;
characterized in that step S3 specifically comprises the following step:
S31: an expert database establishing step of acquiring an expert's voice through an expert database establishing unit to build a speech model and an expert database, synthesis being based on the expert database and the speech model when speech synthesis is performed;
and step S4 specifically comprises the following step:
S41: a voice interaction sub-step of, after the natural language processing of step S2, performing speech synthesis of the reply to a customer's question through the voice interaction unit based on the expert database and the speech model, sending the synthesized speech to the corresponding client terminal, and, when a corresponding keyword is recognized in the customer's question, establishing a voice channel between the client terminal and a designated customer-service agent.
2. The TTS-based voice interaction method according to claim 1, wherein step S31 specifically comprises the following steps:
S311: an initial voice acquisition step of acquiring, through an initial voice acquisition sub-module, the voice of any one expert belonging to the merchant as the standard voice and building the speech model as the basis of speech synthesis;
S312: a multi-voice acquisition step of acquiring the voices of the remaining experts through a multi-voice acquisition sub-module, building a corresponding expert database from each expert's voice, and training the speech model on the expert database.
3. The TTS-based voice interaction method according to claim 2, wherein the expert voices collected in step S311 cover, but are not limited to, the character, word, sentence, and paragraph levels, the sentence- and paragraph-level content materials being the standard answers to frequently asked questions, and the expert voices collected in step S312 being each expert's responses to the frequently asked questions.
4. The TTS-based voice interaction method according to claim 2, wherein step S31 further comprises the following step:
S313: a new expert training step of sending frequently asked questions, through a new expert training sub-module, to a trainee at the merchant end so as to train a new expert, collecting the trainee's voice, and, once the collected voice passes comparison, storing it in the expert database as expert voice.
5. The TTS-based voice interaction method according to claim 4, wherein training a new expert in step S313 specifically comprises the following steps:
a learning step of sending frequently asked questions, together with voice answers to them, to the trainee's training terminal;
and a testing step of collecting the trainee's responses to the frequently asked questions, comparing and judging the responses, and, once the comparison passes, notifying the trainee, converting the trainee into an expert identity, and storing the collected responses in the corresponding expert database.
6. The TTS-based voice interaction method according to claim 5, wherein step S31 further comprises the following steps:
S314: an expert database establishing sub-step of acquiring, through an expert database establishing sub-module, the manager's voice responses to the frequently asked questions of the merchant and establishing a corresponding expert database;
S315: an expert database updating step of collecting, through an expert database updating sub-module, the responses to the frequently asked questions sent by the manager, updating the manager's expert database, and updating the answer database in the natural language processing module.
7. The TTS-based voice interaction method according to claim 6, wherein in step S315, based on a monitoring instruction sent by a manager at the merchant end, a monitoring channel is established on the voice channel between the client and the merchant, or between the client and the designated customer service, and the manager monitors the voice call; when a marking instruction from the merchant end is received during monitoring, the corresponding position in the voice is marked; and when the end of the call is recognized, the monitoring channel is disconnected and the frequently asked questions corresponding to the marked positions are sent to the manager at the merchant end.
8. The TTS-based voice interaction method of claim 7, wherein in step S315, after the expert database is updated, the responses of the management personnel to the common questions are sent to the experts of the merchant.
9. The TTS-based voice interaction method according to claim 5, wherein the voice answers to the frequently asked questions sent to the trainee in step S313 are preferably the manager's responses; and in the training step, the trainee answers frequently asked questions that include both the questions sent during the learning stage and newly added ones, the newly added questions being frequently asked questions downloaded at random from the Internet or new ones received from the manager.
10. A TTS-based voice interaction system, comprising a memory, a processor, and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110156600.7A CN112992138A (en) | 2021-02-04 | 2021-02-04 | TTS-based voice interaction method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112992138A true CN112992138A (en) | 2021-06-18 |
Family
ID=76347073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110156600.7A Pending CN112992138A (en) | 2021-02-04 | 2021-02-04 | TTS-based voice interaction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112992138A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002149999A (en) * | 2000-11-07 | 2002-05-24 | Matsushita Electric Works Ltd | Customer supporting system |
CN107634898A (en) * | 2017-08-18 | 2018-01-26 | 上海云从企业发展有限公司 | True man's voice information communication is realized by the chat tool on electronic communication equipment |
CN110457453A (en) * | 2019-07-12 | 2019-11-15 | 平安普惠企业管理有限公司 | Customer problem and customer service interconnection method, device, medium, electronic equipment |
CN110472023A (en) * | 2019-07-10 | 2019-11-19 | 深圳追一科技有限公司 | Customer service switching method, device, computer equipment and storage medium |
CN111599359A (en) * | 2020-05-09 | 2020-08-28 | 标贝(北京)科技有限公司 | Man-machine interaction method, server, client and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11380327B2 (en) | Speech communication system and method with human-machine coordination | |
US8725518B2 (en) | Automatic speech analysis | |
US11501656B2 (en) | Interactive and automated training system using real interactions | |
CN110489756B (en) | Conversational human-computer interactive spoken language evaluation system | |
CN102089804B (en) | Voice synthesis model generation device, voice synthesis model generation system, communication terminal device and method for generating voice synthesis model | |
EP1901283A2 (en) | Automatic generation of statistical laguage models for interactive voice response applacation | |
US20060149555A1 (en) | System and method of providing an automated data-collection in spoken dialog systems | |
US7933775B2 (en) | Method of and system for providing adaptive respondent training in a speech recognition application based upon the inherent response of the respondent | |
CN111241357A (en) | Dialogue training method, device, system and storage medium | |
CN105723360A (en) | Improving natural language interactions using emotional modulation | |
CA2783088A1 (en) | Computer-implemented system and method for assessing and utilizing user traits in an automated call center environment | |
WO2008084476A2 (en) | Vowel recognition system and method in speech to text applications | |
CN1748249A (en) | Intermediary for speech processing in network environments | |
CN107133709B (en) | Quality inspection method, device and system for customer service | |
CN110610705A (en) | Voice interaction prompter based on artificial intelligence | |
CN115643341A (en) | Artificial intelligence customer service response system | |
CN110265008A (en) | Intelligence pays a return visit method, apparatus, computer equipment and storage medium | |
CN109739969A (en) | Answer generation method and intelligent conversational system | |
CN116631412A (en) | Method for judging voice robot through voiceprint matching | |
JP2006208644A (en) | Server system and method for measuring linguistic speaking ability | |
US20190362737A1 (en) | Modifying voice data of a conversation to achieve a desired outcome | |
CN109616116B (en) | Communication system and communication method thereof | |
KR20210117827A (en) | Voice service supply system and supply method using artificial intelligence | |
CN112992138A (en) | TTS-based voice interaction method and system | |
KR20190070682A (en) | System and method for constructing and providing lecture contents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210618 |