CN112489642B - Method, device, equipment and storage medium for controlling voice robot response - Google Patents

Publication number: CN112489642B (application CN202011130332.3A)
Authority
CN
China
Prior art keywords: voice, scene, robot, state, call state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number: CN202011130332.3A
Other languages: Chinese (zh)
Other versions: CN112489642A
Inventor
刘彦华
邓锐涛
王艺霏
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority claimed from CN202011130332.3A
Publication of CN112489642A
Application granted
Publication of CN112489642B
Legal status: Active

Classifications

    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts
    • H04M3/527 Centralised call answering arrangements not requiring operator intervention
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Manipulator (AREA)

Abstract

The present application relates to a method, an apparatus, a device, and a storage medium for controlling a voice robot response. The method comprises the following steps: collecting voice from a user terminal during a voice call between a voice robot and the user terminal; determining a current call state scene type according to the voice collection result, wherein the call state scene type represents the states of the user corresponding to the user terminal and of the voice robot in the voice call; acquiring a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme. With this method, the response accuracy of the controlled robot can be improved.

Description

Method, device, equipment and storage medium for controlling voice robot response
Technical Field
The present application relates to the field of artificial intelligence and voice communication technologies, and in particular, to a method, an apparatus, a device, and a storage medium for controlling a voice robot response.
Background
With the development of artificial intelligence technology, many robots have emerged to replace humans in various scenarios. The voice robot is a common type of intelligent robot that can communicate with users in place of human customer service agents, thereby handling part of the customer service workload. For example, using a voice robot for outbound calls is a common scenario. An outbound call means that the voice robot actively calls a user to establish a voice call.
In the conventional method, the voice robot communicates with the user only by playing fixed voice according to a preset fixed flow. However, different users react differently to the same voice content played by the voice robot, so responding to all users uniformly with a single fixed voice can be too limited, resulting in low response accuracy of the robot. Therefore, the low response accuracy of the conventional method is a problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, computer device, and storage medium for controlling a voice robot response that can improve response accuracy.
A method of controlling a voice robot response, the method comprising:
collecting voice from a user terminal during a voice call between a voice robot and the user terminal;
determining a current call state scene type according to the voice collection result, wherein the call state scene type is used for representing the states of a user corresponding to the user terminal and of the voice robot in the voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
In one embodiment, collecting voice from the user terminal includes:
starting to record voice data when an initial voice signal from the user terminal is detected, and stopping recording when no voice signal from the user terminal has been detected for a continuous preset time length, to obtain the recorded voice data.
In one embodiment, determining the current call state scene type according to the voice collection result includes:
converting the collected voice data into text content;
and parsing the text content to determine the current call state scene type.
In one embodiment, parsing the text content to determine the current call state scene type includes:
if the text content is empty, determining that the current call state scene type is a silence scene;
acquiring the robot response scheme corresponding to the call state scene type includes:
acquiring a user state confirmation voice corresponding to the silence scene;
and controlling the voice robot to respond according to the robot response scheme includes:
controlling the voice robot to play the user state confirmation voice.
In one embodiment, the method further comprises:
acquiring the call state of the voice robot;
and parsing the text content to determine the current call state scene type includes:
if the text content is not empty and the call state of the voice robot is a broadcasting state, determining that the current call state scene type is an abnormal interrupt scene.
In one embodiment, acquiring the robot response scheme corresponding to the call state scene type includes:
performing preset tag detection processing on the voice being played by the voice robot in the broadcasting state;
determining, according to the tag detection result, whether the played voice is allowed to be interrupted;
if the played voice is allowed to be interrupted, acquiring a stop-broadcast scheme corresponding to the abnormal interrupt scene;
and if the played voice is not allowed to be interrupted, acquiring a maintain-broadcast scheme corresponding to the abnormal interrupt scene.
In one embodiment, performing preset tag detection processing on the voice being played by the voice robot in the broadcasting state includes:
determining the voice file in which the played voice is located;
and detecting whether the voice file carries a continuous-play tag;
determining, according to the tag detection result, whether the played voice is allowed to be interrupted includes:
if the continuous-play tag is detected, determining that the played voice is not allowed to be interrupted;
and if the continuous-play tag is not detected, determining that the played voice is allowed to be interrupted.
In one embodiment, if the continuous-play tag is not detected, determining that the played voice is allowed to be interrupted includes:
if the continuous-play tag is not detected, determining the time node that the voice robot has currently reached in the played voice;
detecting whether the voice content at the time node carries an interrupt-prohibition tag;
and if the interrupt-prohibition tag is not carried, determining that the played voice is allowed to be interrupted;
the method further comprises:
if the interrupt-prohibition tag is carried, determining that the played voice is not allowed to be interrupted.
In one embodiment, parsing the text content to determine the current call state scene type includes:
if the voice robot is not in the broadcasting state and the text content is not empty, performing repeated-content analysis on the text content;
and if the analysis finds continuously repeated content within a preset time period, determining that the current call state scene type is a redundant scene;
controlling the voice robot to respond according to the robot response scheme includes:
controlling the voice robot to play an active interrupt voice corresponding to the redundant scene.
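The repeated-content check behind the redundant scene can be sketched as a run-length test over successive recognized utterances. This is one illustrative reading of the embodiment, not the patent's actual implementation; the repeat threshold is a hypothetical parameter.

```python
def has_continuous_repeats(utterances, min_repeats=3):
    """Return True if the same recognized utterance occurs `min_repeats`
    or more times in a row, i.e. the user keeps repeating themselves."""
    run = 1
    for prev, cur in zip(utterances, utterances[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False
```

A real system would more likely compare normalized text or recognized intent within a sliding time window rather than exact strings.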
In one embodiment, determining the current call state scene type according to the voice collection result includes:
if the voice signal of the user disappears but is detected again within the preset time length, determining that the current call state scene type is a long-sentence waiting scene;
acquiring the robot response scheme corresponding to the call state scene type includes:
splicing the voice signal collected before the re-detection with the voice signal collected after the re-detection;
converting the spliced voice signal into text content;
performing semantic recognition on the text content, and acquiring response information corresponding to the semantic recognition result;
and converting the response information into a response voice.
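The splice-then-answer flow for the long-sentence waiting scene can be sketched as follows. Here `asr` and `answer_for` stand in for the speech recognition step and the semantic lookup; they are hypothetical callables, not APIs named in the patent.

```python
def answer_long_sentence(audio_before_pause, audio_after_pause, asr, answer_for):
    """Splice the audio captured before and after the user's pause,
    recognize the joined utterance, and build one reply for it."""
    spliced = audio_before_pause + audio_after_pause  # concatenate raw audio
    text = asr(spliced)                               # speech-to-text
    return answer_for(text)                           # semantic match -> reply text
```

The point of splicing first is that the two fragments are treated as a single utterance, so the semantic recognition sees the complete sentence rather than two halves.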
In one embodiment, determining the current call state scene type according to the voice collection result includes:
detecting the signal strength of the collected voice signal, and if the signal strength is lower than a preset threshold, determining that the current call state scene type is an inaudible scene;
or detecting the continuity of the collected voice signal, and if the continuity detection result indicates that the signal is discontinuous, determining that the current call state scene type is an inaudible scene.
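Both inaudible-scene checks (weak signal, discontinuous signal) can be sketched over a list of audio samples. The threshold and gap length below are illustrative values, not figures from the patent.

```python
def is_inaudible(samples, threshold=0.1, max_gap=3):
    """Scene is 'inaudible' if the average signal strength falls below a
    threshold, or the signal contains a run of near-zero samples longer
    than `max_gap` (a discontinuity)."""
    if not samples:
        return True
    strength = sum(abs(s) for s in samples) / len(samples)
    if strength < threshold:
        return True
    gap = 0
    for s in samples:
        gap = gap + 1 if abs(s) < 1e-6 else 0
        if gap > max_gap:
            return True
    return False
```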
An apparatus for controlling a voice robot response, the apparatus comprising:
the voice acquisition module is used for acquiring voice of the user terminal in the voice communication process of the voice robot and the user terminal;
The call state scene recognition module is used for determining the current call state scene type according to the voice collection result; the call state scene type is used for representing the states of the user corresponding to the user terminal and of the voice robot in the voice call;
The robot response module is used for acquiring a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme.
A computer device comprising a memory storing a computer program and a processor which, when executing the computer program, performs the steps of:
collecting voice from a user terminal during a voice call between a voice robot and the user terminal;
determining a current call state scene type according to the voice collection result, wherein the call state scene type is used for representing the states of a user corresponding to the user terminal and of the voice robot in the voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
collecting voice from a user terminal during a voice call between a voice robot and the user terminal;
determining a current call state scene type according to the voice collection result, wherein the call state scene type is used for representing the states of a user corresponding to the user terminal and of the voice robot in the voice call;
acquiring a robot response scheme corresponding to the call state scene type;
and controlling the voice robot to respond according to the robot response scheme.
In the above method, apparatus, computer device, and storage medium for controlling a voice robot response, voice is collected from the user terminal during the voice call between the voice robot and the user terminal. The current call state scene type is determined according to the voice collection result, the call state scene type representing the states of the user corresponding to the user terminal and of the voice robot in the voice call. The voice robot can thus be controlled to respond according to the robot response scheme corresponding to the call state scene type. Therefore, the controlled robot can flexibly give different responses in different call state scenes, which improves the accuracy of response control of the voice robot.
Drawings
FIG. 1 is an application environment diagram of a method of controlling voice robot responses in one embodiment;
FIG. 2 is a flow diagram of a method of controlling a voice robot response in one embodiment;
FIG. 3 is a flow diagram of a method of controlling a voice robot response in one embodiment;
FIG. 4 is a flow diagram of a method of controlling a voice robot response in one embodiment;
FIG. 5 is a flow diagram of a method of controlling a voice robot response in one embodiment;
FIG. 6 is a flow diagram of a method of controlling a voice robot response in one embodiment;
FIG. 7 is a block diagram of an apparatus for controlling a voice robot response in one embodiment;
FIG. 8 is a block diagram of an apparatus for controlling a voice robot response in another embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The method for controlling the response of the voice robot can be applied to an application environment shown in fig. 1. Wherein the call platform 102 communicates with the user terminal 104 via a network. The intelligent robot in the call platform 102 may conduct a voice call with the user terminal. The user terminal 104 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The call platform 102 may be implemented as a stand-alone server or as a cluster of servers. The voice robot is an intelligent calling and answering module in the calling platform, and can automatically conduct voice conversation with a user in voice conversation. The call platform 102 may be an outbound platform that actively initiates a call to a user terminal, or may be a platform that receives a call initiated by a user terminal, which is not limited.
The call platform 102 performs voice collection on the user terminal 104 during a voice call between the voice robot and the user terminal 104. The call platform 102 may determine the current call state scene type according to the voice collection result, acquire a robot response scheme corresponding to the call state scene type, and control the voice robot to respond according to the robot response scheme. If the voice robot has generated a response voice, the response voice may be transmitted to the user terminal 104.
It should be noted that fig. 1 is only a schematic illustration, and in other embodiments, the voice robot may be a stand-alone computer device (for example, a humanoid robot with voice call capability), and is not limited to an intelligent module in the call platform, and communication may be performed between the voice robot itself and the user terminal. Then, the method of controlling the response of the voice robot in the embodiments of the present application may be performed by the voice robot itself.
In one embodiment, as shown in fig. 2, a method for controlling a voice robot response is provided, and the method is applied to the call platform in fig. 1 for illustration, and includes the following steps:
Step 202, in the process of voice communication between the voice robot and the user terminal, voice collection is performed on the user terminal.
The voice robot is an artificial intelligent robot which is arranged in a call platform and can autonomously communicate with a user in a user terminal.
Specifically, the voice robot can establish a voice call with the user terminal, and in the voice call process, the call platform can collect voice of the user terminal so as to collect voice data of the user at the user terminal side.
In one embodiment, the call platform may be an outbound platform, and the voice robot in the outbound platform may actively initiate a call to the user terminal to establish a voice call with the user terminal. In the voice communication process, the voice robot can collect voice of the user terminal so as to collect voice data of the user at the user terminal side. In other embodiments, the voice collection device in the call platform may also collect voice from the user terminal.
In one embodiment, the call platform may also be a platform that receives calls initiated by user terminals. That is, the user terminal actively initiates a call request to the call platform to establish a voice call with the voice robot that responds in the call platform. It can be appreciated that the voice robot in this embodiment is equivalent to an artificial intelligence customer service with voice call function.
Step 204, determining the current call state scene type according to the voice acquisition result.
The call state scene type refers to the type of the call state scene.
The call state refers to the state of the user corresponding to the user terminal and of the voice robot in the voice call, for example, the state of the user in the call or the state of the voice robot. This is different from the call state characterized by traditional signal quality.
In one embodiment, the call state scene type may include at least one of a silence scene, an abnormal interrupt scene, a redundant scene, a long sentence waiting scene, an inaudible scene, and the like. For example, the redundant scenario indicates that the user is in a redundant state in the call.
In one embodiment, the call platform may determine the current call state scene type directly from the voice data in the voice acquisition result. Specifically, the call platform may determine the current call state scene type according to the voice waveform in the voice data.
In one embodiment, the call platform may also recognize the current call state scene type after text conversion of the voice acquisition result.
Step 206, acquiring a robot response scheme corresponding to the call state scene type.
The robot response scheme is a scheme by which the robot responds, during the voice call, to the voice collected from the user terminal.
In one embodiment, the robot response scheme may include at least one of a stop play scheme, a maintain play scheme, an active interrupt scheme, and a scheme of generating a reply voice, etc.
Specifically, corresponding robot response schemes are preset for different call state scene types in the call platform. The call platform may find a robot response scheme corresponding to the determined call state scenario type.
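The preset mapping from scene type to response scheme can be sketched as a lookup table. The scene names and scheme identifiers below are illustrative labels, not identifiers from the patent's implementation.

```python
# Preset scene-type -> response-scheme table (illustrative labels).
RESPONSE_SCHEMES = {
    "silence": "play_user_state_confirmation",
    "abnormal_interrupt": "check_tags_then_stop_or_keep_playing",
    "redundant": "play_active_interrupt",
    "long_sentence_wait": "splice_then_answer",
    "inaudible": "confirm_signal_with_user",
}

def get_response_scheme(scene_type):
    """Look up the preset robot response scheme for a call-state scene."""
    return RESPONSE_SCHEMES.get(scene_type, "default_reply")
```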
Step 208, controlling the voice robot to respond according to the robot response scheme.
Specifically, the call platform may control the voice robot to respond in a voice call process between the voice robot and the user terminal according to a robot response scheme.
It can be appreciated that there are different robot response schemes for different call state scene types, and thus, the voice robot can be controlled to perform different flexible responses.
According to the above method for controlling a voice robot response, voice is collected from the user terminal during the voice call between the voice robot and the user terminal. The current call state scene type is determined according to the voice collection result, the call state scene type representing the states of the user corresponding to the user terminal and of the voice robot in the voice call. The voice robot can thus be controlled to respond according to the robot response scheme corresponding to the call state scene type. Therefore, the controlled robot can flexibly give different responses in different call state scenes, which improves the accuracy of response control of the voice robot. Furthermore, the naturalness, authenticity, and accuracy of human-machine interaction are improved.
In one embodiment, collecting voice from the user terminal includes: starting to record voice data when an initial voice signal from the user terminal is detected, and stopping recording when no voice signal from the user terminal has been detected for a continuous preset time length, to obtain the recorded voice data.
The initial voice signal is the first voice signal detected from the user terminal after silence detection begins.
Specifically, during a voice call, the call platform may perform voice collection on the user terminal through a voice endpoint detection device (Voice Activity Detection, VAD). During collection, recording of voice data may start when the initial voice signal of the user terminal is detected (i.e., when the user is detected to start speaking), and recording stops after no voice signal from the user terminal has been detected for a continuous preset time length, yielding the recorded voice data.
For example, starting from silence detection, recording begins when the user is detected to start speaking; if no voice signal is then detected for a continuous 20 ms, the sentence is considered finished and recording can stop, yielding one piece of voice data.
In this embodiment, a segment of speech uttered by the user can be collected accurately, which avoids the computational pressure of analyzing all collected audio; in other words, voice data that actually carries information can be collected accurately by this method.
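The endpoint-based recording described above can be sketched as follows, assuming a hypothetical frame-level `is_speech` detector; production VAD typically decides per 10-30 ms frame using signal energy or a small model.

```python
def record_utterance(frames, is_speech, silence_frames_limit=2):
    """Record one utterance: start at the first speech frame, stop once
    `silence_frames_limit` consecutive non-speech frames have elapsed."""
    recorded = []
    started = False
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            started = True
            silent_run = 0
            recorded.append(frame)
        elif started:
            silent_run += 1
            recorded.append(frame)
            if silent_run >= silence_frames_limit:
                # drop the trailing silence and stop recording
                return recorded[:-silence_frames_limit]
    return recorded if started else []
```

With the patent's example, `silence_frames_limit` would correspond to 20 ms of continuous silence at the chosen frame size.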
In one embodiment, step 204 of determining the current call state scene type according to the voice collection result includes: converting the collected voice data into text content; and parsing the text content to determine the current call state scene type.
Specifically, the call platform may perform speech recognition on the collected voice data through a speech recognition device to convert it into text content. The speech recognition device may send the converted text content to a central control device, which parses the text content and determines the current call state scene type according to the parsing result.
In the above embodiment, converting the voice data into text content for call-state scene analysis reduces the difficulty of analysis, improves the efficiency of scene analysis, and saves the system's analysis resources compared with performing scene analysis directly on the audio.
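The text-based scene decisions described in the following embodiments (empty text means a silence scene; non-empty text while the robot is broadcasting means an abnormal interrupt scene; continuously repeated content means a redundant scene) can be summarized in a small classifier. The labels and the `has_repeats` flag are illustrative, not the patent's identifiers.

```python
def classify_call_state(text, robot_is_broadcasting, has_repeats=False):
    """Map the recognized text plus the robot's call state to a scene type."""
    if text.strip() == "":
        return "silence"
    if robot_is_broadcasting:
        return "abnormal_interrupt"
    if has_repeats:
        return "redundant"
    return "normal"
```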
As shown in fig. 3, in one embodiment, a method for controlling a voice robot response is provided, which specifically includes the following steps:
Step 302, in the process of voice communication between the voice robot and the user terminal, voice collection is performed on the user terminal.
Step 304, the collected voice data is converted into text content, and the text content is parsed.
In step 306, if the text content is empty, it is determined that the current call state scene type is a silence scene.
The silence scene refers to a state in which the user is silent, i.e., not speaking, during the call. The user state confirmation voice is a voice used to inquire about the current state of the user.
Specifically, after converting the collected voice data into text content, the call platform can detect through the central control device whether the text content is empty, i.e., whether it contains any text information. When the text content is empty, the user at the user terminal side did not speak while the voice data was being collected, so the current call state scene type can be determined to be a silence scene.
Step 308, obtaining user state confirmation voice corresponding to the silence scene; the user state confirmation voice is used for confirming whether the user is in the answering state or not.
Step 310, controlling the voice robot to play the user state confirmation voice.
The call platform is preconfigured with a user state confirmation voice for the silence scene; it can acquire the user state confirmation voice corresponding to the silence scene and control the voice robot to play it, so as to ask whether the user is currently still listening.
For example, in a silence scene, the voice robot can be controlled to play the user state confirmation voice "Are you still listening?".
It can be appreciated that a unified user state confirmation voice can be preset for all silence scenes in the call platform. In other embodiments, after identifying a silence scene, the call platform may further determine when the silence scene occurred and play different user state confirmation voices for different timings. For example, when the silence scene occurs after the voice robot plays an active query voice, a voice asking whether the user is still listening may be played, such as "Are you still listening?". When the silence scene occurs after the voice robot plays an answer voice (i.e., a voice answering the user's question), a voice asking whether the user has other questions, or whether the question was answered successfully, may be played, such as "Do you have any other questions?" or "Has your question been answered?".
In the above embodiment, if a silence scene is detected, the voice robot is controlled to play the user state confirmation voice to inquire about the user's current state, which improves the accuracy and naturalness of human-machine interaction.
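Choosing a user state confirmation voice by the timing of the silence, as described above, could look like the sketch below; the prompt texts are illustrative English renderings, not the patent's fixed wording.

```python
def confirmation_prompt(silence_after):
    """Pick the confirmation prompt for a silence scene based on what
    the robot played just before the user went silent."""
    if silence_after == "active_query":
        return "Are you still listening?"
    if silence_after == "answer":
        return "Do you have any other questions?"
    return "Are you still there?"  # unified fallback prompt
```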
In one embodiment, the method further comprises: acquiring the call state of the voice robot. In this embodiment, parsing the text content to determine the current call state scene type includes: if the text content is not empty and the call state of the voice robot is the broadcasting state, determining that the current call state scene type is an abnormal interrupt scene. The call state of the voice robot refers to the state of the voice robot in the call. The broadcasting state means that the voice robot is playing voice to the user terminal. The abnormal interrupt scene refers to a scene in which the user interrupts the voice robot while it is broadcasting normally.
Specifically, if the text content is not empty and the call state of the voice robot is the broadcasting state, the user spoke while the voice robot was playing voice, which indicates that the user wants to interrupt the robot's broadcast; therefore, the current call state scene type can be determined to be an abnormal interrupt scene.
It can be understood that, for the abnormal interrupt scene, the call platform can determine whether interruption is allowed according to the content the voice robot is currently broadcasting, so as to flexibly control the voice robot to respond accordingly.
In the above embodiment, the current call state scene type can be conveniently determined to be an abnormal interrupt scene from the facts that the text content is not empty and the voice robot is in the broadcasting state, without complex scene analysis and processing.
As shown in fig. 4, in one embodiment, a method for controlling a voice robot response is provided, which specifically includes the following steps:
step 402, in the process of voice communication between the voice robot and the user terminal, voice collection is performed on the user terminal.
Step 404, converting the collected voice data into text content and parsing the text content.
If the text content is empty, step 406 is performed. If the text content is not empty and the voice robot call state is the play state, step 412 is performed.
Step 406, determining that the current call state scene type is a silence scene.
Step 408, obtaining user status confirmation voice corresponding to the silence scene.
Step 410, controlling the voice robot to play the user state confirmation voice.
Step 412, determining that the current call state scene type is an abnormal interrupt scene.
Step 414, performing preset tag detection processing on the voice played by the voice robot in the playing state.
Step 416, according to the label detection result, it is determined whether the played voice is allowed to be interrupted. If the played speech is allowed to be interrupted, step 418 is performed, and if the played speech is not allowed to be interrupted, step 420 is performed.
Specifically, after determining that the current call state scene type is an abnormal interrupt scene, the call platform may perform preset tag detection processing on the voice played by the voice robot when the voice robot is in a broadcasting state, and determine whether the played voice is allowed to be interrupted according to a tag detection result.
It can be understood that, in one implementation, when the preset tag is detected the played voice is judged not allowed to be interrupted, and when the preset tag is not detected the played voice is judged allowed to be interrupted. Alternatively, the opposite convention may be used: when the preset tag is detected the played voice is judged allowed to be interrupted, and when it is not detected the played voice is judged not allowed to be interrupted. The present application is not limited in this respect.
Step 418, acquiring a stop broadcasting scheme corresponding to the abnormal interrupt scene, and controlling the voice robot to stop the current broadcast according to the stop broadcasting scheme.
Step 420, a play maintaining scheme corresponding to the abnormal breaking scene is obtained, and the voice robot is controlled to continue the current broadcasting.
Further, if the played voice is allowed to be interrupted, the call platform can acquire a stop broadcasting scheme corresponding to the abnormal interrupt scene, so as to control the voice robot to stop the current broadcast according to the stop broadcasting scheme. It can be understood that, according to the stop broadcasting scheme, the call platform can pause the current broadcast and resume it after the abnormal interrupt scene has been handled, or end the current broadcast outright.
If the played voice is not allowed to be interrupted, the call platform can acquire a maintenance playing scheme corresponding to the abnormal interruption scene, so that the voice robot is controlled to continue the current playing according to the maintenance playing scheme without being interrupted.
In the embodiment, by detecting the tag carried in the voice played by the voice robot, whether the voice is allowed to be interrupted or not can be accurately and rapidly detected, so that the voice robot can be accurately controlled to respond.
In one embodiment, performing preset tag detection processing on a voice played by a voice robot in a playing state includes: determining a voice file in which the played voice is located; and detecting whether the voice file carries a continuous playing label. In this embodiment, according to the tag detection result, determining whether the played voice is allowed to be interrupted includes: if the continuous playing label is detected to be carried, judging that the played voice is not allowed to be interrupted; if the carrying of the continuous playing label is not detected, judging that the played voice is allowed to be interrupted.
The continuous playing label is used for indicating that the voice needs to be continuously played.
Specifically, in the call platform, a continuous playing tag is added in advance, at the granularity of the whole voice file, to key voice files to be played by the voice robot. The call platform can determine the voice file in which the played voice is located and detect whether the voice file carries the continuous playing tag. If the continuous playing tag is detected, the played voice is designated to be played continuously and is not allowed to be interrupted; that is, no voice content in the entire voice file may be interrupted during playback. If the continuous playing tag is not detected, the played voice is not designated for continuous playback and is therefore allowed to be interrupted.
In the above embodiment, by adding the continuous playing tag to the key voice file which is not allowed to be interrupted, whether the currently played voice file is allowed to be interrupted or not can be rapidly judged, so that the voice robot can be accurately controlled to respond.
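For illustration only, the file-level tag check above can be sketched as follows, assuming the voice-file metadata is available as a dictionary; the `continuous_play` key is a hypothetical stand-in for the tag:

```python
# Hypothetical sketch: a file carrying the continuous-play tag must
# finish playing and therefore may not be interrupted.

def file_allows_interrupt(voice_file: dict) -> bool:
    return not voice_file.get("continuous_play", False)
```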
As shown in fig. 5, in one embodiment, a method for controlling a voice robot response is provided, which specifically includes the following steps:
Step 502, in the process of voice communication between the voice robot and the user terminal, voice collection is performed on the user terminal.
Step 504, the collected voice data is converted into text content, and the text content is parsed.
If the text content is empty, step 506 is performed. If the text content is not empty and the call state of the voice robot is the play state, step 512 is performed.
Step 506, determining that the current call state scene type is a silence scene.
Step 508, obtaining user status confirmation voice corresponding to the silence scene.
Step 510, controlling the voice robot to play the user state confirmation voice.
Step 512, determining the current call state scene type as an abnormal interrupt scene.
Step 514, it is detected whether the voice file in which the played voice is located carries a continuous play tag.
If the continuous playing tag is detected, it is determined that the played voice is not allowed to be interrupted, and step 516 is performed. If the continuous playing tag is not detected, step 518 is performed.
Step 516, a play maintaining scheme corresponding to the abnormal breaking scene is obtained, and the voice robot is controlled to continue the current broadcasting.
Step 518, determining a time node to which the voice robot is currently playing in the played voice.
Step 520, it is detected whether the voice content under the time node carries a break prohibition tag. If the interrupt prohibition tag is carried, it is determined that the played voice is not allowed to be interrupted, and step 516 is performed. If the disable interrupt tag is not carried, then it is determined that the played voice is allowed to be interrupted, step 522 is performed.
Step 522, acquiring a stop broadcasting scheme corresponding to the abnormal interrupt scene, and controlling the voice robot to stop the current broadcast according to the stop broadcasting scheme.
The interrupt prohibition tag is a tag which is added for voice content under the granularity of a time node and is not allowed to be interrupted, and is used for indicating that the voice content under the time node is not allowed to be interrupted.
Specifically, an interrupt prohibition tag is added in the call platform in advance to key voice content within a voice file to be played by the voice robot, so that the key voice content is protected from abnormal interruption while it is played. When the call platform does not detect that the voice file in which the played voice is located carries the continuous playing tag, it can further perform a second-level tag detection: determining the time node to which the voice robot has currently played within the played voice, locating the voice content under that time node, and detecting whether that voice content carries the interrupt prohibition tag. If the interrupt prohibition tag is not carried, the played voice is judged to be allowed to be interrupted.
It will be appreciated that if the interrupt prohibition tag is carried, it is determined that the played voice is not allowed to be interrupted; that is, the voice content under the current time node is not allowed to be interrupted. If the voice content under a subsequent time node does not carry the interrupt prohibition tag, broadcasting can be stopped once the non-interruptible voice content under the current node has finished playing.
It should be noted that the interrupt prohibition tag is added to local voice content within the whole voice file, so that interruption is disallowed only at the fine granularity of time nodes rather than throughout the playback of the entire voice file. This makes the control of the voice robot's response more accurate and flexible, avoids the poor user experience caused by disallowing interruption of the whole voice, and improves the accuracy of the controlled response.
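As an illustrative sketch of the two-level check described above, the file-level continuous-play tag is tested first and the per-time-node prohibition spans second; the metadata keys (`continuous_play`, `no_interrupt_spans`, in seconds) are assumptions, not terms from the patent:

```python
# Hypothetical two-level tag check on a voice file's metadata.

def allows_interrupt(voice_file: dict, position_s: float) -> bool:
    if voice_file.get("continuous_play", False):
        return False  # whole file is protected
    for start, end in voice_file.get("no_interrupt_spans", []):
        if start <= position_s < end:
            return False  # voice content under this time node is protected
    return True
```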
In one embodiment, parsing the text content to determine a current call state scene type includes: if the voice robot is not in the broadcasting state and the text content is not empty, repeating content analysis on the text content; if the text content is analyzed to have continuously repeated content in the preset time period, the current conversation state scene type is judged to be a redundant scene. In this embodiment, according to a robot response scheme, controlling a voice robot to respond includes: and controlling the voice robot to play the active interrupt voice corresponding to the redundant scene.
Wherein, the repeated content analysis refers to a process of analyzing whether the text content has the repeated content. The active speech interruption is the speech played by the speech robot to interrupt the speech of the user.
Specifically, if the voice robot is not in the broadcasting state and the text content is not empty, the user alone is speaking, and repeated content analysis can be performed on the text content. If the analysis finds that the text content contains continuously repeated content within the preset time period, the current call state scene type is judged to be a redundant scene. The active interrupt voice is prestored in the call platform, and the call platform can control the voice robot to play the active interrupt voice corresponding to the redundant scene so as to interrupt the user's repeated speech.
For example, in an application scenario where the voice robot makes an outbound collection call to the user, the voice robot can be controlled to play preset voice information to interrupt the user when the user keeps repeating "no money".
In the embodiment, repeated content analysis is performed on the text content, so that redundant scenes can be rapidly and accurately determined, and the accuracy and efficiency of the response of the robot are improved.
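A minimal sketch of the repeated content analysis is given below; the 30-second window and three-repeat threshold are illustrative parameters, not values stated in the patent:

```python
# Hypothetical redundancy check over transcribed utterances.

def is_redundant(utterances, now_s, window_s=30.0, min_repeats=3):
    """utterances: list of (timestamp_s, text) transcribed in this call."""
    recent = [text for ts, text in utterances if now_s - ts <= window_s]
    if len(recent) < min_repeats:
        return False
    # Continuously repeated content: the last few utterances are identical.
    return len(set(recent[-min_repeats:])) == 1
```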
As shown in fig. 6, in one embodiment, a method for controlling a voice robot response is provided, which specifically includes the following steps:
Step 602, starting to record voice data from the time when the initial voice signal of the user terminal is detected until the voice signal of the user terminal is not detected within a continuous preset time.
Step 604, converting the collected voice data into text content, analyzing the text content and obtaining the current call state of the voice robot. If the text content is empty, step 606 is executed, and if the text content is not empty and the call state of the voice robot is the play state, step 610 is executed. If the voice robot is not in the play state and the text content is not empty, step 622 is performed.
Step 606, determining that the current call state scene type is a silence scene.
Step 608, obtaining user state confirmation voice corresponding to the silence scene; and controlling the voice robot to play the user state confirmation voice.
Step 610, determining that the current call state scene type is an abnormal interrupt scene, and determining a voice file in which the played voice is located.
Step 612, it is detected whether the voice file carries a continuous play label.
If the continuous playing tag is detected, it is determined that the played voice is not allowed to be interrupted, and step 620 is executed. If it is not detected that the continuous play label is carried, step 614 is performed.
Step 614, determining a time node to which the voice robot is currently playing in the played voice.
At step 616, it is detected whether the voice content under the time node carries a break-prohibited tag.
If the interrupt prohibition tag is not carried, it is determined that the played voice is allowed to be interrupted, and step 618 is performed. If the interrupt prohibition tag is carried, it is determined that the played voice is not allowed to be interrupted, and step 620 is performed.
Step 618, obtain the stop broadcasting scheme corresponding to the abnormal breaking scene, so as to control the voice robot to stop the current broadcasting.
Step 620, a play maintaining scheme corresponding to the abnormal breaking scene is obtained to control the voice robot to continue the current broadcasting.
Step 622, performing repeated content analysis on the text content; if the text content is analyzed to have continuously repeated content in the preset time period, judging that the current conversation state scene type is a redundant scene; and controlling the voice robot to play the active interrupt voice corresponding to the redundant scene.
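The branching of steps 602 through 622 above can be condensed into a single hypothetical dispatcher; all action names and metadata keys are illustrative assumptions:

```python
# Hypothetical dispatcher mirroring the flow of Fig. 6.

def handle_collected_speech(text, robot_is_broadcasting, voice_file,
                            position_s, recent_texts):
    if not text:                                    # steps 606-608
        return "play_user_status_confirmation"
    if robot_is_broadcasting:                       # steps 610-620
        if voice_file.get("continuous_play", False):
            return "keep_broadcasting"              # file-level tag
        for start, end in voice_file.get("no_interrupt_spans", []):
            if start <= position_s < end:
                return "keep_broadcasting"          # time-node tag
        return "stop_broadcasting"
    if len(recent_texts) >= 3 and len(set(recent_texts[-3:])) == 1:
        return "play_active_interrupt"              # step 622
    return "normal_response"
```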
In one embodiment, determining the current call state scene type according to the voice collection result includes: if the voice signal of the user disappears but the voice signal of the user is detected again within the preset time, the current conversation state scene type is judged to be a long sentence waiting scene. In this embodiment, acquiring a robot response scheme corresponding to a call state scene type includes: splicing the voice signals collected before re-detection and the voice signals collected after re-detection; converting the spliced voice signals into text contents; carrying out semantic recognition on the text content, and acquiring response information corresponding to the semantic recognition result; the response information is converted into response speech.
A long sentence refers to a complete expression comprising a plurality of clauses, in which the speaker needs to pause partway through. The long sentence waiting scene refers to a scene in which the voice robot needs to wait for the user to finish expressing the long sentence.
Specifically, if the voice signal of the user disappears but is detected again within the preset duration, it indicates that the user has only paused after speaking a sentence and has not yet finished speaking, so the current call state scene type can be determined to be a long sentence waiting scene. Then, after the user is detected to have finished speaking, the voice signal collected before re-detection (i.e. the speech before the pause) and the voice signal collected after re-detection (i.e. the speech continued after the pause) can be spliced into a long sentence carrying the user's complete expression. The call platform can then convert the spliced voice signal (i.e. the spliced long sentence voice) into text content, perform semantic recognition on the text content, acquire response information corresponding to the semantic recognition result, and convert the response information into response voice. The generated response voice is the robot response scheme corresponding to the long sentence waiting scene, and the call platform can control the voice robot to play it.
In the above embodiment, the long sentence waiting scene can be accurately detected, avoiding erroneous responses caused by playing voice before the user has finished speaking; moreover, the voice signals collected before and after re-detection are spliced and the response is made according to the spliced content, which improves response accuracy.
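An illustrative splice for the long sentence waiting scene is sketched below: audio segments whose inter-segment pause is within a threshold are treated as one sentence and concatenated. The segment layout and the 2-second pause threshold are assumptions:

```python
# Hypothetical splice of speech segments separated by short pauses.

def splice_long_sentence(segments, max_pause_s=2.0):
    """segments: list of (start_s, end_s, pcm_bytes), in chronological order."""
    spliced = [segments[0][2]]
    for (_, prev_end, _), (cur_start, _, cur_pcm) in zip(segments, segments[1:]):
        if cur_start - prev_end > max_pause_s:
            break  # a long gap ends the sentence
        spliced.append(cur_pcm)
    return b"".join(spliced)
```

The spliced bytes would then be passed to speech-to-text conversion and semantic recognition as described above.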
In one embodiment, determining the current call state scene type according to the voice collection result includes: detecting the signal intensity of the collected voice signal; if the signal strength is lower than the preset threshold, judging the current conversation state scene type as an inaudible scene.
Here, the inaudible scene refers to a scene in which the content of the voice call is inaudible.
Specifically, the call platform can detect the signal strength of the collected voice signal, if the signal strength is lower than a preset threshold value, the voice signal is weak, and the call quality is poor, so that the current call state scene type can be judged to be an inaudible scene.
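A sketch of the signal-strength check follows; the RMS-level computation over 16-bit PCM samples and the -40 dBFS threshold are illustrative assumptions, not values given in the patent:

```python
import math

# Hypothetical inaudible-scene test based on RMS signal level.

def is_too_weak(samples, threshold_dbfs=-40.0):
    """samples: 16-bit PCM sample values."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_dbfs = 20.0 * math.log10(max(rms / 32768.0, 1e-9))
    return level_dbfs < threshold_dbfs
```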
In one embodiment, determining the current call state scene type according to the voice collection result includes: detecting continuity of the collected voice signals; if the continuity detection result is that the signal is discontinuous, the current conversation state scene type is judged to be an inaudible scene.
Here, signal discontinuity means that the signal is repeatedly interrupted and resumed rather than continuous.
Specifically, the call platform can detect the continuity of the collected voice signal; if the continuity detection result indicates that the signal is discontinuous, the signal is unstable and the call quality is poor, so the current call state scene type is judged to be an inaudible scene.
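For illustration, the continuity check can be sketched as counting distinct dropouts (runs of near-silent frames) in the collected signal; the energy threshold and dropout limit are illustrative parameters:

```python
# Hypothetical discontinuity test over per-frame energy values.

def is_discontinuous(frame_energies, silence_thresh=0.01, max_dropouts=2):
    dropouts, in_gap = 0, False
    for energy in frame_energies:
        if energy < silence_thresh:
            if not in_gap:
                dropouts += 1  # entering a new gap
                in_gap = True
        else:
            in_gap = False
    return dropouts > max_dropouts
```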
It can be appreciated that the call platform can control the voice robot to play a preset voice for the inaudible scene. For example, it can play a poor-signal prompt voice such as "The signal seems poor; could you move somewhere else to take the call?", or play a voice such as "We will contact you again later" to end the call.
In the embodiment, the inaudible scene can be accurately identified according to the voice signal, so that the response of the robot can be accurately controlled.
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages need not be performed in sequence, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided an apparatus for controlling a voice robot response, comprising: a voice acquisition module 702, a call state scene recognition module 704, and a robot response module 706, wherein:
The voice acquisition module 702 is configured to perform voice acquisition on the user terminal during a voice call between the voice robot and the user terminal.
The call state scene recognition module 704 is configured to determine a current call state scene type according to the voice acquisition result; the call state scene type is used for representing the state of the voice call between the user corresponding to the user terminal and the voice robot.
A robot response module 706, configured to obtain a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme.
In one embodiment, the voice acquisition module 702 is further configured to start recording voice data from the time when the initial voice signal of the user terminal is detected until the voice signal of the user terminal is not detected within a continuous preset time period, and obtain the recorded voice data.
In one embodiment, the call state scene recognition module 704 is further configured to convert the collected voice data into text content; and analyzing the text content and determining the current conversation state scene type.
In one embodiment, the call state scene recognition module 704 is further configured to determine that the current call state scene type is a silence scene if the text content is empty; the robot response module 706 is further configured to obtain a user status confirmation voice corresponding to the silence scene; and controlling the voice robot to play the user state confirmation voice.
In one embodiment, the call state scene recognition module 704 is further configured to obtain a call state of the voice robot; if the text content is not empty and the call state of the voice robot is a broadcasting state, judging that the current call state scene type is an abnormal breaking scene.
As shown in fig. 8, in one embodiment, the robotic response module 706 includes:
the tag detection module 706a is configured to perform preset tag detection processing on a voice played by the voice robot when the voice robot is in a playing state; and judging whether the played voice is allowed to be interrupted or not according to the label detection result.
A response scheme obtaining module 706b, configured to obtain a stop broadcasting scheme corresponding to the abnormal interruption scene if the played voice is allowed to be interrupted; if the played voice is not allowed to be interrupted, a play maintaining scheme corresponding to the abnormal interrupt scene is obtained.
In one embodiment, the tag detection module 706a is further configured to determine a voice file in which the played voice is located; detecting whether the voice file carries the continuous playing label or not; if the continuous playing label is detected to be carried, judging that the played voice is not allowed to be interrupted; and if the continuous playing label is not detected to be carried, judging that the played voice is allowed to be interrupted.
In one embodiment, the tag detection module 706a is further configured to determine, if the continuous playing tag is not detected to be carried, a time node to which the voice robot is currently playing in the played voice; detecting whether voice content under the time node carries a forbidden interrupt tag; if the interrupt prohibition tag is not carried, judging that the played voice is allowed to be interrupted; if the interrupt prohibition tag is carried, judging that the played voice is not allowed to be interrupted.
In one embodiment, the call state scene recognition module 704 is further configured to perform repeated content analysis on the text content if the voice robot is not in a broadcast state and the text content is not empty; if the text content is analyzed to have continuously repeated content in the preset time period, judging that the current conversation state scene type is the redundant scene; the robot response module 706 is further configured to control the voice robot to play the active interrupt voice corresponding to the redundant scenario.
In one embodiment, the call state scene recognition module 704 is further configured to determine that the current call state scene type is a long sentence waiting scene if the voice signal of the user disappears but the voice signal of the user is detected again within a preset duration; the robot response module 706 is further configured to splice the voice signal collected before re-detection and the voice signal collected after re-detection; converting the spliced voice signals into text contents; carrying out semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result; and converting the response information into response voice.
In one embodiment, the call state scene recognition module 704 is further configured to detect a signal strength of the collected voice signal; if the signal strength is lower than a preset threshold value, judging that the current conversation state scene type is an inaudible scene; or detecting continuity of the collected voice signal; if the continuity detection result is that the signal is discontinuous, the current conversation state scene type is judged to be an inaudible scene.
For specific limitations on the means for controlling the voice robot response, reference may be made to the limitations on the method for controlling the voice robot response hereinabove, and no further description is given here. The above-described means for controlling the voice robot response may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server of a call platform, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing the voices played by the voice robot. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of controlling a voice robot response.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of: during the voice communication process between the voice robot and the user terminal, voice collection is carried out on the user terminal; determining the current call state scene type according to the voice acquisition result; the call state scene type is used for representing states of a user corresponding to the user terminal and the voice robot in voice call; acquiring a robot response scheme corresponding to the call state scene type; and controlling the voice robot to respond according to the robot response scheme.
In one embodiment, the voice acquisition of the user terminal includes: recording voice data from the beginning of detecting the initial voice signal of the user terminal until the voice signal of the user terminal is not detected within a continuous preset time length, and stopping to obtain the recorded voice data.
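The start/stop logic of this recording step can be sketched as a simple endpointing loop: recording begins at the first speech frame and stops after a run of consecutive non-speech frames. The `is_speech` predicate (e.g. a voice activity detector) and the frame-count threshold are assumptions:

```python
# Hypothetical endpointing loop over audio frames.

def record_utterance(frames, is_speech, stop_after=25):
    recorded, started, silent = [], False, 0
    for frame in frames:
        if is_speech(frame):
            started, silent = True, 0   # (re)start the silence counter
        elif started:
            silent += 1
            if silent >= stop_after:
                break                   # preset silence duration reached
        if started:
            recorded.append(frame)
    return recorded
```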
In one embodiment, the determining the current call state scene type according to the voice collection result includes: converting the collected voice data into text content; and analyzing the text content and determining the current conversation state scene type.
In one embodiment, the parsing the text content to determine the current call state scene type includes: if the text content is empty, judging that the current conversation state scene type is a silence scene; the obtaining a robot response scheme corresponding to the call state scene type comprises the following steps: acquiring user state confirmation voice corresponding to the silence scene; the method for controlling the voice robot to respond according to the robot response scheme comprises the following steps: and controlling the voice robot to play the user state confirmation voice.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the call state of the voice robot; the analyzing the text content and determining the current conversation state scene type comprise the following steps: if the text content is not empty and the call state of the voice robot is a broadcasting state, judging that the current call state scene type is an abnormal breaking scene.
In one embodiment, the acquiring a robot response scheme corresponding to the call state scene type includes: performing preset label detection processing on the voice played by the voice robot in a broadcasting state; judging whether the played voice is allowed to be interrupted or not according to the label detection result; if the played voice is allowed to be interrupted, acquiring a stop broadcasting scheme corresponding to the abnormal interruption scene; if the played voice is not allowed to be interrupted, a play maintaining scheme corresponding to the abnormal interrupt scene is obtained.
In one embodiment, the performing preset tag detection processing on the voice played by the voice robot in the playing state includes: determining a voice file in which the played voice is located; detecting whether the voice file carries the continuous playing label or not; judging whether the played voice is allowed to be interrupted or not according to the label detection result comprises the following steps: if the continuous playing label is detected to be carried, judging that the played voice is not allowed to be interrupted; and if the continuous playing label is not detected to be carried, judging that the played voice is allowed to be interrupted.
In one embodiment, if the carrying of the continuous playing tag is not detected, determining that the played voice is allowed to be interrupted includes: if the continuous playing label is not detected to be carried, determining a time node to which the voice robot plays in the played voice currently; detecting whether voice content under the time node carries a forbidden interrupt tag; if the interrupt prohibition tag is not carried, judging that the played voice is allowed to be interrupted. In this embodiment, the computer program when executed by the processor further implements the steps of: if the interrupt prohibition tag is carried, judging that the played voice is not allowed to be interrupted.
In one embodiment, the parsing the text content to determine the current call state scene type includes: if the voice robot is not in a broadcasting state and the text content is not empty, repeating content analysis on the text content; if the text content is analyzed to have continuously repeated content in the preset time period, judging that the current conversation state scene type is the redundant scene; the method for controlling the voice robot to respond according to the robot response scheme comprises the following steps: and controlling the voice robot to play the active interrupt voice corresponding to the redundant scene.
In one embodiment, the determining the current call state scene type according to the voice collection result includes: if the voice signal of the user disappears but the voice signal of the user is detected again within the preset time length, judging that the current conversation state scene type is a long sentence waiting scene; the obtaining a robot response scheme corresponding to the call state scene type comprises the following steps: splicing the voice signals collected before re-detection and the voice signals collected after re-detection; converting the spliced voice signals into text contents; carrying out semantic recognition on the text content, and acquiring response information corresponding to a semantic recognition result; and converting the response information into response voice.
In one embodiment, determining the current call state scene type according to the voice collection result includes: detecting the signal strength of the collected voice signal, and if the signal strength is below a preset threshold, determining that the current call state scene type is an inaudible scene; or detecting the continuity of the collected voice signal, and if the signal is found to be discontinuous, determining that the current call state scene type is an inaudible scene.
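Both tests for the inaudible scene (low signal strength, or gaps in the signal) can be sketched over raw PCM sample values. The RMS strength measure, the threshold, and the zero-run gap criterion are illustrative assumptions; the patent only requires *some* strength and continuity tests.

```python
import math

def is_inaudible(samples, strength_threshold=100.0, max_gap=5):
    """Flag an inaudible scene from a list of PCM sample values.

    Inaudible if either the RMS signal strength falls below the preset
    threshold, or the signal is discontinuous, modelled here as a run of
    at least max_gap consecutive zero samples.
    """
    if not samples:
        return True
    # Strength test: root-mean-square amplitude vs. the preset threshold.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < strength_threshold:
        return True
    # Continuity test: look for a long run of silent (zero) samples.
    gap = 0
    for s in samples:
        gap = gap + 1 if s == 0 else 0
        if gap >= max_gap:
            return True
    return False
```

Either branch alone suffices to classify the scene, matching the "or" in the embodiment; a production system would tune both thresholds to the telephony codec in use.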
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may perform the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in various forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The above examples describe only a few embodiments of the application in detail and are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (18)

1. A method of controlling a voice robot response, the method comprising:
collecting voice from a user terminal during a voice call between the voice robot and the user terminal;
converting the collected voice data into text content;
parsing the text content to determine a current call state scene type, including: if the voice robot is not in a broadcasting state and the text content is not empty, performing repeated-content analysis on the text content; and if the analysis shows that the text content contains continuously repeated content within a preset time period, determining that the current call state scene type is a redundant scene; wherein the call state scene type represents the states, during the voice call, of the user corresponding to the user terminal and of the voice robot, and the redundant scene indicates that the user is in a redundant state during the call;
obtaining a robot response scheme corresponding to the call state scene type, the robot response scheme being a scheme by which the voice robot responds to the voice collected from the user terminal during the voice call;
controlling the voice robot to respond according to the robot response scheme, including: when the call state scene type is the redundant scene, controlling the voice robot to play an active interrupt voice corresponding to the redundant scene; when the call state scene type is an abnormal interruption scene, determining the voice file in which the played voice is located; detecting whether the voice file carries a continuous-play tag; if the continuous-play tag is detected, determining that the played voice is not allowed to be interrupted; if the continuous-play tag is not detected, determining the time node that the voice robot has currently reached in the played voice; detecting whether the voice content at that time node carries an interrupt-prohibition tag; if the interrupt-prohibition tag is not carried, determining that the played voice is allowed to be interrupted; if the interrupt-prohibition tag is carried, determining that the played voice is not allowed to be interrupted; if the played voice is allowed to be interrupted, obtaining a broadcast-stopping scheme corresponding to the abnormal interruption scene; and if the played voice is not allowed to be interrupted, obtaining a play-maintaining scheme corresponding to the abnormal interruption scene;
wherein the continuous-play tag indicates that the voice content of the entire voice file is not allowed to be interrupted during playback, and the interrupt-prohibition tag is a tag added, at the granularity of a time node, to voice content that is not allowed to be interrupted, indicating that the voice content at that time node is not allowed to be interrupted.
2. The method of claim 1, wherein the voice collection from the user terminal comprises:
starting to record voice data when an initial voice signal from the user terminal is detected, and stopping when no voice signal from the user terminal is detected for a continuous preset duration, to obtain the recorded voice data.
3. The method of claim 1, wherein the call state scene type further comprises at least one of a silence scene, a long sentence waiting scene, or an inaudible scene.
4. The method according to claim 3, wherein the parsing the text content to determine the current call state scene type comprises:
if the text content is empty, determining that the current call state scene type is a silence scene;
wherein the obtaining a robot response scheme corresponding to the call state scene type comprises:
obtaining a user state confirmation voice corresponding to the silence scene;
and wherein the controlling the voice robot to respond according to the robot response scheme comprises:
controlling the voice robot to play the user state confirmation voice.
5. A method according to claim 3, characterized in that the method further comprises:
obtaining the call state of the voice robot;
wherein the parsing the text content to determine the current call state scene type comprises:
if the text content is not empty and the call state of the voice robot is a broadcasting state, determining that the current call state scene type is an abnormal interruption scene.
6. The method according to claim 3, wherein the active interrupt voice is voice played by the voice robot to interrupt the user's speech.
7. The method of claim 1, wherein the determining the current call state scene type comprises:
if the voice signal of the user disappears but is detected again within a preset duration, determining that the current call state scene type is a long sentence waiting scene;
and wherein the obtaining a robot response scheme corresponding to the call state scene type comprises:
splicing the voice signals collected before re-detection with the voice signals collected after re-detection;
converting the spliced voice signals into text content;
performing semantic recognition on the text content and obtaining response information corresponding to the semantic recognition result; and
converting the response information into response voice.
8. The method according to any one of claims 1 to 7, wherein the determining the current call state scene type comprises:
detecting the signal strength of the collected voice signal, and if the signal strength is below a preset threshold, determining that the current call state scene type is an inaudible scene;
or
detecting the continuity of the collected voice signal, and if the continuity detection finds the signal discontinuous, determining that the current call state scene type is an inaudible scene.
9. An apparatus for controlling a voice robot response, the apparatus comprising:
the voice acquisition module is configured to collect voice from a user terminal during a voice call between the voice robot and the user terminal;
the call state scene recognition module is configured to convert the collected voice data into text content and parse the text content to determine a current call state scene type, including: if the voice robot is not in a broadcasting state and the text content is not empty, performing repeated-content analysis on the text content; and if the analysis shows that the text content contains continuously repeated content within a preset time period, determining that the current call state scene type is a redundant scene; wherein the call state scene type represents the current call state between the user and the voice robot, and the redundant scene indicates that the user is in a redundant state during the call; and if the text content is not empty and the call state of the voice robot is a broadcasting state, determining that the current call state scene type is an abnormal interruption scene;
the robot response module is configured to obtain a robot response scheme corresponding to the call state scene type and to control the voice robot to respond according to the robot response scheme, including: when the call state scene type is the redundant scene, controlling the voice robot to play an active interrupt voice corresponding to the redundant scene; when the call state scene type is an abnormal interruption scene, determining the voice file in which the played voice is located; detecting whether the voice file carries a continuous-play tag; if the continuous-play tag is detected, determining that the played voice is not allowed to be interrupted; if the continuous-play tag is not detected, determining the time node that the voice robot has currently reached in the played voice; detecting whether the voice content at that time node carries an interrupt-prohibition tag; if the interrupt-prohibition tag is not carried, determining that the played voice is allowed to be interrupted; if the interrupt-prohibition tag is carried, determining that the played voice is not allowed to be interrupted; if the played voice is allowed to be interrupted, obtaining a broadcast-stopping scheme corresponding to the abnormal interruption scene; and if the played voice is not allowed to be interrupted, obtaining a play-maintaining scheme corresponding to the abnormal interruption scene; wherein the continuous-play tag indicates that the voice content of the entire voice file is not allowed to be interrupted during playback, and the interrupt-prohibition tag is a tag added, at the granularity of a time node, to voice content that is not allowed to be interrupted, indicating that the voice content at that time node is not allowed to be interrupted.
10. The apparatus of claim 9, wherein the voice acquisition module is configured to start recording voice data when an initial voice signal from the user terminal is detected, and to stop when no voice signal from the user terminal is detected for a continuous preset duration, thereby obtaining the recorded voice data.
11. The apparatus of claim 9, wherein the call state scene type further comprises at least one of a silence scene, a long sentence waiting scene, or an inaudible scene.
12. The apparatus of claim 11, wherein the call state scene recognition module is configured to determine that the current call state scene type is a silence scene if the text content is empty; and the robot response module is configured to obtain a user state confirmation voice corresponding to the silence scene and to control the voice robot to play the user state confirmation voice.
13. The apparatus of claim 11, wherein the call state scene recognition module is configured to obtain the call state of the voice robot, and to determine that the current call state scene type is an abnormal interruption scene if the text content is not empty and the call state of the voice robot is a broadcasting state.
14. The apparatus of claim 11, wherein the active interrupt voice is voice played by the voice robot to interrupt the user's speech.
15. The apparatus of claim 9, wherein the call state scene recognition module is configured to determine that the current call state scene type is a long sentence waiting scene if the voice signal of the user disappears but is detected again within a preset duration; and the robot response module is configured to splice the voice signals collected before re-detection with those collected after re-detection, convert the spliced voice signals into text content, perform semantic recognition on the text content, obtain response information corresponding to the semantic recognition result, and convert the response information into response voice.
16. The apparatus according to any one of claims 9 to 15, wherein the call state scene recognition module is configured to detect the signal strength of the collected voice signal and, if the signal strength is below a preset threshold, determine that the current call state scene type is an inaudible scene; or to detect the continuity of the collected voice signal and, if the signal is found to be discontinuous, determine that the current call state scene type is an inaudible scene.
17. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 8.
18. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202011130332.3A 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response Active CN112489642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011130332.3A CN112489642B (en) 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response


Publications (2)

Publication Number Publication Date
CN112489642A CN112489642A (en) 2021-03-12
CN112489642B true CN112489642B (en) 2024-05-03

Family

ID=74927119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011130332.3A Active CN112489642B (en) 2020-10-21 2020-10-21 Method, device, equipment and storage medium for controlling voice robot response

Country Status (1)

Country Link
CN (1) CN112489642B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019007245A1 (en) * 2017-07-04 2019-01-10 阿里巴巴集团控股有限公司 Processing method, control method and recognition method, and apparatus and electronic device therefor
CN110427460A (en) * 2019-08-06 2019-11-08 北京百度网讯科技有限公司 Method and device for interactive information
CN111105782A (en) * 2019-11-27 2020-05-05 深圳追一科技有限公司 Session interaction processing method and device, computer equipment and storage medium
WO2020103070A1 (en) * 2018-11-22 2020-05-28 深圳市欢太科技有限公司 Method and apparatus for processing application program, and electronic device
CN111402866A (en) * 2020-03-23 2020-07-10 北京声智科技有限公司 Semantic recognition method and device and electronic equipment
CN111726461A (en) * 2020-06-29 2020-09-29 深圳前海微众银行股份有限公司 Telephone conversation method, device, equipment and computer readable storage medium
CN111752523A (en) * 2020-05-13 2020-10-09 深圳追一科技有限公司 Human-computer interaction method and device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN112489642A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
CN110661927A (en) Voice interaction method and device, computer equipment and storage medium
CN111970409B (en) Voice processing method, device, equipment and storage medium based on man-machine interaction
KR102535790B1 (en) Methods and apparatus for managing holds
CN112201222B (en) Voice interaction method, device, equipment and storage medium based on voice call
CN107886944A (en) A kind of audio recognition method, device, equipment and storage medium
CN113779208A (en) Method and device for man-machine conversation
CN107680592B (en) Mobile terminal voice recognition method, mobile terminal and storage medium
CN112489642B (en) Method, device, equipment and storage medium for controlling voice robot response
CN112087726A (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
US12020724B2 (en) Methods and systems for audio sample quality control
US11740865B2 (en) Agent coordination device, agent coordination method and recording medium
CN114863929B (en) Voice interaction method, device, system, computer equipment and storage medium
CN116975242A (en) Voice broadcast interrupt processing method, device, equipment and storage medium
CN114724587A (en) Voice response method and device
CN116301339A (en) Man-machine interaction method, device, system, electronic equipment and readable storage medium
CN111916072A (en) Question-answering method and device based on voice recognition, computer equipment and storage medium
CN112242139A (en) Voice interaction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant