CN112767931A - Voice interaction method and device - Google Patents

Info

Publication number
CN112767931A
Application number
CN202011458115.7A
Authority
CN (China)
Prior art keywords
voice, space, voice interaction, interaction user, speech
Legal status
Pending
Other languages
Chinese (zh)
Inventors
谢家晖
刘永红
Assignee
Midea Group Co Ltd
Guangdong Midea White Goods Technology Innovation Center Co Ltd
Application filed by Midea Group Co Ltd and Guangdong Midea White Goods Technology Innovation Center Co Ltd
Priority to CN202011458115.7A
Publication of CN112767931A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Telephonic Communication Services

Abstract

The application discloses a voice interaction method and device. The voice interaction method includes the following steps: in response to a voice interaction user entering a second space from a first space, completing speech recognition context inheritance, where the inherited context is built from a first voice of the voice interaction user collected by a first voice device located in the first space; and interacting with a second voice device located in the second space, recognizing and/or interacting with a second voice of the voice interaction user through the inherited speech recognition context, where the second voice is collected by the second voice device. The voice interaction method ensures continuous cross-space voice recognition and/or interaction.

Description

Voice interaction method and device
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a voice interaction method and apparatus.
Background
As speech recognition technology matures, the voice interaction functions of voice devices are growing and improving rapidly. At present, far-field interaction with a user in a single space is mainly handled by single-device far-field speech recognition, which requires, as far as possible, an unobstructed path between the user and the voice device so that the device can pick up sound at an adequate signal-to-noise ratio. However, when a user walks from a space A into an adjacent space B separated by a wall, the direct propagation path is cut off, the pickup signal-to-noise ratio of the device drops sharply, and far-field speech recognition can no longer work normally.
Disclosure of Invention
The application provides a voice interaction method and a voice interaction device for realizing continuous cross-space voice interaction and/or recognition.
In order to achieve the above object, the present application provides a voice interaction method, including:
in response to a voice interaction user entering a second space from a first space, completing speech recognition context inheritance, where the inherited context is built from a first voice of the voice interaction user collected by a first voice device located in the first space;
and interacting with a second voice device located in the second space, and recognizing and/or interacting with a second voice of the voice interaction user through the inherited speech recognition context, where the second voice is collected by the second voice device.
Before responding to the voice interaction user entering the second space from the first space, the method includes the following steps:
acquiring first behavior information of the voice interaction user through the first voice device, and acquiring second behavior information of the voice interaction user through the second voice device;
and confirming, based on the first behavior information and the second behavior information, whether the voice interaction user has entered the second space from the first space.
Confirming whether the voice interaction user has entered the second space from the first space based on the first behavior information and the second behavior information includes:
confirming, based on the first behavior information and the second behavior information, whether the voice interaction user has performed a cross-space behavior;
in response to the cross-space behavior of the voice interaction user, acquiring the second voice, and upon confirming from the second voice and the first voice that both come from the same voice interaction user, confirming that the voice interaction user has entered the second space from the first space; or,
in response to the cross-space behavior of the voice interaction user, acquiring the identity of the voice interaction user corresponding to the second voice, and upon confirming that this identity matches the voice interaction user corresponding to the first voice, confirming that the voice interaction user has entered the second space from the first space; or,
in response to the cross-space behavior of the voice interaction user, acquiring the identity of the voice interaction user corresponding to the second voice and confirming that the voice interaction user has entered the second space from the first space.
The first behavior information may be the time at which the voice interaction user leaves the first space, and the second behavior information the time at which the voice interaction user enters the second space;
confirming whether the voice interaction user has entered the second space from the first space based on the first behavior information and the second behavior information then includes:
calculating the difference between the time of leaving the first space and the time of entering the second space;
and confirming that the voice interaction user has performed a cross-space behavior when the difference satisfies a preset condition.
Alternatively, the first voice device and the second voice device are provided with cameras, the first behavior information is a first image containing the voice interaction user, and the second behavior information is a second image containing the voice interaction user;
confirming whether the voice interaction user has entered the second space from the first space based on the first behavior information and the second behavior information then includes: detecting the voice interaction user in the first image; and tracking the voice interaction user based on the first image and the second image acquired in real time to determine whether the voice interaction user enters the second space from the first space.
Confirming, based on the second voice and the first voice, that the voice interaction users corresponding to the second voice and the first voice are the same includes the following steps:
detecting the second voice with a text-independent voiceprint detection method to determine the identity of the voice interaction user corresponding to the second voice;
and confirming whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice.
The method may further include:
in response to the cross-space behavior of the voice interaction user, sending a wake-up instruction to the second voice device; and/or,
in response to the voice interaction user entering the second space from the first space, sending a closing instruction to the first voice device so that the first voice device returns to a to-be-awakened state.
Interacting with the second voice device located in the second space, and recognizing and/or interacting with the second voice of the voice interaction user through the speech recognition context inheritance, includes:
recognizing the second voice through the inherited speech recognition context, and sending an operation instruction to the device related to the recognition result based on the recognition result of the second voice, so that the device related to the recognition result performs the corresponding operation; or,
sending the speech recognition context inheritance to the second voice device so that the second voice device itself can recognize and/or interact with the second voice through the inherited context.
To achieve the above object, the present application further provides a voice interaction method, including:
in response to a voice interaction user entering a second space from a first space, collecting a second voice of the voice interaction user;
and interacting with a server, and recognizing and/or interacting with the second voice through speech recognition context inheritance, where the inheritance is completed by the server from the first voice of the voice interaction user collected by a first voice device located in the first space.
Before responding to the voice interaction user entering the second space from the first space, the method includes:
sending second behavior information of the voice interaction user to the server so that the server confirms, based on the second behavior information, whether the voice interaction user has entered the second space from the first space.
After sending the second behavior information of the voice interaction user to the server, the method includes:
detecting the second voice to determine the identity of the voice interaction user corresponding to the second voice;
and sending that identity to the server so that the server can confirm whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice; or,
acquiring the identity of the voice interaction user corresponding to the first voice from the first voice device and confirming whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice; and if so, sending the identity of the voice interaction user corresponding to the second voice to the server.
The method may further include:
in response to confirming that the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice, sending a closing instruction to the first voice device so that the first voice device returns to a to-be-awakened state.
Interacting with the server and recognizing and/or interacting with the second voice through the speech recognition context inheritance includes:
obtaining the speech recognition context inheritance from the server, and recognizing and/or interacting with the second voice through the inherited context; or,
sending the second voice to the server, and, in response to an operation instruction, executing the corresponding operation, where the operation instruction is issued by the server based on the recognition result of the second voice, and the recognition result is obtained by the server by recognizing the second voice through the inherited speech recognition context.
To achieve the above object, the present application provides an electronic device including a processor for executing instructions to implement the above method.
To achieve the above object, the present application provides a computer-readable storage medium for storing instructions/program data that can be executed to implement the above-described method.
In these embodiments, in response to the voice interaction user entering the second space from the first space, the server acquires the second voice collected by the second voice device, completes the speech recognition context inheritance, recognizes the second voice through the inherited context, and issues instructions to the device related to the recognition result. Thus, when the voice interaction user enters the second space, the user's voice is collected by the second voice device, which keeps the signal-to-noise ratio of the collected voice high and preserves recognition accuracy; and because the second voice is recognized in light of the user's earlier interaction with the first voice device, continuous cross-space voice recognition and/or interaction is ensured.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a voice interaction method according to the present application;
FIG. 2 is a schematic workflow diagram of the server in an embodiment of the voice interaction method of the present application;
FIG. 3 is a schematic flowchart of a second speech device in an embodiment of a speech interaction method according to the present application;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a voice interaction method according to the present application;
FIG. 5 is a schematic view of a workflow of a server according to another embodiment of the voice interaction method of the present application;
FIG. 6 is a schematic view of a workflow of a second speech device in another embodiment of the speech interaction method of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a computer storage medium according to the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, a voice interaction method and apparatus provided by the present application are described in further detail below with reference to the accompanying drawings and the detailed description.
The voice interaction method of the present application is applied to the situation in which a voice interaction user enters a second space from a first space. Because the first space and the second space are far apart or separated by a wall, the signal-to-noise ratio of the sound of the user in the second space picked up by the first voice device in the first space is greatly reduced, so the first voice device cannot interact normally with the user once the user is in the second space.
To address this, the server can, in response to the voice interaction user entering the second space from the first space, complete the speech recognition context inheritance; it then interacts with a second voice device located in the second space and recognizes the second voice through the inherited context of the interaction between the first voice device and the user, where the second voice is collected by the second voice device, so that the user enjoys continuous voice interaction across spaces. Referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the voice interaction method of the present application. The voice interaction method of this embodiment includes the following steps. It should be noted that the step numbers below are only used to simplify the description and do not limit the execution order; the order of the steps in this embodiment may be changed without departing from the technical idea of the present application.
S101: in response to the voice interaction user entering the second space from the first space, the second voice device collects a second voice of the voice interaction user.
Once the voice interaction user enters the second space, the second voice device can collect the second voice of the user and send it to the server. With the speech recognition context inherited, the server can recognize the second voice through the context of the interaction between the first voice device and the user and send an instruction to the second voice device based on the recognition result, thereby realizing continuous cross-space voice interaction.
In one application scenario, the second voice device is already awakened when the voice interaction user enters the second space, so as long as the user speaks in the second space within a certain time, the second voice device can collect the second voice directly, without having to judge whether the user crossed spaces. It is understood that, in this scenario, the step of collecting the second voice may be performed before, simultaneously with, or after the step of confirming that the voice interaction user has entered the second space from the first space.
In another application scenario, the second voice device is not yet awakened when the voice interaction user enters the second space. In this case, the second voice device may cooperate with the server to confirm that the voice interaction user has entered the second space from the first space, or that the user has performed a cross-space behavior; the second voice device then wakes up automatically in response, so that it can collect the second voice of the user. Of course, in this scenario the second voice device may also perform the wake-up operation upon hearing its wake-up word and then collect the second voice. Here, "the second voice device wakes up automatically" may mean that, once the server confirms that the voice interaction user entered the second space or performed a cross-space behavior, the server sends a wake-up instruction to the second voice device, which executes the wake-up operation in response; it may also mean that the second voice device itself confirms the cross-space behavior and autonomously performs the wake-up operation.
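As a non-authoritative illustration of this wake-up/close flow, the sketch below models the server-driven variant in Python; the message format, the VoiceDevice class, and the transport callable are assumptions made for illustration and are not part of the disclosure.

```python
# Minimal sketch of the wake-up / close protocol described above.
# The transport (a callable that delivers a dict to a device) and the
# message fields are illustrative assumptions, not the patent's wording.

from typing import Callable, Dict

class VoiceDevice:
    def __init__(self, name: str) -> None:
        self.name = name
        self.awake = False          # starts in the to-be-awakened state

    def handle(self, message: Dict[str, str]) -> None:
        if message["type"] == "wake":
            self.awake = True       # start collecting the user's voice
        elif message["type"] == "close":
            self.awake = False      # return to the to-be-awakened state

def on_cross_space_behavior(send: Callable[[str, Dict[str, str]], None],
                            first_device: str, second_device: str) -> None:
    # Wake the device in the space the user is entering, and put the
    # device in the space the user left back into standby.
    send(second_device, {"type": "wake"})
    send(first_device, {"type": "close"})

# Usage with an in-process "network" for illustration:
devices = {"living_room": VoiceDevice("living_room"), "kitchen": VoiceDevice("kitchen")}
on_cross_space_behavior(lambda dev, msg: devices[dev].handle(msg), "living_room", "kitchen")
assert devices["kitchen"].awake and not devices["living_room"].awake
```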
It will be appreciated that, when the voice interaction user enters the second space from the first space, the second voice device or other devices in the second space cooperate with the server to confirm whether the user has indeed entered the second space. As described below, there are several ways to make this confirmation.
In a first implementation, the server has a signal-to-noise ratio calculation function. After the first voice device collects the first voice of the voice interaction user and transmits it to the server, the server calculates the signal-to-noise ratio of the first voice in real time or at intervals. If that signal-to-noise ratio falls below a first threshold, the server first acquires the voice collected by at least one voice device other than the first voice device; it then confirms, by voiceprint detection or similar means, whether that voice contains the voice of the interaction user, and determines the signal-to-noise ratio at which each such device picks up the user. The voice device whose pickup of the user exceeds a second threshold is taken as the second voice device, and the user is confirmed to have entered the space where that device is located, i.e., the second space. The first threshold and the second threshold may be preset, with the first threshold less than or equal to the second threshold.
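The Python sketch below shows one way this server-side SNR comparison could look; the energy-based snr_db estimate, the threshold values, and the assumption that candidate pickups have already passed a voiceprint check are illustrative choices, not the patent's specification.

```python
# Sketch of the first implementation: compare pickup signal-to-noise
# ratios to decide which space the user is now in. The thresholds are
# placeholders; a real system would tune them.

import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Rough SNR estimate from a speech segment and a noise-only segment."""
    p_speech = float(np.mean(speech.astype(np.float64) ** 2))
    p_noise = float(np.mean(noise.astype(np.float64) ** 2)) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def find_second_device(first_snr: float, other_pickups: dict,
                       first_threshold: float = 10.0,
                       second_threshold: float = 15.0):
    """Return the device whose pickup of the user now has the best SNR,
    or None if the user does not appear to have crossed into another space.
    other_pickups maps device_id -> (speech, noise) segments that already
    passed a voiceprint check for this user."""
    if first_snr >= first_threshold:
        return None                 # the first device still hears the user well
    best = None
    for device_id, (speech, noise) in other_pickups.items():
        s = snr_db(speech, noise)
        if s > second_threshold and (best is None or s > best[1]):
            best = (device_id, s)
    return None if best is None else best[0]
```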
It can be understood that the signal-to-noise ratio calculation in the first implementation may instead be performed by the voice device that collected the voice. In that case, after the first voice device calculates the signal-to-noise ratio of the first voice and confirms that it is below the first threshold, the first voice device can directly send a notification carrying the voice interaction user's identity to the other voice devices, so that each of them checks whether its own pickup contains the user's voice and whether the signal-to-noise ratio of that pickup exceeds the second threshold. If the pickup of one voice device does, that device is taken as the second voice device, and the user is confirmed to have entered the second space from the first space. The second voice device can additionally send a cross-space notification to the server to inform it that the user has entered the second space, so that the server completes the speech recognition context inheritance and continuous cross-space voice interaction can be realized.
In a second implementation, cameras are installed in the first space and the second space to collect images of each. The server acquires these images in real time, first detects, using image detection, the voice interaction user interacting with the first voice device in the first space, then tracks the user with a target tracking algorithm to determine the user's real-time position, and finally confirms, based on the image of the second space, that the user has entered the second space where the second voice device is located. The camera in the first space may be internal or external to the first voice device, and likewise the camera in the second space may be internal or external to the second voice device.
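A minimal sketch of this camera-based confirmation, assuming hypothetical detect_user and same_person callables that stand in for whatever detector and re-identification model the system uses; neither is specified in the text.

```python
# Illustrative sketch of the camera-based variant: detect the interacting
# user in the first space, then confirm a matching detection in frames
# from the second space. The detector and person matcher are stubs.

from typing import Callable, Iterable, Optional, Tuple

Box = Tuple[int, int, int, int]  # x, y, w, h bounding box

def user_entered_second_space(frames_space1: Iterable,
                              frames_space2: Iterable,
                              detect_user: Callable[[object], Optional[Box]],
                              same_person: Callable[[object, Box, object, Box], bool]) -> bool:
    """Track the interacting user in space 1, then look for the same
    person (by re-identification) in the images of space 2."""
    anchor = None
    for frame in frames_space1:
        box = detect_user(frame)
        if box is not None:
            anchor = (frame, box)          # most recent sighting in space 1
    if anchor is None:
        return False
    for frame in frames_space2:
        box = detect_user(frame)
        if box is not None and same_person(anchor[0], anchor[1], frame, box):
            return True                    # the same user now appears in space 2
    return False
```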
In a third implementation, the server may obtain the location of the voice interaction user through a wearable device worn by the user, such as a bracelet, a watch, or smart shoes, and then determine from that location that the user has entered the second space from the first space.
In a fourth implementation, the first voice device and the second voice device each acquire behavior information of the voice interaction user through sensors: the first device collects first behavior information and the second device collects second behavior information. Both devices send their behavior information to the server, so that the server can confirm, based on the two, whether the user has entered the second space from the first space.
Specifically, the server may first confirm, based on the first and second behavior information, whether the voice interaction user has performed a cross-space behavior. Then, in response to that behavior, the second voice device or the server determines whether the voice interaction user corresponding to the first voice is the same as the one corresponding to the second voice; if they are the same, the user is confirmed to have entered the second space from the first space, and otherwise not.
The first behavior information may be the distance between the person moving in the first space and the first voice device, and the second behavior information the distance between the voice interaction user and the second voice device. The server may take the time at which the first behavior information exceeds a first threshold as the time the person leaves the first space, and the time at which the second behavior information falls below a second threshold as the time the voice interaction user enters the second space. It then calculates the difference between the time of leaving the first space and the time of entering the second space, and confirms a cross-space behavior when the difference satisfies a preset condition.

In this embodiment, the preset condition is satisfied when the difference exceeds a lower limit. The lower limit may be preset, or may be the length of the shortest path from the first space to the second space divided by the fastest walking pace of a normal person. In addition, the condition may require the difference to also be below an upper limit, to avoid unintended inheritance of the speech recognition context; the upper limit is not restricted and may be, for example, 5 or 6 minutes.

In other embodiments, the first behavior information may directly be the time the person leaves the first space, and the second behavior information the time the voice interaction user enters the second space. Alternatively, the first behavior information may be a status message indicating that the person has left the first space, in which case the server takes the time it receives that message as the leaving time; and the second behavior information may be a status message indicating that the user has entered the second space, in which case the time of receipt serves as the entering time.

The second voice device may collect its behavior information in response to a collection instruction, issued to it by the server or by the first voice device once the first voice device confirms that the person has left the first space. The first and second voice devices may be fitted with distance sensors such as passive infrared (pyroelectric) sensors, ultrasonic ranging sensors, or TOF laser rangefinders, through which each device determines its behavior information.
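A short sketch of the time-difference test under the stated lower/upper-bound condition; the 2.5 m/s pace and 300 s upper bound are illustrative values, not values from the disclosure.

```python
# Sketch of the time-difference test: the user is treated as having
# crossed spaces only if (enter_time - leave_time) lies between a
# physically plausible lower bound and an upper bound that prevents
# stale contexts from being inherited.

def lower_bound_seconds(shortest_path_m: float, fastest_pace_m_per_s: float = 2.5) -> float:
    """Minimum plausible transit time: shortest path / fastest walking speed."""
    return shortest_path_m / fastest_pace_m_per_s

def is_cross_space(leave_time: float, enter_time: float,
                   shortest_path_m: float, upper_bound_s: float = 300.0) -> bool:
    diff = enter_time - leave_time
    return lower_bound_seconds(shortest_path_m) < diff < upper_bound_s

# e.g. leaving at t=100 s and entering at t=104 s with a 6 m corridor:
assert is_cross_space(100.0, 104.0, shortest_path_m=6.0)
```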
Moreover, whether the voice interaction user corresponding to the first voice is the same as the one corresponding to the second voice can be determined in several ways.
For example, the second voice device detects the second voice to determine the identity of the corresponding voice interaction user, acquires the identity of the user corresponding to the first voice from the first voice device, and confirms whether the two are the same.
As another example, the second voice device detects the second voice to determine the identity of the corresponding user and sends that identity to the server, and the server confirms whether it matches the user corresponding to the first voice.
As a further example, the second voice device sends the second voice to the server; the server detects the second voice to determine the identity of the corresponding user and confirms whether it matches the user corresponding to the first voice. It is to be understood that the first voice may be speech captured by the first voice device before the user's entry into the second space is confirmed. Voiceprint detection may be performed on the second voice to determine the identity of the corresponding user; in particular, text-independent voiceprint detection may be used so that the identity can be confirmed accurately regardless of what the user says.
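As a sketch of the text-independent voiceprint comparison, assuming some speaker-embedding model (passed in as embed) and an illustrative similarity threshold; the actual voiceprint method is not specified in the text.

```python
# Sketch of a text-independent voiceprint check: embed each utterance with
# a speaker-embedding model (stubbed as `embed`) and compare by cosine
# similarity. The 0.75 threshold is an assumption for illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def same_speaker(embed, first_voice: np.ndarray, second_voice: np.ndarray,
                 threshold: float = 0.75) -> bool:
    """True if the second voice appears to come from the same user as the
    first. `embed` maps raw audio to a fixed-length speaker embedding."""
    return cosine_similarity(embed(first_voice), embed(second_voice)) >= threshold
```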
It is understood that the first space and the second space may both lie within one interaction space; a voice interaction user may move between multiple spaces within it, at least some of which are provided with voice devices. All voice devices in the interaction space may be connected to the server, and at least some of them may sit on the same local area network and exchange information point to point. The interaction space may be, for example, an apartment, a building, or one floor of a house.
In addition, once the second voice device or the server confirms that the voice interaction user has entered the second space from the first space, it can send a closing instruction to the first voice device, which returns to a to-be-awakened state in response. This prevents the first voice device from staying awake while the second voice device is interacting normally with the user, saving energy and avoiding interference from the first voice device with that interaction.
S102: the second voice device sends the second voice to the server.
After collecting the second voice of the voice interaction user, the second voice device can send it to the server for recognition.
S103: the server completes the speech recognition context inheritance.
After confirming that the voice interaction user has entered the second space from the first space, the server can complete the speech recognition context inheritance, where the inherited context is built from the first voice of the user collected by the first voice device located in the first space; the server then cooperates with the second voice device to recognize and/or interact with the second voice through the inherited context.
In one implementation, the first voice device itself recognizes the first voice and interacts with the user based on the recognition result; the server can then fetch the speech recognition context of that interaction from the first voice device to complete the inheritance.
In another implementation, the first voice device does not recognize the first voice but sends it to the server, and the server performs the recognition, so the server already stores the speech recognition context of the interaction between the user and the first voice device. After confirming that the user has entered the second space, the server retrieves that stored context to complete the inheritance.
S104: the server recognizes the second voice through the inherited speech recognition context and sends an operation instruction to the device related to the recognition result.
Once the speech recognition context inheritance is complete, the second voice of the voice interaction user can be recognized through the inherited context. The time at which the second voice is collected is not limited; for example, the second voice device may collect it before or after step S103.
Optionally, after completing the inheritance, the server recognizes the second voice through the inherited context and then sends an operation instruction to the device related to the recognition result, so that the related device executes the corresponding operation.
It is to be appreciated that recognizing the second voice through the inherited speech recognition context can include: performing speech recognition on the second voice to obtain a raw recognition result, and then combining that result with the inherited context to determine the final recognition result. For example, if recognizing the second voice yields "what is the weather" and the inherited context includes "no traffic congestion in Shanghai Baoshan today", the final recognition result is "what is the weather in Shanghai Baoshan".
When the second voice is conversational content such as "what is the weather in Shanghai Baoshan" or "what music was newly uploaded", the device related to the recognition result is the second voice device itself, and the operation instruction can carry the answer, which the second voice device plays upon receipt. When the second voice is a control command for a device, such as "set the refrigerator's cold compartment to 2 °C" or "set the air conditioner to 27 °C", the device related to the recognition result executes the corresponding operation, such as switching on or off or changing temperature, in response to the operation instruction; that device may be the second voice device or a device other than the second voice device.
For example, suppose the second voice device is the air conditioner in the second space, recognizing the second voice yields "set the freezer compartment to -10 °C", and the inherited context includes "set the refrigerator's cold compartment to 2 °C". Combining the two gives the recognition result "set the refrigerator's freezer compartment to -10 °C", and the server sends an operation instruction to the refrigerator so that it adjusts its freezer compartment to -10 °C.
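To make the context-merging step concrete, the toy sketch below carries a slot from the inherited context into an elliptical follow-up utterance; a real system would use a dialogue-state tracker, and the slot names here are illustrative assumptions.

```python
# Toy illustration of completing an elliptical follow-up utterance with an
# inherited context. The slot names ("place", "device") are assumptions.

from typing import Dict

def resolve_with_context(utterance: str, slots: Dict[str, str]) -> str:
    """Fill in information missing from the new utterance using inherited slots."""
    if "weather" in utterance and "place" in slots and slots["place"] not in utterance:
        return f'{utterance} in {slots["place"]}'
    if "freezer" in utterance and "device" in slots and slots["device"] not in utterance:
        return f'{slots["device"]}: {utterance}'
    return utterance

# Context inherited from the interaction in the first space:
context = {"place": "Shanghai Baoshan", "device": "refrigerator"}
print(resolve_with_context("what is the weather", context))
# -> "what is the weather in Shanghai Baoshan"
print(resolve_with_context("set the freezer compartment to -10 C", context))
# -> "refrigerator: set the freezer compartment to -10 C"
```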
In this embodiment, in response to the voice interaction user entering the second space from the first space, the server acquires the second voice collected by the second voice device, completes the speech recognition context inheritance, recognizes the second voice through the inherited context, and issues instructions to the device related to the recognition result. Thus the user's voice is collected at a high signal-to-noise ratio by the second voice device, recognition remains reliable, and, because the second voice is recognized in light of the user's interaction with the first voice device, continuous cross-space voice recognition and/or interaction is ensured.
Referring to fig. 2, the steps of the voice interaction method as implemented by the server are shown; fig. 2 is a schematic workflow diagram of the server in the first embodiment of the voice interaction method.
S201: in response to the voice interaction user entering the second space from the first space, acquire the second voice of the voice interaction user.
The second voice is collected by the second voice device located in the second space. The second voice device can be a smart home appliance, such as a refrigerator or an air conditioner.
S202: completing speech recognition context inheritance.
The speech recognition context inheritance is built from the first voice of the voice interaction user collected by the first voice device located in the first space.
The first voice device can be a smart home appliance, such as a refrigerator or an air conditioner.
S203: recognize the second voice through the inherited speech recognition context, and send an operation instruction to the device related to the recognition result of the second voice.
The above steps are similar to the corresponding steps in the embodiment shown in fig. 1 and are not described again. In response to the voice interaction user entering the second space from the first space, the server acquires the user's second voice, completes the speech recognition context inheritance, recognizes the second voice through the inherited context, and issues instructions to the device related to the recognition result.
Referring to fig. 3, the steps of the voice interaction method as implemented by the second voice device are shown; fig. 3 is a schematic workflow diagram of the second voice device in the first embodiment of the voice interaction method.
S301: in response to the voice interaction user entering the second space from the first space, collect the second voice of the voice interaction user.
S302: send the second voice to the server, so that the server recognizes the second voice through the inherited speech recognition context.
The speech recognition context inheritance is completed by the server, in response to the voice interaction user entering the second space from the first space, from the first voice of the user collected by the first voice device located in the first space.
It is understood that, after step S302, the second voice device may also receive an operation instruction from the server and perform the corresponding operation, where the instruction is issued by the server to the second voice device based on the recognition result.
The above steps are similar to the corresponding steps in the embodiment shown in fig. 1 and are not described again. In response to the voice interaction user entering the second space from the first space, the second voice device collects the user's second voice and sends it to the server, which recognizes it through the inherited speech recognition context built from the first voice collected by the first voice device in the first space. Thus, when the user enters the second space, the user's voice is collected by the second voice device at a high signal-to-noise ratio, recognition remains reliable, and, because the second voice is recognized in light of the user's interaction with the first voice device, continuous cross-space voice recognition and/or interaction is ensured.
Referring to fig. 4, fig. 4 is a flowchart illustrating a voice interaction method according to a second embodiment of the present application.
S401: in response to the voice interaction user entering the second space from the first space, the second voice device captures a second voice of the voice interaction user.
For details, refer to step S101; they are not repeated here.
S402: the server completes the speech recognition context inheritance.
For details, refer to step S103; they are not repeated here.
S403: the server sends the speech recognition context inheritance to the second voice device.
After completing the speech recognition context inheritance by the method of step S402, the server can send the inherited context to the second voice device, so that the second voice device itself can recognize and/or interact with the second voice through it.
S404: the second voice device recognizes and/or interacts with the second voice through the inherited speech recognition context.
After the second voice device obtains the speech recognition context inheritance, it can recognize the second voice of the voice interaction user through the inherited context.
Optionally, the second voice device recognizes the second voice through the inherited context and then interacts with the voice interaction user based on the recognition result. If the recognition result is an operation instruction for the second voice device itself, the second voice device executes the related operation; if it is an operation instruction for a device other than the second voice device, the second voice device sends the operation instruction to the device related to the recognition result, so that the related device executes the corresponding operation, as illustrated in the sketch below.
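A minimal sketch of this device-side dispatch, with an assumed recognition-result structure (kind, target, action, reply) and transport callables that stand in for the actual interfaces, which the text does not specify.

```python
# Sketch of the device-side dispatch in step S404: after recognizing the
# second voice with the inherited context, the second device answers the
# user, acts on instructions addressed to itself, and forwards the rest.
# Message fields and the forward() transport are illustrative assumptions.

from typing import Callable, Dict

def dispatch(recognition: Dict[str, str], self_id: str,
             answer: Callable[[str], None],
             act: Callable[[str], None],
             forward: Callable[[str, str], None]) -> None:
    kind = recognition["kind"]                 # "query" or "command"
    if kind == "query":
        answer(recognition["reply"])           # voice interaction with the user
    elif recognition["target"] == self_id:
        act(recognition["action"])             # operation on this device
    else:
        forward(recognition["target"], recognition["action"])  # route elsewhere

# Usage: a command for the refrigerator arrives at the air conditioner.
dispatch({"kind": "command", "target": "fridge", "action": "freezer=-10C"},
         self_id="aircon",
         answer=print, act=print,
         forward=lambda dev, action: print(f"send to {dev}: {action}"))
```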
Referring to fig. 5, the steps of the voice interaction method as implemented by the server are shown; fig. 5 is a schematic workflow diagram of the server in the second embodiment of the voice interaction method.
S501: in response to the voice interaction user entering the second space from the first space, complete the speech recognition context inheritance.
The speech recognition context inheritance is built from the first voice of the voice interaction user collected by the first voice device located in the first space.
The first voice device can be a smart home appliance, such as a refrigerator or an air conditioner.
S502: send the speech recognition context inheritance to the second voice device.
After the server completes the speech recognition context inheritance, it can send the inherited context to the second voice device, so that the second voice device can recognize and/or interact with the second voice through it.
The above steps are similar to the corresponding steps in the embodiment shown in fig. 4 and are not described again. In response to the voice interaction user entering the second space from the first space, the server completes the speech recognition context inheritance and sends it to the second voice device, which recognizes and/or interacts with the second voice through the inherited context. Thus the user's voice is collected by the second voice device at a high signal-to-noise ratio, recognition remains reliable, and, because the second voice is recognized in light of the user's interaction with the first voice device, continuous cross-space voice recognition and/or interaction is ensured.
Referring to fig. 6, the steps of the voice interaction method as implemented by the second voice device are shown; fig. 6 is a schematic workflow diagram of the second voice device in the second embodiment of the voice interaction method.
S601: in response to the voice interaction user entering the second space from the first space, a second voice of the voice interaction user is collected.
S602: obtain the speech recognition context inheritance from the server, and recognize and/or interact with the second voice through it.
The speech recognition context inheritance is completed by the server, in response to the voice interaction user entering the second space from the first space, from the first voice of the user collected by the first voice device located in the first space.
The above steps are similar to the corresponding steps in the embodiment shown in fig. 4 and are not described again. In response to the voice interaction user entering the second space from the first space, the second voice device collects the user's second voice, obtains the speech recognition context inheritance from the server, and recognizes and/or interacts with the second voice through the inherited context, which the server built from the first voice collected by the first voice device in the first space. Thus the user's voice is collected by the second voice device at a high signal-to-noise ratio, recognition remains reliable, and continuous cross-space voice recognition and/or interaction is ensured.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application. The electronic device 10 includes a processor 12 configured to execute instructions to implement the voice interaction method described above. For the specific implementation, please refer to the description of the foregoing embodiments, which is not repeated here. The electronic device 10 can ensure continuous speech recognition and/or interaction across spaces.
The processor 12 may also be referred to as a CPU (Central Processing Unit). The processor 12 may be an integrated circuit chip having signal processing capabilities. The processor 12 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 12 may be any conventional processor or the like.
The electronic device 10 may further include a memory 11 for storing the instructions and data required for the processor 12 to operate.
The processor 12 is configured to execute instructions to implement the methods provided by any of the embodiments of the voice interaction method of the present application and any non-conflicting combinations thereof.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application. The computer-readable storage medium 20 stores instructions/program data 21 that, when executed, implement the methods provided by any embodiment of the voice interaction method of the present application and any non-conflicting combination thereof. The instructions/program data 21 may form a program file stored in the storage medium 20 as a software product, so that a computer device (which may be a personal computer, a server, or a network device) or a processor can execute all or part of the steps of the methods of the embodiments of the present application. The storage medium 20 includes media capable of storing program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or terminal devices such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.

Claims (15)

1. A method of voice interaction, the method comprising:
responding to a voice interaction user entering a second space from a first space, and completing voice recognition context inheritance, wherein the voice recognition context inheritance is realized by acquiring a first voice of the voice interaction user through a first voice device located in the first space;
and interacting with a second voice device located in the second space, and recognizing and/or interacting with a second voice of the voice interaction user through the speech recognition context inheritance, wherein the second voice is collected by the second voice device.
2. The method of claim 1, wherein before the user enters the second space from the first space in response to the voice interaction, the method comprises:
obtaining first behavior information of the voice interaction user through the first voice equipment, and obtaining second behavior information of the voice interaction user through the second voice equipment;
confirming whether the voice interaction user enters the second space from the first space based on the first behavior information and the second behavior information.
3. The voice interaction method of claim 2, wherein the confirming whether the voice interaction user enters the second space from the first space based on the first behavior information and the second behavior information comprises:
confirming, based on the first behavior information and the second behavior information, whether the voice interaction user has cross-space behavior; and
in response to the voice interaction user having cross-space behavior, collecting the second voice, and upon confirming, based on the second voice and the first voice, that the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice, determining that the voice interaction user has entered the second space from the first space; or
in response to the voice interaction user having cross-space behavior, acquiring an identity of the voice interaction user corresponding to the second voice, and upon confirming that the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice, determining that the voice interaction user has entered the second space from the first space; or
in response to the voice interaction user having cross-space behavior, acquiring the identity of the voice interaction user corresponding to the second voice, and determining that the voice interaction user has entered the second space from the first space.
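
The three alternatives of claim 3 can be pictured as one confirmation helper; a hedged sketch, with `same_speaker()` standing in for real speaker verification and every parameter name assumed:

```python
# Sketch of the three confirmation alternatives of claim 3 as one helper.
# The parameter names and the same_speaker() stand-in are assumptions, not
# the specification's implementation.
def same_speaker(first_voice, second_voice):
    """Placeholder for real speaker verification (see claim 6's sketch)."""
    return first_voice == second_voice


def confirm_entry(cross_space, first_voice=None, second_voice=None,
                  first_identity=None, second_identity=None):
    if not cross_space:
        return False  # no cross-space behavior, so no handoff at all
    if second_voice is not None:
        # Alternative 1: compare the collected second voice with the first.
        return same_speaker(first_voice, second_voice)
    if first_identity is not None:
        # Alternative 2: compare identities resolved from each voice.
        return second_identity == first_identity
    # Alternative 3: cross-space behavior plus any resolved identity suffices.
    return second_identity is not None


print(confirm_entry(True, first_voice=b"a", second_voice=b"a"))  # True
```
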
4. The voice interaction method of claim 2 or 3, wherein
the first behavior information is a time at which the voice interaction user leaves the first space, and the second behavior information is a time at which the voice interaction user enters the second space; and
the confirming whether the voice interaction user enters the second space from the first space based on the first behavior information and the second behavior information comprises:
calculating a difference between the time of leaving the first space and the time of entering the second space; and
determining that the voice interaction user has cross-space behavior when the difference satisfies a preset condition.
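
A minimal sketch of the time-difference test of claim 4; the 10-second window is an arbitrary example, as the claim leaves the preset condition open:

```python
# Minimal sketch of the preset-condition check of claim 4. The 10-second
# window is an arbitrary example value, not one given in the specification.
MAX_TRANSIT_SECONDS = 10.0


def has_cross_space_behavior(time_left_first: float,
                             time_entered_second: float) -> bool:
    # A small, non-negative gap between leaving space 1 and entering
    # space 2 is treated as one continuous cross-space movement.
    diff = time_entered_second - time_left_first
    return 0.0 <= diff <= MAX_TRANSIT_SECONDS


print(has_cross_space_behavior(100.0, 104.5))  # True: 4.5 s transit
print(has_cross_space_behavior(100.0, 200.0))  # False: gap too long
```
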
5. The voice interaction method of claim 2, wherein the first voice device and the second voice device are each provided with a camera, the first behavior information is a first image containing the voice interaction user, and the second behavior information is a second image containing the voice interaction user; and
the confirming whether the voice interaction user enters the second space from the first space based on the first behavior information and the second behavior information comprises: detecting the voice interaction user from the first image; and tracking the voice interaction user based on the first image and the second image acquired in real time, so as to determine whether the voice interaction user enters the second space from the first space.
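
A rough sketch of the camera-based tracking of claim 5, with hypothetical stand-ins for person detection and appearance matching; a real system would run detection and re-identification models on the two camera feeds:

```python
# Rough sketch of the camera-based confirmation of claim 5. The Detection
# type and appearance_match() are hypothetical stand-ins, not the patent's
# implementation.
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Detection:
    embedding: Tuple[float, ...]  # appearance feature of one person


def appearance_match(a: Detection, b: Detection, thresh: float = 0.9) -> bool:
    """Stand-in re-identification: dot product of L2-normalized features."""
    return sum(x * y for x, y in zip(a.embedding, b.embedding)) >= thresh


def entered_second_space(user_in_first_image: Detection,
                         second_image_stream: Iterable[Detection]) -> bool:
    # Track the user detected in the first image across the real-time
    # second images; a match means the user entered the second space.
    return any(appearance_match(user_in_first_image, d)
               for d in second_image_stream)


seen = entered_second_space(Detection((1.0, 0.0)),
                            [Detection((0.95, 0.31))])
print(seen)  # True: dot product 0.95 >= 0.9
```
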
6. The voice interaction method of claim 3, wherein
the confirming, based on the second voice and the first voice, that the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice comprises:
detecting the second voice by using a text-independent voiceprint detection method to determine the identity of the voice interaction user corresponding to the second voice; and
confirming whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice.
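
One plausible reading of the text-independent voiceprint check of claim 6 compares speaker embeddings by cosine similarity; the embedding source and the 0.75 threshold below are assumptions, not values from the specification:

```python
# Sketch of the text-independent voiceprint check of claim 6. The speaker
# embeddings would come from a real model (e.g. an x-vector network); the
# cosine-similarity measure and the 0.75 threshold are common choices, not
# values taken from the specification.
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den


def same_voice(first_embedding, second_embedding, threshold=0.75):
    # Text-independent: only speaker characteristics are compared, not the
    # words spoken, so any utterance in the second space can be used.
    return cosine(first_embedding, second_embedding) >= threshold


print(same_voice([0.9, 0.1, 0.4], [0.8, 0.2, 0.5]))  # True for this toy data
```
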
7. The voice interaction method of claim 3, further comprising:
in response to the voice interaction user having cross-space behavior, sending a wake-up instruction to the second voice device; and/or
in response to the voice interaction user entering the second space from the first space, sending a closing instruction to the first voice device, so that the first voice device returns to a to-be-awakened state.
8. The voice interaction method of claim 1, wherein the interacting with the second voice device located in the second space and recognizing and/or interacting with the second voice of the voice interaction user through the speech recognition context inheritance comprises:
recognizing the second voice through the speech recognition context inheritance, and sending an operation instruction to a device related to a recognition result of the second voice based on the recognition result, so that the device related to the recognition result performs a corresponding operation based on the operation instruction; or
sending the speech recognition context inheritance to the second voice device, so that the second voice device recognizes and/or interacts with the second voice through the speech recognition context inheritance.
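
A sketch of the first alternative of claim 8, in which the server recognizes the second voice with the inherited context and dispatches an operation instruction; the recognizer signature, context layout, and device registry are all hypothetical:

```python
# Sketch of the first alternative of claim 8: the server recognizes the
# second voice using the inherited context, then dispatches an operation
# instruction to the device the result refers to. All names assumed.
def handle_second_voice(audio, ctx, recognize, devices):
    # recognize(audio, ctx) -> (target_device, command); the context lets
    # it resolve references like "turn it off" to a concrete device.
    target, command = recognize(audio, ctx)
    devices[target].send(command)  # operation instruction to that device


class FakeDevice:
    def send(self, command):
        print(f"executing: {command}")


# Example: "it" resolves via slots carried over from the first space.
handle_second_voice(
    audio=b"...",
    ctx={"slots": {"device": "bedroom_light"}},
    recognize=lambda a, c: (c["slots"]["device"], "turn_off"),
    devices={"bedroom_light": FakeDevice()},
)
```
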
9. A method of voice interaction, the method comprising:
in response to a voice interaction user entering a second space from a first space, collecting a second voice of the voice interaction user; and
interacting with a server, and recognizing and/or interacting with the second voice through speech recognition context inheritance, wherein the speech recognition context inheritance is completed by the server based on a first voice of the voice interaction user collected by a first voice device located in the first space.
10. The voice interaction method of claim 9, wherein, before the responding to the voice interaction user entering the second space from the first space, the method comprises:
sending second behavior information of the voice interaction user to the server, so that the server confirms, based on the second behavior information, whether the voice interaction user enters the second space from the first space.
11. The voice interaction method of claim 10, wherein the sending the second behavior information of the voice interaction user to the server comprises:
detecting the second voice to determine the identity of the voice interaction user corresponding to the second voice, and sending the identity of the voice interaction user corresponding to the second voice to the server, so that the server confirms whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice; or
acquiring the identity of the voice interaction user corresponding to the first voice from the first voice device, confirming whether the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice, and if they are the same, sending the identity of the voice interaction user corresponding to the second voice to the server.
12. The voice interaction method of claim 10, further comprising:
in response to confirming that the voice interaction user corresponding to the second voice is the same as the voice interaction user corresponding to the first voice, sending a closing instruction to the first voice device, so that the first voice device returns to a to-be-awakened state.
13. The voice interaction method of claim 9, wherein the interacting with the server and recognizing and/or interacting with the second voice through the speech recognition context inheritance comprises:
obtaining the speech recognition context inheritance from the server, and recognizing and/or interacting with the second voice through the speech recognition context inheritance; or
sending the second voice to the server, and in response to an operation instruction, performing a corresponding operation based on the operation instruction, wherein the operation instruction is issued by the server based on a recognition result of the second voice, and the recognition result is obtained by the server recognizing the second voice through the speech recognition context inheritance.
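
A sketch of the second alternative of claim 13 from the second voice device's side; the transport is faked with direct calls and every name is an assumption:

```python
# Sketch of the second alternative of claim 13, seen from the second voice
# device: upload the second voice, then perform whatever operation the
# server's instruction names. All names here are assumptions.
class FakeServer:
    def submit(self, audio, reply_to):
        # The server recognizes the audio through the inherited context and
        # issues an operation instruction based on the recognition result.
        reply_to("set_temperature:22")


class SecondDevice:
    def __init__(self, server):
        self.server = server

    def on_second_voice(self, audio):
        self.server.submit(audio, reply_to=self.execute)

    def execute(self, instruction):
        print(f"device performs: {instruction}")


SecondDevice(FakeServer()).on_second_voice(b"...")
# -> device performs: set_temperature:22
```
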
14. An electronic device, comprising a processor, wherein the processor is configured to execute instructions to implement the voice interaction method of any one of claims 1-13.
15. A computer-readable storage medium storing instructions/program data that, when executed, implement the voice interaction method of any one of claims 1-13.
CN202011458115.7A 2020-12-10 2020-12-10 Voice interaction method and device Pending CN112767931A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011458115.7A CN112767931A (en) 2020-12-10 2020-12-10 Voice interaction method and device

Publications (1)

Publication Number Publication Date
CN112767931A true CN112767931A (en) 2021-05-07

Family

ID=75693778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011458115.7A Pending CN112767931A (en) 2020-12-10 2020-12-10 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN112767931A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113485137A (en) * 2021-06-30 2021-10-08 青岛海尔空调器有限总公司 Method and device for interaction between household appliance and user, household appliance and readable storage medium
WO2023109910A1 (en) * 2021-12-17 2023-06-22 华为技术有限公司 Electronic device and voice transmission method therefor, and medium
TWI832078B (en) * 2021-05-14 2024-02-11 新加坡商聯發科技(新加坡)私人有限公司 Voice device and voice interaction method thereof, and computer accessible recording media

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101440993A (en) * 2008-12-22 2009-05-27 浙江大学城市学院 Energy-saving type air conditioner automatic control switch and control method thereof
CN105652704A (en) * 2014-12-01 2016-06-08 青岛海尔智能技术研发有限公司 Playing control method for household background music
CN106875946A (en) * 2017-03-14 2017-06-20 巨数创新(深圳)科技有限公司 Voice command interactive system
CN108259280A (en) * 2018-02-06 2018-07-06 北京语智科技有限公司 A kind of implementation method, the system of Inteldectualization Indoors control
CN108986821A (en) * 2018-08-23 2018-12-11 珠海格力电器股份有限公司 Method and equipment for setting relation between room and equipment
CN110085233A (en) * 2019-04-08 2019-08-02 广东美的制冷设备有限公司 Sound control method and its device, electronic equipment and computer readable storage medium
CN110096251A (en) * 2018-01-30 2019-08-06 钉钉控股(开曼)有限公司 Exchange method and device
CN111142095A (en) * 2020-01-16 2020-05-12 三星电子(中国)研发中心 Indoor positioning system, method and device

Similar Documents

Publication Publication Date Title
CN112767931A (en) Voice interaction method and device
US20220223150A1 (en) Voice wakeup method and device
CN106225174B (en) Air conditioner control method and system and air conditioner
US11514917B2 (en) Method, device, and system of selectively using multiple voice data receiving devices for intelligent service
US20190138795A1 (en) Automatic Object Detection and Recognition via a Camera System
US20130241830A1 (en) Gesture input apparatus, control program, computer-readable recording medium, electronic device, gesture input system, and control method of gesture input apparatus
US10901380B2 (en) Apparatus and method for controlling operation of home appliance, home appliance and method for operating of home appliance
CN106488335A (en) Live-broadcast control method and device
US20200099545A1 (en) Method and apparatus for providing notification by interworking plurality of electronic devices
CN110673896A (en) Near field communication card activation method and device
WO2022110614A1 (en) Gesture recognition method and apparatus, electronic device, and storage medium
EP3236469B1 (en) Object monitoring method and device
CN108597164B (en) Anti-theft method, anti-theft device, anti-theft terminal and computer readable medium
CN105050029A (en) Pairing connection method and system of terminal equipment
CN112130918A (en) Intelligent device awakening method, device and system and intelligent device
CN111291671B (en) Gesture control method and related equipment
CN111796979B (en) Data acquisition strategy determining method and device, storage medium and electronic equipment
JP2022542413A (en) Projection method and projection system
WO2018107389A1 (en) Method and apparatus for joint assistance by means of voice, and robot
CN114338585A (en) Message pushing method and device, storage medium and electronic device
CN112634895A (en) Voice interaction wake-up-free method and device
US20170316669A1 (en) Information processing device, information processing method, and computer program
CN112585675A (en) Method, apparatus and system for selectively using a plurality of voice data receiving apparatuses for intelligent service
TW202343931A (en) Electronic device, operating system and power supply method
CN117412238A (en) Equipment positioning method and movable electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination