CN111694433A - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN111694433A
Authority
CN
China
Prior art keywords
user
voice
voice interaction
information
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010530888.5A
Other languages
Chinese (zh)
Other versions
CN111694433B (en)
Inventor
陈世伟 (Chen Shiwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010530888.5A priority Critical patent/CN111694433B/en
Publication of CN111694433A publication Critical patent/CN111694433A/en
Application granted granted Critical
Publication of CN111694433B publication Critical patent/CN111694433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice interaction method and apparatus, an electronic device, and a storage medium, relating to voice, natural language processing, and image processing technologies. The specific implementation scheme is as follows: when a voice signal is detected to contain interaction information, a plurality of voice interaction users who uttered voice signals are determined according to the sound source positions of the voice signals and auxiliary information detected by a sensor; a tag is set for the interaction information in each voice signal, the tag corresponding to the user who uttered that voice signal; feedback information for the interaction information is generated; and the feedback information is played to the voice interaction user corresponding to the tag. This solves the problem that multiple persons cannot perform voice interaction simultaneously, improves voice interaction efficiency when multiple persons are present, and also improves the intelligence of voice interaction.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a method and an apparatus for voice interaction, an electronic device, and a storage medium.
Background
In current vehicle-mounted voice systems on the market, only one passenger can perform voice interaction at a time. When other passengers in the vehicle also intend to interact by voice, a new voice interaction session can be started only after the previous voice interaction completes or after the system is woken up by voice again.
Disclosure of Invention
The application provides a voice interaction method, a voice interaction device, electronic equipment and a storage medium, and relates to the fields of voice technology, natural language processing, image processing and the like.
According to an aspect of the present application, there is provided a method of voice interaction, comprising the steps of:
under the condition that the voice signals are detected to contain interaction information, determining a plurality of voice interaction users sending the voice signals according to the sound source positions of the voice signals and auxiliary information detected by a sensor;
setting a label for interactive information in the voice signal, wherein the label corresponds to a voice interactive user sending the voice signal;
generating feedback information of the interaction information;
and playing feedback information to the voice interaction user corresponding to the label.
According to another aspect of the application, there is provided an apparatus for voice interaction, comprising the following components:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting tags for the interactive information in the voice signals, and the tags correspond to voice interactive users sending the voice signals;
the feedback information generating module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the label.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
The technology of the present application solves the problem that multiple persons cannot perform voice interaction simultaneously, improves voice interaction efficiency in multi-person scenarios, and also improves the intelligence of voice interaction.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a method of voice interaction according to a first embodiment of the present application;
FIG. 2 is a flow chart of assistance information determination according to a first embodiment of the present application;
FIG. 3 is a flow chart of voice interaction user determination according to a first embodiment of the present application;
fig. 4 is a flowchart of playing feedback information according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for voice interaction according to a second embodiment of the present application;
FIG. 6 is a schematic diagram of a voice interaction user determination module according to a second embodiment of the present application;
FIG. 7 is a schematic diagram of a voice interaction user determination module according to a second embodiment of the present application;
fig. 8 is a schematic diagram of a feedback information playing module according to a second embodiment of the present application;
FIG. 9 is a block diagram of an electronic device for implementing a method of voice interaction of an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, the present application provides a method of voice interaction, comprising the steps of:
s101: and under the condition that the voice signals contain interaction information, determining a plurality of voice interaction users sending the voice signals according to the sound source positions of the voice signals and the auxiliary information detected by the sensor.
S102: and setting a label for the interactive information in the voice signal, wherein the label corresponds to the voice interactive user sending the voice signal.
S103: and generating feedback information of the interaction information.
S104: and playing feedback information to the voice interaction user corresponding to the label.
The method can be applied to in-vehicle, meeting, or home scenarios, among others. Taking an in-vehicle scenario as an example, the method may be executed by an on-board computer. Assume the vehicle carries four occupants: a driver on the front left, a first passenger on the front right, a second passenger on the rear left, and a third passenger on the rear right.
The interaction information may be a wake-up word that initiates a voice interaction, or a statement with an explicit interaction intent. For example, statements with explicit interaction intent include: "turn the air conditioner temperature down a little", "open the window", "where is the current position", and the like.
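For illustration only, a naive detector for such interaction information might look like the following sketch; the wake word and intent patterns are invented placeholders, and a real system would rely on speech recognition and natural language understanding rather than string matching:

```python
import re

# Purely illustrative stand-in (not the patent's method): treat a wake word
# or a simple explicit-intent pattern as "interaction information".
WAKE_WORDS = {"hello car"}                                  # assumed wake word
INTENT_PATTERNS = [r"\bair condition", r"\bwindow\b", r"\bwhere\b"]

def contains_interaction_info(utterance: str) -> bool:
    u = utterance.lower().strip("?!. ")
    return u in WAKE_WORDS or any(re.search(p, u) for p in INTENT_PATTERNS)

print(contains_interaction_info("Where is the current position?"))  # True
print(contains_interaction_info("Nice weather today"))              # False
```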
When the voice signal is detected to contain interaction information, a confirmation procedure for the voice interaction user can be initiated. For example, sound source localization can be performed on the sound waves detected by a microphone array installed in the vehicle, yielding the approximate position of the sound source of the voice signal containing the interaction information.
The auxiliary information detected by a sensor may be, for example, information from an in-vehicle seat sensor indicating that a seat is occupied.
In addition, the auxiliary information detected by a sensor may also include which vehicle occupant is speaking, as detected from the (dynamic) image captured by an image sensor using image recognition techniques.
By jointly weighing the sound source position of the voice signal and the auxiliary information detected by the sensor, at least one voice interaction user who uttered the voice signal containing the interaction information can be determined.
For example, suppose two voice interaction users are determined: the first passenger on the front right and the third passenger on the rear right. A tag can then be set for each of them. The tag may be information relating to a seat or a position in the vehicle, etc. Once the tags are set, each piece of a user's interaction information carries the corresponding tag.
For each piece of interaction information, feedback information can be obtained locally or through cloud communication. For example, if the interaction information in the voice signal of the first passenger on the front right is "where is the current position", the vehicle's current position can be determined by the on-board GPS or by communicating with a satellite positioning server, and that position serves as the feedback information. If the interaction information in the voice signal of the third passenger on the rear right is "turn the air conditioner temperature down a little", a temperature control instruction can be generated directly to adjust the vehicle's air conditioning.
When the feedback information can be played aloud, it can be broadcast to the voice interaction user corresponding to the tag. For example, if the interaction information in the voice signal of the first passenger on the front right is "where is the current position", the tag identifies that passenger as the one who asked the question, so the loudspeaker closest to that passenger can be selected to play the feedback information.
This scheme enables multiple voice interaction users to interact simultaneously, improving voice interaction efficiency in multi-person scenarios as well as the intelligence of voice interaction.
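As a minimal, purely illustrative sketch (none of the names below come from the patent; the seat-tag scheme and all helpers are assumptions), steps S102 to S104 could be wired together as follows once step S101 has identified the speaking users:

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    seat: str   # the tag: which seat the interaction information came from (S102)
    text: str   # the interaction information itself

def generate_feedback(text: str) -> str:
    """S103: produce feedback locally or via the cloud (stubbed here)."""
    return f"(feedback for: {text})"

def play_to_nearest_speaker(feedback: str, seat: str) -> None:
    """S104: route feedback to the loudspeaker nearest the tagged seat (stubbed)."""
    print(f"[speaker near {seat}] {feedback}")

# S101 (sound-source localization fused with sensor auxiliary information)
# is assumed to have produced these tagged events already.
events = [InteractionEvent("front-right", "where is the current position"),
          InteractionEvent("rear-right", "turn the air conditioner down")]
for ev in events:
    play_to_nearest_speaker(generate_feedback(ev.text), ev.seat)
```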
As shown in fig. 2, in one embodiment, the sensor includes an image sensor, and the determining of the auxiliary information includes:
s201: the image detected by the image sensor is recognized, and each user in the image is confirmed.
S202: and obtaining the probability of the voice signal sent by each user according to the facial features of each user.
S203: and determining the probability of the voice signal sent by each user as the auxiliary information.
An image, which may be a dynamic image, is acquired from the image sensor. Each user in the dynamic image can be identified using face recognition. In an in-vehicle scenario, the users are the occupants. The probability that each user uttered the voice signal can then be obtained from that user's facial features, such as the frequency of mouth movements or the facial expression.
In addition, a probability threshold may be set in advance, and a user's speaking probability is kept as auxiliary information only when it exceeds this threshold. This reduces the amount of computation, and hence the time, needed in subsequent steps.
By this scheme, the image detected by the image sensor assists in recognizing the voice interaction user, which can improve recognition accuracy.
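A toy sketch of this step follows; the mapping from mouth-movement frequency to speaking probability and the threshold value are invented stand-ins, since the patent does not specify the model:

```python
# Illustrative only: speaking probability from a single facial feature
# (mouth-movement frequency in Hz). All numeric values are assumptions.
PROB_THRESHOLD = 0.5  # assumed pre-set probability threshold

def speaking_probability(mouth_move_hz: float) -> float:
    # Heuristic: more frequent mouth movement -> higher speaking probability,
    # saturating around 4 Hz (an assumption).
    return min(1.0, mouth_move_hz / 4.0)

def auxiliary_info(mouth_freqs: dict[str, float]) -> dict[str, float]:
    # Keep only users whose speaking probability exceeds the threshold,
    # mirroring the pre-filtering described above.
    probs = {seat: speaking_probability(hz) for seat, hz in mouth_freqs.items()}
    return {seat: p for seat, p in probs.items() if p > PROB_THRESHOLD}

print(auxiliary_info({"front-right": 3.4, "rear-right": 3.5, "rear-middle": 0.2}))
# -> {'front-right': 0.85, 'rear-right': 0.875}
```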
In one embodiment, the sensor comprises a seat sensor, and the determining of the auxiliary information further comprises:
the information that the seat detected by the seat sensor is occupied is determined as the auxiliary information.
The seat sensor may be a film-type contact sensor whose contacts are evenly distributed over the bearing surface of the vehicle seat. When the seat bears sufficient external weight, a trigger signal is generated. This trigger signal serves as the information that the seat is occupied, i.e., it is determined as the auxiliary information.
This scheme uses the seat-occupancy information detected by the seat sensor to assist in identifying the voice interaction user, which can improve identification accuracy.
As shown in fig. 3, in one embodiment, step S101 includes:
s1011: and determining a first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal.
S1012: and determining a second probability that the user positioned at each position is the voice interaction user according to the auxiliary information.
S1013: a weighted sum of the first probability and the second probability for each location is calculated using pre-assigned weights.
S1014: and in the case that the weighted sum is larger than the preset threshold value, determining the user positioned at the corresponding position as the voice interaction user sending the voice signal.
Sound source localization is performed using the sound waves detected by the microphone array, yielding the approximate position of the sound source of the voice signal containing the interaction information. For example, suppose a microphone array is installed at the center of the vehicle and the first passenger on the front right and the third passenger on the rear right speak simultaneously. Used as a directional array, the microphone array first divides the localization area into a grid; the relative sound pressure of each grid cell is obtained from the time delays of the received source signals, and a holographic color image for sound source localization is finally determined from these relative sound pressures. This holographic color image is fed to a pre-trained localization probability model, which outputs the probability that the sound source lies at each position. The localization probability model can be trained on holographic color image samples paired with sound source localization samples, so that the trained model can derive from a holographic color image the probability of the sound source being at each position. That is, a first probability that the user at each position is a voice interaction user is obtained.
Continuing the passenger example, the probability that the sound source is on the front right might be 95%, the probability that it is on the rear right 90%, and the probability that it is in the rear middle 40%, with a probability of 0 for every other seat.
In addition, a second probability that the user at each position is a voice interaction user can be determined from the image detected by the image sensor. Continuing the example, the dynamic image might indicate a probability of 85% that the front-right occupant is speaking and 88% that the rear-right occupant is speaking; these are the second probabilities for those two positions. Since the rear-middle seat is actually empty, the probability derived from the dynamic image that a voice interaction user is in the rear middle is 0, and since the other occupants are not speaking, their second probabilities derived from the dynamic image are negligible.
Furthermore, the seat sensor's occupancy information can also yield a second probability that the user at a position is a voice interaction user. In the example, with the front-right and rear-right seats detected as occupied, the second probability from the seat sensor is 100% for each of those two positions; since no occupant is in the rear middle, the corresponding probability there is 0.
Weights are assigned to the first and second probabilities in advance. For example, the first probability may be weighted more heavily than the second probabilities; likewise, the second probability derived from the dynamic image may be weighted more heavily than the one derived from the seat sensor.
The weighted sum E of the first and second probabilities for a given position is: E = q1*W1 + q2*W2 + q3*W3, where W1 is the first probability, determined from the sound source position of the voice signal, that the voice interaction user is at that position; W2 is the second probability determined from the auxiliary information detected in the dynamic image; W3 is the second probability determined from the auxiliary information detected by the seat sensor; and q1, q2, q3 are the weights of W1, W2, W3, respectively.
When the weighted sum E exceeds a predetermined threshold, the user at that position is determined to be a voice interaction user who uttered the voice signal. In the example above, since the weighted sums computed for the front-right and rear-right positions exceed the threshold, the users at those two positions are determined to be the voice interaction users who spoke.
With this scheme, voice interaction users are determined by combining the voice sound source position with the auxiliary information detected by the sensors, which improves the accuracy of the determination result.
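Putting the example figures together, a sketch of the weighted-sum decision of steps S1011 to S1014 might look as follows; the weights q1, q2, q3 and the decision threshold are assumed values that the patent leaves unspecified:

```python
# Sketch of S1011-S1014. W1: sound-source localization; W2: dynamic image;
# W3: seat sensor. Weights and threshold below are assumptions.
Q1, Q2, Q3 = 0.5, 0.3, 0.2     # assumed weights q1, q2, q3
THRESHOLD = 0.8                # assumed decision threshold

def weighted_sum(w1: float, w2: float, w3: float) -> float:
    return Q1 * w1 + Q2 * w2 + Q3 * w3

# Example figures from the text: (W1, W2, W3) per seat.
candidates = {
    "front-right": (0.95, 0.85, 1.0),
    "rear-right":  (0.90, 0.88, 1.0),
    "rear-middle": (0.40, 0.0,  0.0),
}
speakers = [seat for seat, (w1, w2, w3) in candidates.items()
            if weighted_sum(w1, w2, w3) > THRESHOLD]
print(speakers)  # -> ['front-right', 'rear-right']
```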
As shown in fig. 4, in one embodiment, step S104 includes:
s1041: the location of each voice interaction user is determined.
S1042: and determining the loudspeaker closest to each voice interaction user according to the position distribution of each loudspeaker.
S1043: and respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the label.
Once the voice interaction users are determined, each user's position is determined from the sound source position and the auxiliary information, for example the front right and rear right in the earlier passenger example.
From the speaker position distribution acquired in advance, the loudspeaker closest to each voice interaction user can be determined. This position distribution may come from position information entered by a user, from downloaded vehicle configuration information, or the like.
From the tags, the voice interaction users and their positions can be identified, so the feedback information can be sent to the loudspeaker closest to each voice interaction user for playback, realizing the voice interaction.
This scheme supports simultaneous interaction by multiple voice interaction users while reducing interference between them.
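A sketch of the nearest-loudspeaker routing of steps S1041 to S1043, under assumed two-dimensional cabin coordinates (all positions below are invented for the example):

```python
import math

# Assumed speaker and seat positions in a 2D cabin coordinate system.
SPEAKERS = {"front-left": (-0.5, 1.0), "front-right": (0.5, 1.0),
            "rear-left": (-0.5, -1.0), "rear-right": (0.5, -1.0)}
SEAT_POS = {"front-right": (0.45, 0.9), "rear-right": (0.45, -0.9)}

def nearest_speaker(seat: str) -> str:
    # S1042: pick the loudspeaker with the smallest distance to the user.
    return min(SPEAKERS, key=lambda s: math.dist(SPEAKERS[s], SEAT_POS[seat]))

def play_feedback(tagged_feedback: dict[str, str]) -> None:
    # S1043: the tag (here, the seat) routes each reply to its own speaker.
    for seat, text in tagged_feedback.items():
        print(f"[{nearest_speaker(seat)}] {text}")

play_feedback({"front-right": "You are near the city center.",
               "rear-right": "Air conditioner temperature lowered."})
```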
As shown in fig. 5, the present application provides a voice interaction apparatus, including:
the voice interaction user determining module 501 is configured to determine, when it is detected that the voice signal includes interaction information, a plurality of voice interaction users who send the voice signal according to a sound source position of the voice signal and the auxiliary information detected by the sensor.
A tag setting module 502, configured to set a tag for the interactive information in the voice signal, where the tag corresponds to a voice interaction user who sends the voice signal.
A feedback information generating module 503, configured to generate feedback information for the interaction information.
And a feedback information playing module 504, configured to play the feedback information to the voice interaction user corresponding to the tag.
In one embodiment, the sensor includes an image sensor;
as shown in fig. 6, the voice interaction user determination module 501 includes:
the user recognition sub-module 5011 is configured to recognize the image detected by the image sensor and confirm each user in the image.
The auxiliary information confirmation submodule 5012 is configured to obtain the probability that each user uttered the voice signal according to that user's facial features,
and to determine the probability that each user uttered the voice signal as the auxiliary information.
In one embodiment, the sensor comprises a seat sensor;
the assistance information confirmation submodule 5012 is also configured to: the information that the seat detected by the seat sensor is occupied is determined as the auxiliary information.
As shown in fig. 7, in an embodiment, the voice interaction user determination module 501 further includes:
the first probability determination submodule 5013 is configured to determine a first probability that a user located at each position is a voice interaction user according to a sound source position of the voice signal.
The second probability determining submodule 5014 is configured to determine, according to the auxiliary information, a second probability that the user located at each location is the voice interaction user.
The weighted sum calculation submodule 5015 is configured to calculate a weighted sum of the first probability and the second probability at each location by using the pre-assigned weights.
The voice interaction user determination execution sub-module 5016 is configured to determine the user located at the corresponding position as the voice interaction user who utters the voice signal if the weighted sum is greater than the predetermined threshold.
As shown in fig. 8, in an embodiment, the feedback information playing module 504 includes:
and the position determining submodule 5041 is used for determining the position of each voice interaction user.
And the loudspeaker determining submodule 5042 is used for determining the loudspeaker which is closest to each voice interaction user according to the position distribution of each loudspeaker.
And the feedback information playing execution submodule 5043 is configured to send the feedback information to the speaker closest to each voice interaction user for playing according to the label.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 910, memory 920, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). One processor 910 is illustrated in fig. 9.
The memory 920 is a non-transitory computer readable storage medium provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of voice interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of voice interaction provided herein.
The memory 920 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiments of the present application (for example, the voice interaction user determination module 501, the tag setting module 502, the feedback information generation module 503, and the feedback information playing module 504 shown in fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 920, the processor 910 executes the various functional applications and data processing of the server, i.e., implements the voice interaction method of the above method embodiments.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device according to the method of voice interaction, and the like. Further, the memory 920 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 920 may optionally include memory located remotely from the processor 910, which may be connected to the electronic device of the method of voice interaction via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further comprise: an input device 930 and an output device 940. The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, and fig. 9 illustrates an example of a connection by a bus.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the method of voice interaction, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 940 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of voice interaction, comprising:
under the condition that a voice signal is detected to contain interaction information, determining a plurality of voice interaction users sending the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for the interactive information in the voice signal, wherein the label corresponds to a voice interactive user sending the voice signal;
generating feedback information for the interaction information;
and playing the feedback information to the voice interaction user corresponding to the label.
2. The method of claim 1, wherein the sensor comprises an image sensor;
the determination mode of the auxiliary information comprises the following steps:
identifying the image detected by the image sensor and confirming each user in the image;
obtaining the probability of the voice signal sent by each user according to the facial features of each user;
and determining the probability of the voice signal sent by each user as the auxiliary information.
3. The method of claim 2, wherein the sensor comprises a seat sensor;
the determination method of the auxiliary information further comprises:
determining information that the seat detected by the seat sensor is occupied as the auxiliary information.
4. The method according to any one of claims 1 to 3, wherein the determining a plurality of voice interaction users who emit the voice signal according to the sound source position of the voice signal and the auxiliary information detected by the sensor comprises:
determining a first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal;
determining a second probability that the user at each position is the voice interaction user according to the auxiliary information;
calculating a weighted sum of the first probability and the second probability for each location using pre-assigned weights;
and in the case that the weighted sum is larger than a preset threshold value, determining the user at the corresponding position as the voice interaction user which sends the voice signal.
5. The method of claim 1, wherein playing the feedback information to the voice interaction user corresponding to the tag comprises:
determining a location of each of the voice interaction users;
determining the loudspeaker closest to each voice interaction user according to the position distribution of each loudspeaker;
and according to the labels, respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing.
6. An apparatus for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending the voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting a tag for the interactive information in the voice signal, wherein the tag corresponds to a voice interactive user sending the voice signal;
the feedback information generating module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the label.
7. The apparatus of claim 6, wherein the sensor comprises an image sensor;
the voice interaction user determination module comprises:
the user identification submodule is used for identifying the image detected by the image sensor and confirming each user in the image;
the auxiliary information confirming submodule is used for obtaining the probability of the voice signal sent by each user according to the facial features of each user;
and determining the probability of the voice signal sent by each user as the auxiliary information.
8. The apparatus of claim 7, wherein the sensor comprises a seat sensor;
the auxiliary information confirmation submodule is further configured to: determine information that the seat detected by the seat sensor is occupied as the auxiliary information.
9. The apparatus of any of claims 6 to 8, wherein the voice interaction user determination module further comprises:
a first probability determination submodule, configured to determine, according to a sound source position of the voice signal, a first probability that a user located at each position is the voice interaction user;
a second probability determination submodule, configured to determine, according to the auxiliary information, a second probability that a user located at each location is the voice interaction user;
a weighted sum calculation submodule for calculating a weighted sum of the first probability and the second probability for each position using a pre-assigned weight;
and the voice interaction user determination execution sub-module is used for determining the user at the corresponding position as the voice interaction user sending the voice signal under the condition that the weighted sum is greater than a preset threshold value.
10. The apparatus of claim 6, wherein the feedback information playing module comprises:
the position determining submodule is used for determining the position of each voice interaction user;
the loudspeaker determining submodule is used for determining the loudspeaker which is closest to each voice interaction user according to the position distribution of each loudspeaker;
and the feedback information playing execution submodule is used for respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the label.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN202010530888.5A 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium Active CN111694433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111694433A true CN111694433A (en) 2020-09-22
CN111694433B CN111694433B (en) 2023-06-20

Family

ID=72480499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010530888.5A Active CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111694433B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN112581981A (en) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265A (en) * 2021-12-22 2022-05-31 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
WO2023116502A1 (en) * 2021-12-23 2023-06-29 广州小鹏汽车科技有限公司 Speech interaction method and apparatus, and vehicle and storage medium
WO2023202635A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Voice interaction method, and electronic device and storage medium

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN105159111A (en) * 2015-08-24 2015-12-16 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and control system for intelligent interaction equipment
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20170025116A1 (en) * 2015-07-21 2017-01-26 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795A (en) * 2018-06-08 2018-11-23 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110047487A (en) * 2019-06-05 2019-07-23 广州小鹏汽车科技有限公司 Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment
CN110070868A (en) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interactive method, device, automobile and the machine readable media of onboard system
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20170025116A1 (en) * 2015-07-21 2017-01-26 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN105159111A (en) * 2015-08-24 2015-12-16 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and control system for intelligent interaction equipment
WO2017031860A1 (en) * 2015-08-24 2017-03-02 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and system for intelligent interaction device
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795A (en) * 2018-06-08 2018-11-23 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110070868A (en) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interactive method, device, automobile and the machine readable media of onboard system
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110047487A (en) * 2019-06-05 2019-07-23 广州小鹏汽车科技有限公司 Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581981A (en) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN112581981B (en) * 2020-11-04 2023-11-03 北京百度网讯科技有限公司 Man-machine interaction method, device, computer equipment and storage medium
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265A (en) * 2021-12-22 2022-05-31 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN114564265B (en) * 2021-12-22 2023-07-25 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
WO2023116502A1 (en) * 2021-12-23 2023-06-29 广州小鹏汽车科技有限公司 Speech interaction method and apparatus, and vehicle and storage medium
WO2023202635A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Voice interaction method, and electronic device and storage medium

Also Published As

Publication number Publication date
CN111694433B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN111694433A (en) Voice interaction method and device, electronic equipment and storage medium
CN105793921A (en) Initiating actions based on partial hotwords
US20190051302A1 (en) Technologies for contextual natural language generation in a vehicle
CN111541919B (en) Video frame transmission method and device, electronic equipment and readable storage medium
CN111591178B (en) Automobile seat adjusting method, device, equipment and storage medium
CN111968642A (en) Voice data processing method and device and intelligent vehicle
EP3916727B1 (en) Voice pickup method and apparatus for intelligent rearview mirror
US11670293B2 (en) Arbitrating between multiple potentially-responsive electronic devices
CN109817214B (en) Interaction method and device applied to vehicle
CN111383661B (en) Sound zone judgment method, device, equipment and medium based on vehicle-mounted multi-sound zone
CN112634890A (en) Method, apparatus, device and storage medium for waking up playing device
CN115793852A (en) Method for acquiring operation indication based on cabin area, display method and related equipment
CN111768759A (en) Method and apparatus for generating information
CN114038465B (en) Voice processing method and device and electronic equipment
CN113539265B (en) Control method, device, equipment and storage medium
CN113488043A (en) Passenger speaking detection method and device, electronic equipment and storage medium
CN115061762A (en) Page display method and device, electronic equipment and medium
CN115985309A (en) Voice recognition method and device, electronic equipment and storage medium
CN114035878A (en) Information display method, information display device, electronic equipment and storage medium
KR20190074344A (en) Dialogue processing apparatus and dialogue processing method
CN113276890A (en) Automatic control method and device for vehicle and vehicle
CN111724805A (en) Method and apparatus for processing information
EP4350693A2 (en) Voice processing method and apparatus, computer device, and storage medium
CN112951216A (en) Vehicle-mounted voice processing method and vehicle-mounted information entertainment system
CN114154491A (en) Interface skin updating method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211013

Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 2 / F, *** building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant