CN111694433A - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN111694433A
Authority
CN
China
Prior art keywords
user
voice
voice interaction
information
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010530888.5A
Other languages
Chinese (zh)
Other versions
CN111694433B (en)
Inventor
陈世伟 (Chen Shiwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010530888.5A priority Critical patent/CN111694433B/en
Publication of CN111694433A publication Critical patent/CN111694433A/en
Application granted granted Critical
Publication of CN111694433B publication Critical patent/CN111694433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice interaction method and apparatus, an electronic device, and a storage medium, relating to voice, natural language processing, and image processing technologies. The specific implementation scheme is as follows: when a voice signal is detected to contain interaction information, a plurality of voice interaction users who uttered voice signals are determined according to the sound source positions of the voice signals and auxiliary information detected by a sensor; a tag is set for the interaction information in each voice signal, the tag corresponding to the user who uttered that voice signal; feedback information for the interaction information is generated; and the feedback information is played to the voice interaction user corresponding to the tag. This solves the problem that multiple persons cannot perform voice interaction simultaneously, improves voice interaction efficiency when multiple persons are present, and also improves the intelligence of voice interaction.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a method and an apparatus for voice interaction, an electronic device, and a storage medium.
Background
In current vehicle-mounted voice systems on the market, only one passenger can perform voice interaction at a time. When other passengers in the vehicle also intend to interact by voice, a new voice interaction session can be started only after the previous voice interaction completes or after the system is woken up by voice again.
Disclosure of Invention
The application provides a voice interaction method, a voice interaction device, electronic equipment and a storage medium, and relates to the fields of voice technology, natural language processing, image processing and the like.
According to an aspect of the present application, there is provided a method of voice interaction, comprising the steps of:
under the condition that the voice signals are detected to contain interaction information, determining a plurality of voice interaction users sending the voice signals according to the sound source positions of the voice signals and auxiliary information detected by a sensor;
setting a label for interactive information in the voice signal, wherein the label corresponds to a voice interactive user sending the voice signal;
generating feedback information of the interaction information;
and playing feedback information to the voice interaction user corresponding to the label.
According to another aspect of the application, there is provided an apparatus for voice interaction, comprising the following components:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting tags for the interactive information in the voice signals, and the tags correspond to voice interactive users sending the voice signals;
the feedback information generating module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the label.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
The technology of the present application solves the problem that multiple persons cannot perform voice interaction simultaneously, improves voice interaction efficiency in multi-person scenarios, and also improves the intelligence of voice interaction.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a method of voice interaction according to a first embodiment of the present application;
FIG. 2 is a flow chart of assistance information determination according to a first embodiment of the present application;
FIG. 3 is a flow chart of voice interaction user determination according to a first embodiment of the present application;
fig. 4 is a flowchart of playing feedback information according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for voice interaction according to a second embodiment of the present application;
FIG. 6 is a schematic diagram of a voice interaction user determination module according to a second embodiment of the present application;
FIG. 7 is a schematic diagram of a voice interaction user determination module according to a second embodiment of the present application;
fig. 8 is a schematic diagram of a feedback information playing module according to a second embodiment of the present application;
FIG. 9 is a block diagram of an electronic device for implementing a method of voice interaction of an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, the present application provides a method of voice interaction, comprising the steps of:
s101: and under the condition that the voice signals contain interaction information, determining a plurality of voice interaction users sending the voice signals according to the sound source positions of the voice signals and the auxiliary information detected by the sensor.
S102: and setting a label for the interactive information in the voice signal, wherein the label corresponds to the voice interactive user sending the voice signal.
S103: and generating feedback information of the interaction information.
S104: and playing feedback information to the voice interaction user corresponding to the label.
The method can be applied to in-vehicle, meeting, or home scenarios, among others. Taking an in-vehicle scenario as an example, the method may be executed by an on-board computer. Assume the vehicle carries four occupants: a driver on the front left, a first passenger on the front right, a second passenger on the rear left, and a third passenger on the rear right.
The interaction information may be a wake-up word that initiates a voice interaction, or a statement with an explicit interaction intent. For example, statements with explicit interaction intent include: "turn the air conditioner temperature down a little", "open the window", "where is the current position", and the like.
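For illustration only, a naive detector for such interaction information might look like the following sketch; the wake word and intent patterns are invented placeholders, and a real system would rely on speech recognition and natural language understanding rather than string matching:

```python
import re

# Purely illustrative stand-in (not the patent's method): treat a wake word
# or a simple explicit-intent pattern as "interaction information".
WAKE_WORDS = {"hello car"}                                  # assumed wake word
INTENT_PATTERNS = [r"\bair condition", r"\bwindow\b", r"\bwhere\b"]

def contains_interaction_info(utterance: str) -> bool:
    u = utterance.lower().strip("?!. ")
    return u in WAKE_WORDS or any(re.search(p, u) for p in INTENT_PATTERNS)

print(contains_interaction_info("Where is the current position?"))  # True
print(contains_interaction_info("Nice weather today"))              # False
```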
When the voice signal is detected to contain interaction information, a confirmation procedure for the voice interaction user can be initiated. For example, sound source localization can be performed on the sound waves detected by a microphone array installed in the vehicle, yielding the approximate position of the sound source of the voice signal containing the interaction information.
The auxiliary information detected by a sensor may be, for example, information from an in-vehicle seat sensor indicating that a seat is occupied.
In addition, the auxiliary information detected by a sensor may also include which vehicle occupant is speaking, as detected from the (dynamic) image captured by an image sensor using image recognition techniques.
By jointly weighing the sound source position of the voice signal and the auxiliary information detected by the sensor, at least one voice interaction user who uttered the voice signal containing the interaction information can be determined.
For example, suppose two voice interaction users are determined: the first passenger on the front right and the third passenger on the rear right. A tag can then be set for each of them. The tag may be information relating to a seat or a position in the vehicle, etc. Once the tags are set, each piece of a user's interaction information carries the corresponding tag.
For each piece of interaction information, feedback information can be obtained locally or through cloud communication. For example, if the interaction information in the voice signal of the first passenger on the front right is "where is the current position", the vehicle's current position can be determined by the on-board GPS or by communicating with a satellite positioning server, and that position serves as the feedback information. If the interaction information in the voice signal of the third passenger on the rear right is "turn the air conditioner temperature down a little", a temperature control instruction can be generated directly to adjust the vehicle's air conditioning.
When the feedback information can be played aloud, it can be broadcast to the voice interaction user corresponding to the tag. For example, if the interaction information in the voice signal of the first passenger on the front right is "where is the current position", the tag identifies that passenger as the one who asked the question, so the loudspeaker closest to that passenger can be selected to play the feedback information.
This scheme enables multiple voice interaction users to interact simultaneously, improving voice interaction efficiency in multi-person scenarios as well as the intelligence of voice interaction.
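As a minimal, purely illustrative sketch (none of the names below come from the patent; the seat-tag scheme and all helpers are assumptions), steps S102 to S104 could be wired together as follows once step S101 has identified the speaking users:

```python
from dataclasses import dataclass

@dataclass
class InteractionEvent:
    seat: str   # the tag: which seat the interaction information came from (S102)
    text: str   # the interaction information itself

def generate_feedback(text: str) -> str:
    """S103: produce feedback locally or via the cloud (stubbed here)."""
    return f"(feedback for: {text})"

def play_to_nearest_speaker(feedback: str, seat: str) -> None:
    """S104: route feedback to the loudspeaker nearest the tagged seat (stubbed)."""
    print(f"[speaker near {seat}] {feedback}")

# S101 (sound-source localization fused with sensor auxiliary information)
# is assumed to have produced these tagged events already.
events = [InteractionEvent("front-right", "where is the current position"),
          InteractionEvent("rear-right", "turn the air conditioner down")]
for ev in events:
    play_to_nearest_speaker(generate_feedback(ev.text), ev.seat)
```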
As shown in fig. 2, in one embodiment, the sensor includes an image sensor, and the determining of the auxiliary information includes:
s201: the image detected by the image sensor is recognized, and each user in the image is confirmed.
S202: and obtaining the probability of the voice signal sent by each user according to the facial features of each user.
S203: and determining the probability of the voice signal sent by each user as the auxiliary information.
An image, which may be a dynamic image, is acquired from the image sensor. Each user in the dynamic image can be identified using face recognition. In an in-vehicle scenario, the users are the occupants. The probability that each user uttered the voice signal can then be obtained from that user's facial features, such as the frequency of mouth movements or the facial expression.
In addition, a probability threshold may be set in advance, and a user's speaking probability is kept as auxiliary information only when it exceeds this threshold. This reduces the amount of computation, and hence the time, needed in subsequent steps.
By this scheme, the image detected by the image sensor assists in recognizing the voice interaction user, which can improve recognition accuracy.
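A toy sketch of this step follows; the mapping from mouth-movement frequency to speaking probability and the threshold value are invented stand-ins, since the patent does not specify the model:

```python
# Illustrative only: speaking probability from a single facial feature
# (mouth-movement frequency in Hz). All numeric values are assumptions.
PROB_THRESHOLD = 0.5  # assumed pre-set probability threshold

def speaking_probability(mouth_move_hz: float) -> float:
    # Heuristic: more frequent mouth movement -> higher speaking probability,
    # saturating around 4 Hz (an assumption).
    return min(1.0, mouth_move_hz / 4.0)

def auxiliary_info(mouth_freqs: dict[str, float]) -> dict[str, float]:
    # Keep only users whose speaking probability exceeds the threshold,
    # mirroring the pre-filtering described above.
    probs = {seat: speaking_probability(hz) for seat, hz in mouth_freqs.items()}
    return {seat: p for seat, p in probs.items() if p > PROB_THRESHOLD}

print(auxiliary_info({"front-right": 3.4, "rear-right": 3.5, "rear-middle": 0.2}))
# -> {'front-right': 0.85, 'rear-right': 0.875}
```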
In one embodiment, the sensor comprises a seat sensor, and the determining of the auxiliary information further comprises:
the information that the seat detected by the seat sensor is occupied is determined as the auxiliary information.
The seat sensor may be a film-type contact sensor whose contacts are evenly distributed over the bearing surface of the vehicle seat. When the seat bears sufficient external weight, a trigger signal is generated. This trigger signal serves as the information that the seat is occupied, i.e., it is determined as the auxiliary information.
This scheme uses the seat-occupancy information detected by the seat sensor to assist in identifying the voice interaction user, which can improve identification accuracy.
As shown in fig. 3, in one embodiment, step S101 includes:
s1011: and determining a first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal.
S1012: and determining a second probability that the user positioned at each position is the voice interaction user according to the auxiliary information.
S1013: a weighted sum of the first probability and the second probability for each location is calculated using pre-assigned weights.
S1014: and in the case that the weighted sum is larger than the preset threshold value, determining the user positioned at the corresponding position as the voice interaction user sending the voice signal.
Sound source localization is performed using the sound waves detected by the microphone array, yielding the approximate position of the sound source of the voice signal containing the interaction information. For example, suppose a microphone array is installed at the center of the vehicle and the first passenger on the front right and the third passenger on the rear right speak simultaneously. Used as a directional array, the microphone array first divides the localization area into a grid; the relative sound pressure of each grid cell is obtained from the time delays of the received source signals, and a holographic color image for sound source localization is finally determined from these relative sound pressures. This holographic color image is fed to a pre-trained localization probability model, which outputs the probability that the sound source lies at each position. The localization probability model can be trained on holographic color image samples paired with sound source localization samples, so that the trained model can derive from a holographic color image the probability of the sound source being at each position. That is, a first probability that the user at each position is a voice interaction user is obtained.
Continuing the passenger example, the probability that the sound source is on the front right might be 95%, the probability that it is on the rear right 90%, and the probability that it is in the rear middle 40%, with a probability of 0 for every other seat.
In addition, a second probability that the user at each position is a voice interaction user can be determined from the image detected by the image sensor. Continuing the example, the dynamic image might indicate a probability of 85% that the front-right occupant is speaking and 88% that the rear-right occupant is speaking; these are the second probabilities for those two positions. Since the rear-middle seat is actually empty, the probability derived from the dynamic image that a voice interaction user is in the rear middle is 0, and since the other occupants are not speaking, their second probabilities derived from the dynamic image are negligible.
Furthermore, the seat sensor's occupancy information can also yield a second probability that the user at a position is a voice interaction user. In the example, with the front-right and rear-right seats detected as occupied, the second probability from the seat sensor is 100% for each of those two positions; since no occupant is in the rear middle, the corresponding probability there is 0.
Weights are assigned to the first and second probabilities in advance. For example, the first probability may be weighted more heavily than the second probabilities; likewise, the second probability derived from the dynamic image may be weighted more heavily than the one derived from the seat sensor.
The weighted sum E of the first and second probabilities for a given position is: E = q1*W1 + q2*W2 + q3*W3, where W1 is the first probability, determined from the sound source position of the voice signal, that the voice interaction user is at that position; W2 is the second probability determined from the auxiliary information detected in the dynamic image; W3 is the second probability determined from the auxiliary information detected by the seat sensor; and q1, q2, q3 are the weights of W1, W2, W3, respectively.
When the weighted sum E exceeds a predetermined threshold, the user at that position is determined to be a voice interaction user who uttered the voice signal. In the example above, since the weighted sums computed for the front-right and rear-right positions exceed the threshold, the users at those two positions are determined to be the voice interaction users who spoke.
With this scheme, voice interaction users are determined by combining the voice sound source position with the auxiliary information detected by the sensors, which improves the accuracy of the determination result.
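Putting the example figures together, a sketch of the weighted-sum decision of steps S1011 to S1014 might look as follows; the weights q1, q2, q3 and the decision threshold are assumed values that the patent leaves unspecified:

```python
# Sketch of S1011-S1014. W1: sound-source localization; W2: dynamic image;
# W3: seat sensor. Weights and threshold below are assumptions.
Q1, Q2, Q3 = 0.5, 0.3, 0.2     # assumed weights q1, q2, q3
THRESHOLD = 0.8                # assumed decision threshold

def weighted_sum(w1: float, w2: float, w3: float) -> float:
    return Q1 * w1 + Q2 * w2 + Q3 * w3

# Example figures from the text: (W1, W2, W3) per seat.
candidates = {
    "front-right": (0.95, 0.85, 1.0),
    "rear-right":  (0.90, 0.88, 1.0),
    "rear-middle": (0.40, 0.0,  0.0),
}
speakers = [seat for seat, (w1, w2, w3) in candidates.items()
            if weighted_sum(w1, w2, w3) > THRESHOLD]
print(speakers)  # -> ['front-right', 'rear-right']
```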
As shown in fig. 4, in one embodiment, step S104 includes:
s1041: the location of each voice interaction user is determined.
S1042: and determining the loudspeaker closest to each voice interaction user according to the position distribution of each loudspeaker.
S1043: and respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the label.
Once the voice interaction users are determined, each user's position is determined from the sound source position and the auxiliary information, for example the front right and rear right in the earlier passenger example.
From the speaker position distribution acquired in advance, the loudspeaker closest to each voice interaction user can be determined. This position distribution may come from position information entered by a user, from downloaded vehicle configuration information, or the like.
From the tags, the voice interaction users and their positions can be identified, so the feedback information can be sent to the loudspeaker closest to each voice interaction user for playback, realizing the voice interaction.
This scheme supports simultaneous interaction by multiple voice interaction users while reducing interference between them.
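A sketch of the nearest-loudspeaker routing of steps S1041 to S1043, under assumed two-dimensional cabin coordinates (all positions below are invented for the example):

```python
import math

# Assumed speaker and seat positions in a 2D cabin coordinate system.
SPEAKERS = {"front-left": (-0.5, 1.0), "front-right": (0.5, 1.0),
            "rear-left": (-0.5, -1.0), "rear-right": (0.5, -1.0)}
SEAT_POS = {"front-right": (0.45, 0.9), "rear-right": (0.45, -0.9)}

def nearest_speaker(seat: str) -> str:
    # S1042: pick the loudspeaker with the smallest distance to the user.
    return min(SPEAKERS, key=lambda s: math.dist(SPEAKERS[s], SEAT_POS[seat]))

def play_feedback(tagged_feedback: dict[str, str]) -> None:
    # S1043: the tag (here, the seat) routes each reply to its own speaker.
    for seat, text in tagged_feedback.items():
        print(f"[{nearest_speaker(seat)}] {text}")

play_feedback({"front-right": "You are near the city center.",
               "rear-right": "Air conditioner temperature lowered."})
```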
As shown in fig. 5, the present application provides a voice interaction apparatus, including:
the voice interaction user determining module 501 is configured to determine, when it is detected that the voice signal includes interaction information, a plurality of voice interaction users who send the voice signal according to a sound source position of the voice signal and the auxiliary information detected by the sensor.
A tag setting module 502, configured to set a tag for the interactive information in the voice signal, where the tag corresponds to a voice interaction user who sends the voice signal.
A feedback information generating module 503, configured to generate feedback information for the interaction information.
And a feedback information playing module 504, configured to play the feedback information to the voice interaction user corresponding to the tag.
In one embodiment, the sensor includes an image sensor;
as shown in fig. 6, the voice interaction user determination module 501 includes:
the user recognition sub-module 5011 is configured to recognize the image detected by the image sensor and confirm each user in the image.
The auxiliary information confirmation submodule 5012 is configured to obtain the probability that each user uttered the voice signal according to that user's facial features,
and to determine the probability that each user uttered the voice signal as the auxiliary information.
In one embodiment, the sensor comprises a seat sensor;
the assistance information confirmation submodule 5012 is also configured to: the information that the seat detected by the seat sensor is occupied is determined as the auxiliary information.
As shown in fig. 7, in an embodiment, the voice interaction user determination module 501 further includes:
the first probability determination submodule 5013 is configured to determine a first probability that a user located at each position is a voice interaction user according to a sound source position of the voice signal.
The second probability determining submodule 5014 is configured to determine, according to the auxiliary information, a second probability that the user located at each location is the voice interaction user.
The weighted sum calculation submodule 5015 is configured to calculate a weighted sum of the first probability and the second probability at each location by using the pre-assigned weights.
The voice interaction user determination execution sub-module 5016 is configured to determine the user located at the corresponding position as the voice interaction user who utters the voice signal if the weighted sum is greater than the predetermined threshold.
As shown in fig. 8, in an embodiment, the feedback information playing module 504 includes:
and the position determining submodule 5041 is used for determining the position of each voice interaction user.
And the loudspeaker determining submodule 5042 is used for determining the loudspeaker which is closest to each voice interaction user according to the position distribution of each loudspeaker.
And the feedback information playing execution submodule 5043 is configured to send the feedback information to the speaker closest to each voice interaction user for playing according to the label.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 910, memory 920, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). One processor 910 is illustrated in fig. 9.
The memory 920 is a non-transitory computer readable storage medium provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of voice interaction provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of voice interaction provided herein.
The memory 920 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiments of the present application (for example, the voice interaction user determination module 501, the tag setting module 502, the feedback information generation module 503, and the feedback information playing module 504 shown in fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 920, the processor 910 executes the various functional applications and data processing of the server, i.e., implements the voice interaction method of the above method embodiments.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device according to the method of voice interaction, and the like. Further, the memory 920 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 920 may optionally include memory located remotely from the processor 910, which may be connected to the electronic device of the method of voice interaction via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further comprise: an input device 930 and an output device 940. The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, and fig. 9 illustrates an example of a connection by a bus.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the method of voice interaction, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 940 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of voice interaction, comprising:
under the condition that a voice signal is detected to contain interaction information, determining a plurality of voice interaction users sending the voice signal according to the sound source position of the voice signal and auxiliary information detected by a sensor;
setting a label for the interactive information in the voice signal, wherein the label corresponds to a voice interactive user sending the voice signal;
generating feedback information for the interaction information;
and playing the feedback information to the voice interaction user corresponding to the label.
2. The method of claim 1, wherein the sensor comprises an image sensor;
the determination mode of the auxiliary information comprises the following steps:
identifying the image detected by the image sensor and confirming each user in the image;
obtaining the probability of the voice signal sent by each user according to the facial features of each user;
and determining the probability of the voice signal sent by each user as the auxiliary information.
3. The method of claim 2, wherein the sensor comprises a seat sensor;
the determination method of the auxiliary information further comprises:
determining information that the seat detected by the seat sensor is occupied as the auxiliary information.
4. The method according to any one of claims 1 to 3, wherein the determining a plurality of voice interaction users who emit the voice signal according to the sound source position of the voice signal and the auxiliary information detected by the sensor comprises:
determining a first probability that the user at each position is the voice interaction user according to the sound source position of the voice signal;
determining a second probability that the user at each position is the voice interaction user according to the auxiliary information;
calculating a weighted sum of the first probability and the second probability for each location using pre-assigned weights;
and in the case that the weighted sum is larger than a preset threshold value, determining the user at the corresponding position as the voice interaction user which sends the voice signal.
5. The method of claim 1, wherein playing the feedback information to the voice interaction user corresponding to the tag comprises:
determining a location of each of the voice interaction users;
determining the loudspeaker closest to each voice interaction user according to the position distribution of each loudspeaker;
and according to the labels, respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing.
6. An apparatus for voice interaction, comprising:
the voice interaction user determining module is used for determining a plurality of voice interaction users sending the voice signals according to the sound source position of the voice signals and the auxiliary information detected by the sensor under the condition that the voice signals contain interaction information;
the tag setting module is used for setting a tag for the interactive information in the voice signal, wherein the tag corresponds to a voice interactive user sending the voice signal;
the feedback information generating module is used for generating feedback information of the interaction information;
and the feedback information playing module is used for playing the feedback information to the voice interaction user corresponding to the label.
7. The apparatus of claim 6, wherein the sensor comprises an image sensor;
the voice interaction user determination module comprises:
the user identification submodule is used for identifying the image detected by the image sensor and confirming each user in the image;
the auxiliary information confirming submodule is used for obtaining the probability of the voice signal sent by each user according to the facial features of each user;
and determining the probability of the voice signal sent by each user as the auxiliary information.
8. The apparatus of claim 7, wherein the sensor comprises a seat sensor;
the auxiliary information confirmation submodule is further configured to: determine information that the seat detected by the seat sensor is occupied as the auxiliary information.
9. The apparatus of any of claims 6 to 8, wherein the voice interaction user determination module further comprises:
a first probability determination submodule, configured to determine, according to a sound source position of the voice signal, a first probability that a user located at each position is the voice interaction user;
a second probability determination submodule, configured to determine, according to the auxiliary information, a second probability that a user located at each location is the voice interaction user;
a weighted sum calculation submodule for calculating a weighted sum of the first probability and the second probability for each position using a pre-assigned weight;
and the voice interaction user determination execution sub-module is used for determining the user at the corresponding position as the voice interaction user sending the voice signal under the condition that the weighted sum is greater than a preset threshold value.
10. The apparatus of claim 6, wherein the feedback information playing module comprises:
the position determining submodule is used for determining the position of each voice interaction user;
the loudspeaker determining submodule is used for determining the loudspeaker which is closest to each voice interaction user according to the position distribution of each loudspeaker;
and the feedback information playing execution submodule is used for respectively sending the feedback information to the loudspeaker closest to each voice interaction user for playing according to the label.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN202010530888.5A 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium Active CN111694433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010530888.5A CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111694433A true CN111694433A (en) 2020-09-22
CN111694433B CN111694433B (en) 2023-06-20

Family

ID=72480499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010530888.5A Active CN111694433B (en) 2020-06-11 2020-06-11 Voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111694433B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN112581981A (en) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265A (en) * 2021-12-22 2022-05-31 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
WO2023116502A1 (en) * 2021-12-23 2023-06-29 广州小鹏汽车科技有限公司 Speech interaction method and apparatus, and vehicle and storage medium
WO2023202635A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Voice interaction method, and electronic device and storage medium

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN105159111A (en) * 2015-08-24 2015-12-16 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and control system for intelligent interaction equipment
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20170025116A1 (en) * 2015-07-21 2017-01-26 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795A (en) * 2018-06-08 2018-11-23 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110047487A (en) * 2019-06-05 2019-07-23 广州小鹏汽车科技有限公司 Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment
CN110070868A (en) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interactive method, device, automobile and the machine readable media of onboard system
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126583B1 (en) * 1999-12-15 2006-10-24 Automotive Technologies International, Inc. Interactive vehicle display system
US8560236B1 (en) * 2008-06-20 2013-10-15 Google Inc. Showing uncertainty of location
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20170025116A1 (en) * 2015-07-21 2017-01-26 Rovi Guides, Inc. Systems and methods for identifying content corresponding to a language spoken in a household
CN105159111A (en) * 2015-08-24 2015-12-16 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and control system for intelligent interaction equipment
WO2017031860A1 (en) * 2015-08-24 2017-03-02 百度在线网络技术(北京)有限公司 Artificial intelligence-based control method and system for intelligent interaction device
CN109493871A (en) * 2017-09-11 2019-03-19 上海博泰悦臻网络技术服务有限公司 The multi-screen voice interactive method and device of onboard system, storage medium and vehicle device
CN108399916A (en) * 2018-01-08 2018-08-14 蔚来汽车有限公司 Vehicle intelligent voice interactive system and method, processing unit and storage device
CN108877795A (en) * 2018-06-08 2018-11-23 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN109147787A (en) * 2018-09-30 2019-01-04 深圳北极鸥半导体有限公司 A kind of smart television acoustic control identifying system and its recognition methods
CN109490834A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of sound localization method, sound source locating device and vehicle
CN109493876A (en) * 2018-10-17 2019-03-19 北京车和家信息技术有限公司 A kind of microphone array control method, device and vehicle
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium
CN109782231A (en) * 2019-01-17 2019-05-21 北京大学 A kind of end-to-end sound localization method and system based on multi-task learning
CN110070868A (en) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interactive method, device, automobile and the machine readable media of onboard system
CN110082723A (en) * 2019-05-16 2019-08-02 浙江大华技术股份有限公司 A kind of sound localization method, device, equipment and storage medium
CN110047487A (en) * 2019-06-05 2019-07-23 广州小鹏汽车科技有限公司 Awakening method, device, vehicle and the machine readable media of vehicle-mounted voice equipment
CN110211585A (en) * 2019-06-05 2019-09-06 广州小鹏汽车科技有限公司 In-car entertainment interactive approach, device, vehicle and machine readable media
CN110364153A (en) * 2019-07-30 2019-10-22 恒大智慧科技有限公司 A kind of distributed sound control method, system, computer equipment and storage medium
CN112965033A (en) * 2021-02-03 2021-06-15 深圳市轻生活科技有限公司 Sound source positioning system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581981A (en) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN112581981B (en) * 2020-11-04 2023-11-03 北京百度网讯科技有限公司 Man-machine interaction method, device, computer equipment and storage medium
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
CN113362823A (en) * 2021-06-08 2021-09-07 深圳市同行者科技有限公司 Multi-terminal response method, device, equipment and storage medium of household appliance
CN113407758A (en) * 2021-07-13 2021-09-17 中国第一汽车股份有限公司 Data processing method and device, electronic equipment and storage medium
CN114564265A (en) * 2021-12-22 2022-05-31 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
CN114564265B (en) * 2021-12-22 2023-07-25 上海小度技术有限公司 Interaction method and device of intelligent equipment with screen and electronic equipment
WO2023116502A1 (en) * 2021-12-23 2023-06-29 广州小鹏汽车科技有限公司 Speech interaction method and apparatus, and vehicle and storage medium
WO2023202635A1 (en) * 2022-04-22 2023-10-26 华为技术有限公司 Voice interaction method, and electronic device and storage medium

Also Published As

Publication number Publication date
CN111694433B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN111694433A (en) Voice interaction method and device, electronic equipment and storage medium
CN105793921A (en) Initiating actions based on partial hotwords
US20190051302A1 (en) Technologies for contextual natural language generation in a vehicle
CN111541919B (en) Video frame transmission method and device, electronic equipment and readable storage medium
CN111591178B (en) Automobile seat adjusting method, device, equipment and storage medium
CN111968642A (en) Voice data processing method and device and intelligent vehicle
EP3916727B1 (en) Voice pickup method and apparatus for intelligent rearview mirror
US11670293B2 (en) Arbitrating between multiple potentially-responsive electronic devices
CN109817214B (en) Interaction method and device applied to vehicle
CN111383661B (en) Sound zone judgment method, device, equipment and medium based on vehicle-mounted multi-sound zone
CN112634890A (en) Method, apparatus, device and storage medium for waking up playing device
CN115793852A (en) Method for acquiring operation indication based on cabin area, display method and related equipment
CN111768759A (en) Method and apparatus for generating information
CN114038465B (en) Voice processing method and device and electronic equipment
CN113539265B (en) Control method, device, equipment and storage medium
CN113488043A (en) Passenger speaking detection method and device, electronic equipment and storage medium
CN115061762A (en) Page display method and device, electronic equipment and medium
CN115985309A (en) Voice recognition method and device, electronic equipment and storage medium
CN114035878A (en) Information display method, information display device, electronic equipment and storage medium
KR20190074344A (en) Dialogue processing apparatus and dialogue processing method
CN113276890A (en) Automatic control method and device for vehicle and vehicle
CN111724805A (en) Method and apparatus for processing information
EP4350693A2 (en) Voice processing method and apparatus, computer device, and storage medium
CN112951216A (en) Vehicle-mounted voice processing method and vehicle-mounted information entertainment system
CN114154491A (en) Interface skin updating method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211013

Address after: 100176 Room 101, 1st floor, building 1, yard 7, Ruihe West 2nd Road, economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.

Address before: 2 / F, *** building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant