CN113470649A - Voice interaction method and device

Voice interaction method and device

Info

Publication number
CN113470649A
CN113470649A (application CN202110948740.8A)
Authority
CN
China
Prior art keywords: user, candidate, candidate objects, enhancement information, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110948740.8A
Other languages
Chinese (zh)
Inventor
刘明
申立明
柳艳
马嘉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center and Samsung Electronics Co Ltd
Priority to CN202110948740.8A
Publication of CN113470649A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice interaction method and apparatus are disclosed. A voice interaction method is provided, which comprises: detecting a first user voice; displaying a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the candidate objects according to the detected first user voice; determining a target object from the plurality of candidate objects based on object enhancement information corresponding to the plurality of candidate objects.

Description

Voice interaction method and device
Technical Field
The present invention relates generally to the field of voice interaction technology, and more particularly, to a voice interaction method and apparatus.
Background
At present, with the popularization and development of smart TVs, the voice interaction function has become one of their essential features, and it has also been integrated into many products such as smart cars, smartphones, smart speakers, and virtual display devices. People can conveniently retrieve content by voice search, for example to find favorite songs or movies. Current voice interaction modes fall mainly into two types. One is the near-field scheme, which starts or ends a dialog through a dedicated voice key: the user presses the key to start the dialog and releases it to end the dialog, similar to talking over an interphone. The other is the far-field scheme, in which the user opens a dialog with a specific wake-up word; a session may contain multiple voice interactions, and the end of the session is identified by a specific action (such as a confirmation) or by a session timeout.
Traditional voice interaction uses voice buttons or wake-up words to turn a voice conversation on and off. On devices without a touch screen, or with AR virtual screens, uncertainty in recognizing the user's intent and in identifying the target entity leads to cumbersome voice prompts. In addition, traditional voice interaction suffers from fixed sentence patterns and is easily interrupted by external sound. In the voice interaction method described here, the user can carry out human-computer interaction through heuristic information presented by the system, which improves the naturalness of the interaction. By displaying visible or invisible enhancement information (HOAI) associated with the objects to be selected during the interaction, the method helps the user give efficient feedback on voice intent, removes speech and noise unrelated to the current topic, avoids repeated voice interaction sessions, improves interaction efficiency, and improves the user experience.
The above information is presented as background only to aid in understanding the present application. No determination has been made, and no assertion is made, as to whether any of the above might be applicable as prior art with regard to the present application.
Disclosure of Invention
Aspects of the present application address at least the above problems and/or disadvantages and provide at least the advantages described below.
Exemplary embodiments of the present application provide a voice interaction method and a voice interaction apparatus. According to the method and apparatus, when an interaction object is selected on any of various types of display interfaces during voice interaction, the dialog is managed flexibly by displaying object enhancement information associated with the interaction objects, thereby achieving near-natural voice human-machine interaction.
According to an exemplary embodiment of the present application, there is provided a voice interaction method including: detecting a first user voice; displaying a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the candidate objects according to the detected first user voice; determining a target object from the plurality of candidate objects based on object enhancement information corresponding to the plurality of candidate objects.
Optionally, displaying, according to the detected first user voice, a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the plurality of candidate objects includes: determining a plurality of candidate objects associated with the first user voice according to the detected first user voice; generating object enhancement information corresponding to the plurality of candidate objects, wherein generating the object enhancement information corresponding to the plurality of candidate objects comprises: context information of the plurality of candidate objects is acquired, and object enhancement information corresponding to the plurality of candidate objects is generated based on characteristics and the context information of the plurality of candidate objects.
Optionally, determining the target object from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects comprises: receiving a user input or a second user voice; determining a candidate object having object enhancement information matching the user input or the second user voice among the plurality of candidate objects as a target object.
Optionally, determining the target object from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects comprises: receiving a user input or a second user voice; when a plurality of candidate target objects are determined from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects and the user input or the second user voice, the object enhancement information corresponding to the plurality of candidate target objects is updated based on the characteristics and the context information of the plurality of candidate target objects, and the target object is determined based on the updated object enhancement information.
Optionally, the method further comprises: recognizing a user intention based on the detected first user voice; and filtering out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention.
Optionally, the method further comprises: receiving a third user voice; and when the intention of the third user voice is different from the recognized user intention, re-determining the candidate objects according to the third user voice and generating object enhancement information corresponding to the re-determined candidate objects.
According to an exemplary embodiment of the present application, there is provided a voice interaction apparatus including: a detection module that detects a first user voice; a control module that displays, according to the detected first user voice, a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the candidate objects; and a determination module that determines a target object from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects.
Optionally, the control module is configured to: determining a plurality of candidate objects associated with the first user voice according to the detected first user voice; generating object enhancement information corresponding to the plurality of candidate objects, wherein generating the object enhancement information corresponding to the plurality of candidate objects comprises: context information of the plurality of candidate objects is acquired, and object enhancement information corresponding to the plurality of candidate objects is generated based on characteristics and the context information of the plurality of candidate objects.
Optionally, the determining module is configured to: acquiring user input or second user voice received by the detection module; determining a candidate object having object enhancement information matching the user input or the second user voice among the plurality of candidate objects as a target object.
Optionally, the determining module is configured to: acquiring a user input or a second user voice received by the detection module; and, when a plurality of candidate target objects are determined from the plurality of candidate objects based on the object enhancement information and the user input or second user voice, acquiring updated object enhancement information, obtained by the control module updating the object enhancement information corresponding to the candidate target objects based on their characteristics and context information, and determining the target object based on the updated object enhancement information.
Optionally, the control module is further configured to: recognizing a user intention based on the detected first user voice; and filtering out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention.
Optionally, the control module is further configured to: acquiring a third user voice received by the detection module; and, when the intention of the third user voice is different from the recognized user intention, re-determining the candidate objects according to the third user voice and generating object enhancement information corresponding to the re-determined candidate objects.
According to another exemplary embodiment of the present application, a computer-readable storage medium storing instructions is provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the voice interaction method as described above.
According to another exemplary embodiment of the present application, there is provided a computing apparatus including: a processor; a memory storing a computer program which, when executed by the processor, implements the voice interaction method as described above.
The voice interaction method and apparatus according to the exemplary embodiments of the present application are heuristic. By heuristically presenting content based on visible or invisible enhancement information associated with the interaction objects, the method or apparatus can keep the session effectively tied to the topic content, avoid disturbance from external speech, and adaptively manage voice commands during the interactive session, thereby handling the problem of single-target selection in voice interaction. The method and apparatus can give the user an experience close to natural voice interaction and avoid the barriers to voice-interaction adoption and convenience that arise when limited voice samples provide insufficient user guidance.
Additional aspects will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the embodiments of the application.
Drawings
The above and other aspects, features and advantages of particular embodiments of the present application will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:
fig. 1 (a), (b) and (c) illustrate conceptual diagrams of a voice interaction method according to an exemplary embodiment of the present application;
FIG. 2 shows a flow diagram of a voice interaction method according to an example embodiment of the present application;
FIG. 3 shows a general system architecture diagram according to an exemplary embodiment of the present application;
FIG. 4 shows a diagram of module dependencies, according to an example embodiment of the present application;
FIG. 5 shows a flow diagram of a module implementation according to an example embodiment of the present application;
fig. 6 (a), (b) are diagrams illustrating an embodiment to which a voice interaction method of the present application is applied, according to an exemplary embodiment of the present application;
fig. 7 (a), (b) are diagrams illustrating an embodiment in an AR mode to which a voice interaction method of the present application is applied, according to an exemplary embodiment of the present application;
FIG. 8 shows a block diagram of a voice interaction device, according to an example embodiment of the present application.
Detailed Description
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related art or configurations are omitted when it is determined that they may unnecessarily obscure the gist of the present invention. Further, the following embodiments may be modified in various different forms, and the scope of the technical idea of the present invention is not limited to them. These embodiments are provided so that the disclosure will be complete and will fully convey its technical concept to those skilled in the art.
The terms and words used in the following description and claims are not limited to a literal meaning, but are used only by the applicants to enable a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of the various embodiments of the present disclosure is provided for illustration only and not for the purpose of limiting the disclosure as defined by the claims and their equivalents.
The terminology used herein is for the purpose of describing various embodiments of the application only and is not intended to be limiting of the application. The singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In the present application, the terms "comprises" or "comprising" indicate the presence of the features, numbers, steps, operations, structural elements, components or combinations thereof, and do not exclude the possibility of the presence or addition of one or more other features, numbers, steps, operations, structural elements, components or combinations thereof.
In the specification, the term "A or B", "at least one of A or/and B", or "one or more of A or/and B" may include all possible combinations of the items listed together. For example, the term "A or B" or "at least one of A or/and B" may specify (1) at least one A, (2) at least one B, or (3) both at least one A and at least one B.
Herein, the expression "configured to" may be used interchangeably with, for example, "suitable for", "having the capability of", "designed to", "adapted to", "manufactured to", or "capable of". The expression "configured to" does not necessarily mean "specially designed" in a hardware sense. Instead, in some cases, "a device configured to" may indicate that such a device can perform an operation together with another device or component. For example, the expression "a processor configured to perform A, B and C" may indicate a dedicated processor (e.g., an embedded processor) that performs the corresponding operations, or a general-purpose processor (e.g., a Central Processing Unit (CPU) or an Application Processor (AP)) that can perform the corresponding operations by executing one or more software programs stored in a memory device.
An electronic device according to an embodiment of the present disclosure may be implemented as a smartphone. Furthermore, the electronic device may be implemented as a mobile phone, a Personal Digital Assistant (PDA), a tablet Personal Computer (PC), a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a Portable Multimedia Player (PMP), a Moving Picture Experts Group phase 1 or phase 2 (MPEG-1 or MPEG-2) audio layer 3 (MP3) player, a mobile medical instrument, a camera, an Internet of Things (IoT) device, or a wearable device.
Fig. 1 (a), (b) and (c) illustrate conceptual diagrams of a voice interaction method according to an exemplary embodiment of the present application. Referring to fig. 1 (a), at the beginning of a user voice session, all visible or invisible objects may be the interactive objects. Referring to (b) and (c) of fig. 1, the system gradually determines a target interactive object from a plurality of visible or invisible objects according to the detected user voice. For example, the voice interaction method according to the present application determines the target object presented in (c) of fig. 1 from among the plurality of objects in (a) of fig. 1. A voice interaction method according to an exemplary embodiment of the present application will be described in detail below with reference to fig. 2.
Fig. 2 shows a flow chart of a voice interaction method according to an exemplary embodiment of the present application. The method may be implemented by a computer program. For example, the method may be performed by an application installed in the electronic device. As an example, the electronic device may be a mobile communication terminal (e.g., a smart phone), a multimedia playing device, a smart wearable apparatus (e.g., a smart watch), or the like capable of recommending content to a user.
Referring to fig. 2, the method includes: in step S201, detecting a first user voice. The user voice may be voice information for interacting with an electronic device, such as a television, smartphone, smart speaker, or air conditioner. An electronic device for voice interaction with a user according to an exemplary embodiment of the present application is capable of receiving a user voice from the user and responding to the received user voice.
After the first user voice is detected, the method includes: in step S202, displaying a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the plurality of candidate objects according to the detected first user voice. According to an exemplary embodiment of the present application, while the user is selecting, the system may dynamically display the candidate objects according to the user's voice and attach object enhancement information (HOAI) to them; the HOAI serves to help the user recognize the differences among the candidate objects. Displaying the object enhancement information together with the candidate objects makes it convenient for the user to continue the conversation and to clarify which of the candidate objects is intended. For example, when the first user voice (or user voice information) asks for a movie, a plurality of movies are displayed together with the name and airing time of each movie, according to the detected user voice information.
According to the voice interaction method, displaying the plurality of candidate objects associated with the first user voice and the object enhancement information corresponding to the candidate objects according to the detected first user voice comprises: determining a plurality of candidate objects associated with the first user voice according to the detected first user voice; and generating object enhancement information corresponding to the plurality of candidate objects, which in turn comprises acquiring context information of the plurality of candidate objects and generating the object enhancement information based on the characteristics and the context information of the plurality of candidate objects. According to an exemplary embodiment of the present application, the context information of an object may be information obtained from a server, such as the location of the object on the server, but is not limited thereto. A characteristic of an object may be a physical property of the object itself, the name of the object, the type of the object, a feature set by the user, and so on, but is not limited thereto; for example, it may be the creation time of the object. A method of generating object enhancement information corresponding to a plurality of candidate objects is described below with reference to fig. 4; a simplified sketch follows.
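To make the generation step concrete, here is a minimal Python sketch of how enhancement information could be derived from candidate characteristics and context. The `Candidate` dataclass and the `generate_hoai` helper are illustrative assumptions, not the patent's actual implementation (which, as described with reference to fig. 5, uses a GNN over a context graph); the only idea carried over is that attributes shared by every candidate cannot tell candidates apart and are therefore dropped.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """A selectable object with its characteristics and context information."""
    name: str                                            # e.g. a movie title
    characteristics: dict = field(default_factory=dict)  # intrinsic features
    context: dict = field(default_factory=dict)          # e.g. metadata from a server

def generate_hoai(candidates):
    """Return (candidate, enhancement_text) pairs.

    Only attributes that differ somewhere in the candidate set can help the
    user tell the candidates apart, so attributes shared by all are dropped.
    """
    merged = [c.characteristics | c.context for c in candidates]  # Python 3.9+ dict union
    texts = []
    for attrs in merged:
        distinctive = {
            k: v for k, v in attrs.items()
            if any(other.get(k) != v for other in merged if other is not attrs)
        }
        texts.append(", ".join(f"{k}: {v}" for k, v in sorted(distinctive.items())))
    return list(zip(candidates, texts))

movies = [
    Candidate("The Crown", {"season": 1}, {"air_year": "2016"}),
    Candidate("The Crown", {"season": 2}, {"air_year": "2017"}),
]
for cand, text in generate_hoai(movies):
    print(cand.name, "->", text)  # e.g. "The Crown -> air_year: 2017, season: 2"
```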
After the object enhancement information corresponding to the plurality of candidate objects is generated, the method according to an exemplary embodiment of the present application further includes: in step S203, determining a target object from the plurality of candidate objects based on the object enhancement information corresponding to them. The object enhancement information displays, for each candidate object, a characteristic that distinguishes it from the other candidate objects, so that the target object can be determined based on each candidate's characteristics. However, according to embodiments of the present application, the characteristics of several candidate objects may be identical; in that case, further user input or user speech is required to narrow the range of candidates.
According to the voice interaction method of the exemplary embodiment of the present application, determining the target object from the plurality of candidate objects based on the corresponding object enhancement information includes: receiving a user input or a second user voice; and determining, as the target object, the candidate object whose object enhancement information matches the user input or the second user voice. That is, if the user input or the second user voice is consistent with the object enhancement information of exactly one candidate object, that candidate object is determined as the target object.
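A hedged sketch of this matching step: the system described with reference to fig. 5 decodes the match with a sequence decoder, but a simple token-overlap score is enough to illustrate what "object enhancement information matching the user input or the second user voice" means. `match_target` is a hypothetical helper built on `generate_hoai` above.

```python
def match_target(candidates_with_hoai, utterance):
    """Return the unique candidate whose enhancement text best matches the
    follow-up utterance; None signals zero or several matches, i.e. the
    candidate range still has to be narrowed."""
    words = set(utterance.lower().replace(",", " ").split())

    def overlap(text):
        tokens = set(text.lower().replace(":", " ").replace(",", " ").split())
        return len(words & tokens)

    scores = [(overlap(text), cand) for cand, text in candidates_with_hoai]
    if not scores:
        return None
    best = max(score for score, _ in scores)
    winners = [cand for score, cand in scores if score == best]
    # a unique best match with a non-zero score determines the target object
    return winners[0] if best > 0 and len(winners) == 1 else None
```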
Further, according to the voice interaction method of the exemplary embodiment of the present application, determining the target object from the plurality of candidate objects based on the corresponding object enhancement information includes: receiving a user input or a second user voice; and, when a plurality of candidate target objects are determined from the candidate objects based on the object enhancement information and the user input or second user voice, updating the object enhancement information corresponding to these candidate target objects based on their characteristics and context information, and determining the target object based on the updated object enhancement information. That is, the user input or the second user voice may be consistent with several of the candidate objects; in that case, the range of objects to be selected must be narrowed further, and the target object determined within the narrowed range, for example by generating object enhancement information for the narrowed set and selecting the target object from it based on that information. In particular, when the object enhancement information generated for the candidate objects from the first user voice is identical across candidates and therefore insufficient to determine the target object, the voice interaction method of the present application determines new candidate objects, continues to detect new user voice, and determines the target object based on the new user voice information and the object enhancement information corresponding to the new candidate objects. According to an exemplary embodiment of the present application, when the target object still cannot be determined, the above process of generating object enhancement information and receiving user input or user voice may be repeated until a target object for the voice interaction is determined, as in the loop sketched below.
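The narrow-and-retry behaviour can be written as a loop. This is a sketch under the same assumptions as the helpers above; `detect_voice` stands in for whatever the detection mechanism provides, and the display step is left as a comment.

```python
def select_target(candidates, detect_voice):
    """Repeat HOAI generation and matching until a single target remains."""
    while True:
        if len(candidates) == 1:
            return candidates[0]
        with_hoai = generate_hoai(candidates)
        # ... display the candidates and their enhancement information here ...
        utterance = detect_voice()
        target = match_target(with_hoai, utterance)
        if target is not None:
            return target
        # several candidates still match: keep only those consistent with the
        # utterance, then regenerate HOAI for the narrowed set and try again
        words = set(utterance.lower().replace(",", " ").split())
        narrowed = [
            cand for cand, text in with_hoai
            if words & set(text.lower().replace(":", " ").replace(",", " ").split())
        ]
        candidates = narrowed or candidates  # no overlap at all: keep the full set
```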
The voice interaction method according to an exemplary embodiment of the present application further includes: recognizing a user intention based on the detected first user voice; and filtering out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention. By excluding voice information irrelevant to the current interaction, the method manages the voice session period effectively and builds a continuous, near-natural voice interaction experience.
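A rough illustration of the filtering idea: an utterance is kept only when its vocabulary touches the recognized intent or some candidate's enhancement information, and everything else is treated as off-topic speech or noise. This is a stand-in for the DDUD-based filtering described with reference to fig. 5, not the patent's actual model.

```python
def is_relevant(utterance, recognized_intent, candidates_with_hoai):
    """Topic gate: True if the utterance overlaps the current session topic."""
    words = set(utterance.lower().split())
    topic = set(recognized_intent.lower().split())
    for cand, text in candidates_with_hoai:
        topic |= set(cand.name.lower().split())
        topic |= set(text.lower().replace(":", " ").replace(",", " ").split())
    return bool(words & topic)  # False -> discard as extraneous speech or noise
```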
The voice interaction method further comprises: receiving a third user voice and, when the intention of the third user voice differs from the recognized user intention, re-determining the candidate objects according to the third user voice and generating object enhancement information corresponding to the re-determined candidate objects. That is, when receiving user speech whose intention differs from the current one, the method of the present application can depart from the intent context of the current topic and generate new object enhancement information (HOAI) based on the newly input user speech, thereby supporting cross-domain voice session management and improving the multi-turn conversation experience. For example, if the first user voice is "How is the weather in Nanjing tomorrow", the user intention can be determined to be discussing the weather, so the method may display some weather applications or the weather conditions in various parts of Nanjing for the user to select. If the user then continues with "Search for flight tickets to Nanjing tomorrow", the method departs from the "weather" topic and displays flight-booking or travel applications to the user, or directly displays Nanjing's scenic spots with the location, features, etc. of each spot.
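The cross-domain behaviour could look like the following turn handler. `recognize_intent` and `retrieve_candidates` are caller-supplied, hypothetical back-end calls (the patent realizes the corresponding functions through the ASR/LM modules and the context extractor); the point is only that an intent change rebuilds the candidate set instead of forcing a new session.

```python
def handle_turn(utterance, session, recognize_intent, retrieve_candidates):
    """On an intent switch, leave the current topic and re-determine the
    candidate objects instead of ending the session."""
    new_intent = recognize_intent(utterance)      # hypothetical, e.g. "flight_search"
    if new_intent != session["intent"]:           # cross-domain switch detected
        session["intent"] = new_intent
        # hypothetical retrieval for the new topic (weather -> flights, etc.)
        session["candidates"] = retrieve_candidates(new_intent, utterance)
    session["hoai"] = generate_hoai(session["candidates"])
    return session
```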
The voice interaction method according to the exemplary embodiment of the present application determines the target object based on heuristic enhancement information (i.e., object enhancement information) of the objects to be selected. When the target-selection session ends, the selected target object can undergo the usual follow-up operations of existing systems, such as deletion, playback, or modification.
Fig. 3 illustrates a general system structure diagram according to an exemplary embodiment of the present application.
Referring to fig. 3, a general system according to an exemplary embodiment of the present application includes a User Interface (UI), a UI recommendation layer, a voice input module, a context extractor module, a visual enhancement module, and a session control module. In the present application, the voice input module acquires the user's voice information. The context extractor module may receive data from a server, i.e., use information from the server. The visual enhancement module generates object enhancement information based on the characteristics and context information of the objects to be selected. The objects to be selected and their corresponding object enhancement information are displayed through the user interface. The session control part is the core module. The system's interface display can be conveniently integrated into various existing platforms. The context information corresponding to an interaction object can be obtained through server access, or supported by existing data stored in the system.
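One way to read fig. 3 in code is the wiring sketch below. The collaborator interfaces (`fetch`, `generate`, `show`, `listen`) are assumptions made for illustration; only the division of responsibilities comes from the figure.

```python
class HeuristicVoiceSystem:
    """Session control as the core: it pulls context, requests HOAI, and
    pushes both candidates and HOAI to the user interface."""

    def __init__(self, voice_input, context_extractor, visual_enhancer, ui):
        self.voice_input = voice_input              # acquires user voice information
        self.context_extractor = context_extractor  # receives data from the server
        self.visual_enhancer = visual_enhancer      # generates enhancement information
        self.ui = ui                                # displays candidates + HOAI

    def run_turn(self, candidates):
        context = self.context_extractor.fetch(candidates)
        hoai = self.visual_enhancer.generate(candidates, context)
        self.ui.show(candidates, hoai)
        return self.voice_input.listen()            # next utterance for session control
```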
FIG. 4 shows a diagram of module dependencies, according to an example embodiment of the present application. As shown in fig. 4, a heuristic voice interaction method according to an exemplary embodiment of the present application can determine candidate objects and their object enhancement information (i.e., enhanced text sequences) based on information obtained from a context extractor and a user's voice input. A flow diagram of a module implementation is described below with reference to fig. 5.
FIG. 5 shows a flow diagram of a module implementation according to an example embodiment of the present application.
Referring to fig. 5, a context graph is used to load and generate the context data (context information), enhancement information data, etc. of the interactive objects; this process manages the data in a class-diagram style to facilitate retrieval. Taking the context data as input, a GNN generates the current vectorized data of the candidates. The generated vectors pass through a graph embedding module, are combined with graph-node attention, and are finally fed into a sequence decoder (Sequence2Sequence decoder) as the decoder input corresponding to the interactive objects. The graph-node attention, based on an attention mechanism, ties high-weight context nodes to the current topic, and the target-object range is updated from the information in the user's voice input together with the dynamically related candidate set. During voice interaction, user speech is converted into text by the ASR module; a device-oriented session detection (DDUD) module confirms whether the voice dialog is directed at the device, and irrelevant speech is filtered using the already-loaded interaction object information (e.g., object enhancement information) and the currently recognized user intention. The LM module is a pre-trained natural language model that produces a vectorized representation of the input sequence and can support different languages. Its output passes through sentence embedding and is finally fed to the sequence decoder as the decoder input corresponding to the user's voice. In the match-decoding stage, the sequence decoder determines the target object based on the input sequence of user speech vectors and the object enhancement information of the candidate (interactive) objects.
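As a toy version of the match-decoding stage: assume the LM/sentence-embedding path has already produced one vector for the utterance and the GNN/graph-embedding path one vector per candidate's enhancement sequence; plain cosine similarity then stands in for the sequence decoder. This is purely illustrative (NumPy only), not the patent's decoder.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def decode_match(utterance_vec, candidate_vecs, threshold=0.5):
    """Score the utterance embedding against each candidate's
    enhancement-sequence embedding and pick the best one above a threshold;
    below the threshold the turn is treated as unresolved (or off-topic)."""
    scores = [cosine(utterance_vec, v) for v in candidate_vecs]
    best = int(np.argmax(scores))
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])
```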
Fig. 6 (a), (b) illustrate diagrams to which an embodiment of a voice interaction method of the present application is applied, according to an exemplary embodiment of the present application.
Referring to fig. 6 (a), after a session starts and an interface selection is performed, the user says "The Crown". Two objects to be selected match in the interface: the first object and the fourth object both contain "The Crown". In the case shown in (a) of fig. 6, the target object cannot be determined from this utterance alone. As shown in fig. 6 (b), the voice interaction method of the present application may compute object enhancement information for the objects to be selected from their characteristics and context information and present it on the display interface, for example heuristic object enhancement information (HOAI) giving each title's airing season. When the user continues with "The Crown, Season 2", the fourth object is selected, as shown in (b) of fig. 6. Furthermore, according to an exemplary embodiment of the present application, if there is voice input or noise unrelated to the current scene, the session management system automatically discards it.
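Replayed with the sketch helpers from above (illustrative stand-ins, not the patent's decoder), the fig. 6 dialog behaves as follows:

```python
crown = [
    Candidate("The Crown", {"season": 1}),   # first object in the interface
    Candidate("The Crown", {"season": 2}),   # fourth object in the interface
]
with_hoai = generate_hoai(crown)             # HOAI now carries the season

print(match_target(with_hoai, "The Crown"))            # None: both objects match
print(match_target(with_hoai, "The Crown, Season 2"))  # the fourth object
```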
The session management module according to an exemplary embodiment of the present application may filter irrelevant content based on DDUD features, through dynamic matching by an attention-mechanism model of the current session intention, thereby enhancing robustness.
Fig. 7 (a), (b) illustrate diagrams of embodiments in the AR mode to which the voice interaction method of the present application is applied, according to exemplary embodiments of the present application. In this embodiment, enhancement prompts (object enhancement information) are displayed on the AR device. When the user says "select cup", relevant information is brought into the session management module via the context extractor module, either through an extension module (e.g., an image recognition algorithm such as Mask-CNN) or through target-information retrieval from the server. In the AR device scene, candidate HOAI is presented, and when the user continues with "select cup, green" or just "green", the candidate object corresponding to "green" is selected as the target object.
Fig. 8 is a block diagram illustrating a voice interaction device 800 according to an exemplary embodiment of the present application.
Referring to fig. 8, a voice interaction apparatus 800 according to an exemplary embodiment of the present application may include a detection module 801, a control module 802, and a determination module 803, wherein the detection module 801 detects a first user voice, the control module 802 displays a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the plurality of candidate objects according to the detected first user voice, and the determination module 803 determines a target object from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects. Each module in the voice interaction apparatus 800 may be implemented by one or more sub-modules, and the name of each module may vary with its type. In various embodiments, some modules of the voice interaction apparatus 800 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present application may be combined into a single entity that equivalently performs the functions of the respective modules/elements prior to the combination.
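A structural sketch of fig. 8's three modules follows; the method names are assumptions, and the single-turn `interact` flow is simplified (the real determination module may loop, as in claims 4 and 10).

```python
class VoiceInteractionApparatus:
    """Detection (801), control (802) and determination (803) modules."""

    def __init__(self, detection, control, determination):
        self.detection = detection            # detects user voice
        self.control = control                # candidates + enhancement information
        self.determination = determination    # resolves the target object

    def interact(self):
        first_voice = self.detection.detect()
        candidates, hoai = self.control.display(first_voice)  # display and return both
        follow_up = self.detection.detect()                   # user input / 2nd voice
        return self.determination.determine(candidates, hoai, follow_up)
```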
In a voice interaction device according to an exemplary embodiment of the present application, the control module 802 is configured to: determining a plurality of candidate objects associated with the first user voice according to the detected first user voice; generating object enhancement information corresponding to a plurality of candidate objects, wherein generating the object enhancement information corresponding to the plurality of candidate objects comprises: context information of a plurality of candidate objects is acquired, and object enhancement information corresponding to the plurality of candidate objects is generated based on characteristics of the plurality of candidate objects and the context information. Referring to fig. 3, the voice interaction apparatus of the present application may obtain context information of a plurality of candidate objects from a server.
The determination module 803 is configured to: obtaining a user input or a second user voice received through the detection module 801; and determining a candidate object with object enhancement information matched with the user input or the second user voice in the plurality of candidate objects as the target object. For example, referring to fig. 6, when the second user voice is "second season" and the object enhancement information of one candidate object of the plurality of candidate objects is "second season", the one candidate object is determined as the target object.
According to another exemplary embodiment of the application, the determining module 803 is configured to: obtain a user input or a second user voice received through the detection module 801; and, when a plurality of candidate target objects are determined from the plurality of candidate objects based on the object enhancement information and the user input or second user voice, acquire updated object enhancement information, obtained by the control module 802 updating the object enhancement information corresponding to the candidate target objects based on their characteristics and context information, and determine the target object based on the updated object enhancement information. When the object enhancement information of some candidate objects is identical and those candidates cannot be distinguished based on the user input or the second user voice, the control module 802 may narrow the range of candidate objects, generate a plurality of candidate target objects, and determine the target object among them.
In the voice interaction apparatus of the present application, the control module 802 is further configured to: recognize a user intention based on the detected first user voice; and filter out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention. Cross-domain speech intent processing may also be performed. In addition, when the intention of newly received user speech is inconsistent with the user's current intention, the voice interaction apparatus can depart from the intent context of the current topic, feed information back to the context extractor module, and trigger an update of the context graph's attention features, thereby providing more HOAI, supporting cross-domain voice session management, and improving the multi-turn conversation experience.
The control module 802 is further configured to: acquire a third user voice received through the detection module 801 and, when the intention of the third user voice is different from the recognized user intention, re-determine the candidate objects according to the third user voice and generate object enhancement information corresponding to the re-determined candidate objects.
It should be understood that each unit in the voice interaction apparatus according to the exemplary embodiments of the present application may be implemented as a hardware component and/or a software component. Those skilled in the art may implement the various units using, for example, Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs), depending on the processing performed by the defined various units.
Further, according to another exemplary embodiment of the present application, there is provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the voice interaction method as described above.
According to another exemplary embodiment of the application, a computing device is provided, wherein the computing device comprises a processor and a memory storing a computer program which, when executed by the processor, implements the voice interaction method described above.
Current voice interaction schemes often suffer from the following problems. In ordinary natural voice interaction, users tend to give brief utterances and refine them step by step (for example, "take a cup", then "green"), which under existing schemes forces them to activate the conversation repeatedly, trigger voice input many times, and repeat the wake-up word; such schemes depend on fixed sentence patterns and require user education, yet education in voice commands adapts poorly to new scenes. When the interaction objects are similar, the system cannot effectively complete the selection of a specific target object; when there are many selectable contents, they must be displayed across multiple pages, and simple numeric labels yield a poor user experience and require triggering the voice dialog many times to complete an operation. When the operated objects carry the same text labels (such as the same movie name from different content providers), the system pops up a prompt box (Y/N) to obtain more information to confirm the selection, which makes the user experience unnatural. Current schemes are also affected by external speech and noise during interaction, which suspends the session management of the current topic. Finally, current rule-based session management schemes and fixed-sentence education methods cannot well support diverse devices and voice interaction in complex environments, because of poor user experience, difficulty in identifying the target entity, and similar problems.
According to the voice interaction method of the exemplary embodiment of the present application, a more natural and continuous voice dialog selection process can be realized, and the user does not have to repeatedly open a voice conversation with a dedicated operation key or wake-up word. When it is difficult for the user to make a selection via a touch screen (for example, while driving), or when a target object is searched for and selected on a non-physical device (an AR device), heuristic object enhancement information (HOAI) is presented adaptively, clearly and accurately helping the user recognize the state of, and differences among, the candidate objects, so that the target object can be determined through natural voice interaction.
By providing enhancement information associated with the interaction objects, the voice interaction method can markedly improve the usability of voice interaction and alleviate the poor interaction experience and low intent-matching accuracy caused by the limitations of fixed voice interaction sentence patterns.
In this application, terms such as "module," "unit," "means," and the like are used to refer to an element that performs at least one function or operation, and such element may be implemented as hardware or software, or a combination of hardware and software. Further, components may be integrated in at least one module or chip and implemented in at least one processor, except when each of a plurality of "modules," "units," "components," and so forth need to be implemented in separate hardware.
Various embodiments set forth herein may be implemented as software including one or more instructions stored in a storage medium readable by a machine (e.g., a mobile device or an electronic device). For example, a processor of the machine may invoke and execute at least one of the one or more instructions stored in the storage medium, with or without the use of one or more other components. This enables the machine to perform at least one function according to the invoked instruction. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium, where "non-transitory" simply means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data stored semi-permanently and data stored temporarily in the storage medium.
Although a few exemplary embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope and spirit of the invention as defined by the appended claims and their equivalents.

Claims (14)

1. A voice interaction method, comprising:
detecting a first user voice;
displaying a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the candidate objects according to the detected first user voice;
determining a target object from the plurality of candidate objects based on object enhancement information corresponding to the plurality of candidate objects.
2. The method of claim 1, wherein displaying a plurality of candidate objects associated with the first user speech and object enhancement information corresponding to the plurality of candidate objects in accordance with the detected first user speech comprises:
determining a plurality of candidate objects associated with the first user voice according to the detected first user voice;
generating object enhancement information corresponding to the plurality of candidate objects,
wherein the generating of the object enhancement information corresponding to the plurality of candidate objects comprises:
obtaining context information of the plurality of candidate objects,
generating object enhancement information corresponding to the plurality of candidate objects based on the characteristics and context information of the plurality of candidate objects.
3. The method of claim 1, wherein determining a target object from the plurality of candidate objects based on object enhancement information corresponding to the plurality of candidate objects comprises:
receiving a user input or a second user voice;
determining a candidate object having object enhancement information matching the user input or the second user voice among the plurality of candidate objects as a target object.
4. The method of claim 1, wherein determining a target object from the plurality of candidate objects based on object enhancement information corresponding to the plurality of candidate objects comprises:
receiving a user input or a second user voice;
when a plurality of candidate target objects are determined from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects and the user input or the second user voice, the object enhancement information corresponding to the plurality of candidate target objects is updated based on the characteristics and the context information of the plurality of candidate target objects, and the target object is determined based on the updated object enhancement information.
5. The method of claim 1, wherein the method further comprises:
recognizing a user intention based on the detected first user voice;
filtering out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention.
6. The method of claim 5, wherein the method further comprises:
receiving a third user voice;
re-determining the candidate object according to the third user speech when the intention of the third user speech is different from the recognized user intention,
generating object enhancement information corresponding to the re-determined candidate object.
7. A voice interaction device, comprising:
a detection module that detects a first user voice;
a control module that displays, according to the detected first user voice, a plurality of candidate objects associated with the first user voice and object enhancement information corresponding to the candidate objects;
a determination module that determines a target object from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects.
8. The apparatus of claim 7, wherein the control module is configured to:
determining a plurality of candidate objects associated with the first user voice according to the detected first user voice;
generating object enhancement information corresponding to the plurality of candidate objects,
wherein generating object enhancement information corresponding to the plurality of candidate objects comprises:
obtaining context information of the plurality of candidate objects,
generating object enhancement information corresponding to the plurality of candidate objects based on the characteristics and context information of the plurality of candidate objects.
9. The apparatus of claim 7, wherein the determination module is configured to:
acquiring user input or second user voice received by the detection module;
determining a candidate object having object enhancement information matching the user input or the second user voice among the plurality of candidate objects as a target object.
10. The apparatus of claim 7, wherein the determination module is configured to:
acquiring user input or second user voice received by the detection module;
when a plurality of candidate target objects are determined from the plurality of candidate objects based on the object enhancement information corresponding to the plurality of candidate objects and the user input or the second user voice, updated object enhancement information obtained by the control module updating the object enhancement information corresponding to the plurality of candidate target objects based on the characteristics and context information of the plurality of candidate target objects is acquired, and the target object is determined based on the updated object enhancement information.
11. The apparatus of claim 7, wherein the control module is further configured to:
recognizing a user intention based on the detected first user voice;
filter out extraneous external speech or noise based on the object enhancement information corresponding to the plurality of candidate objects and the recognized user intention.
12. The apparatus of claim 11, wherein the control module is further configured to:
acquiring a third user voice received through the detection module,
re-determining the candidate object according to the third user speech when the intention of the third user speech is different from the recognized user intention,
generating object enhancement information corresponding to the re-determined candidate object.
13. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method of any of claims 1-6.
14. A computing device, comprising:
a processor;
a memory storing a computer program which, when executed by the processor, implements the method of any one of claims 1 to 6.
CN202110948740.8A 2021-08-18 2021-08-18 Voice interaction method and device Pending CN113470649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110948740.8A CN113470649A (en) 2021-08-18 2021-08-18 Voice interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110948740.8A CN113470649A (en) 2021-08-18 2021-08-18 Voice interaction method and device

Publications (1)

Publication Number Publication Date
CN113470649A (en) 2021-10-01

Family

Family ID: 77866732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110948740.8A Pending CN113470649A (en) 2021-08-18 2021-08-18 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN113470649A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512704A (en) * 2022-11-09 2022-12-23 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040176954A1 (en) * 2003-03-05 2004-09-09 Microsoft Corporation Presentation of data based on user input
CN106847278A (en) * 2012-12-31 2017-06-13 威盛电子股份有限公司 System of selection and its mobile terminal apparatus and information system based on speech recognition
US20170277764A1 (en) * 2016-03-25 2017-09-28 Microsoft Technology Licensing, Llc Enhancing object representations using inferred user intents
WO2018035500A1 (en) * 2016-08-19 2018-02-22 Fyusion, Inc. Automatic tagging of dynamic objects on a multi-view digital representation
CN107832036A (en) * 2017-11-22 2018-03-23 北京小米移动软件有限公司 Sound control method, device and computer-readable recording medium
US20190042585A1 (en) * 2017-08-01 2019-02-07 Yandex Europe Ag Method of and system for recommending media objects
CN109960749A (en) * 2019-02-22 2019-07-02 清华大学 Model acquisition methods, keyword generation method, device, medium and calculating equipment
CN110110173A (en) * 2012-08-08 2019-08-09 谷歌有限责任公司 Search result rank and presentation
CN110134466A (en) * 2018-02-02 2019-08-16 北京三星通信技术研究有限公司 Information processing method and terminal device
CN110188167A (en) * 2019-05-17 2019-08-30 北京邮电大学 A kind of end-to-end session method and system incorporating external knowledge
CN110275948A (en) * 2019-05-30 2019-09-24 平安科技(深圳)有限公司 Free jump method, device and the medium of Self-Service
AU2018310111A1 (en) * 2017-08-01 2019-11-07 Samsung Electronics Co., Ltd. Electronic device and method for providing search result thereof
CN110990549A (en) * 2019-12-02 2020-04-10 腾讯科技(深圳)有限公司 Method and device for obtaining answers, electronic equipment and storage medium
CN111143509A (en) * 2019-12-09 2020-05-12 天津大学 Dialog generation method based on static-dynamic attention variation network
CN112000787A (en) * 2020-08-17 2020-11-27 上海小鹏汽车科技有限公司 Voice interaction method, server and voice interaction system
CN112614491A (en) * 2020-12-11 2021-04-06 广州橙行智动汽车科技有限公司 Vehicle-mounted voice interaction method and device, vehicle and readable medium
CN112667076A (en) * 2020-12-23 2021-04-16 广州橙行智动汽车科技有限公司 Voice interaction data processing method and device

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040176954A1 (en) * 2003-03-05 2004-09-09 Microsoft Corporation Presentation of data based on user input
US20200019557A1 (en) * 2012-08-08 2020-01-16 Google Llc Search result ranking and presentation
CN110110173A (en) * 2012-08-08 2019-08-09 谷歌有限责任公司 Search result rank and presentation
CN106847278A (en) * 2012-12-31 2017-06-13 威盛电子股份有限公司 System of selection and its mobile terminal apparatus and information system based on speech recognition
US20170277764A1 (en) * 2016-03-25 2017-09-28 Microsoft Technology Licensing, Llc Enhancing object representations using inferred user intents
WO2018035500A1 (en) * 2016-08-19 2018-02-22 Fyusion, Inc. Automatic tagging of dynamic objects on a multi-view digital representation
AU2018310111A1 (en) * 2017-08-01 2019-11-07 Samsung Electronics Co., Ltd. Electronic device and method for providing search result thereof
US20190042585A1 (en) * 2017-08-01 2019-02-07 Yandex Europe Ag Method of and system for recommending media objects
CN107832036A (en) * 2017-11-22 2018-03-23 北京小米移动软件有限公司 Sound control method, device and computer-readable recording medium
CN110134466A (en) * 2018-02-02 2019-08-16 北京三星通信技术研究有限公司 Information processing method and terminal device
CN109960749A (en) * 2019-02-22 2019-07-02 清华大学 Model acquisition methods, keyword generation method, device, medium and calculating equipment
CN110188167A (en) * 2019-05-17 2019-08-30 北京邮电大学 A kind of end-to-end session method and system incorporating external knowledge
CN110275948A (en) * 2019-05-30 2019-09-24 平安科技(深圳)有限公司 Free jump method, device and the medium of Self-Service
CN110990549A (en) * 2019-12-02 2020-04-10 腾讯科技(深圳)有限公司 Method and device for obtaining answers, electronic equipment and storage medium
CN111143509A (en) * 2019-12-09 2020-05-12 天津大学 Dialog generation method based on static-dynamic attention variation network
CN112000787A (en) * 2020-08-17 2020-11-27 上海小鹏汽车科技有限公司 Voice interaction method, server and voice interaction system
CN112614491A (en) * 2020-12-11 2021-04-06 广州橙行智动汽车科技有限公司 Vehicle-mounted voice interaction method and device, vehicle and readable medium
CN112667076A (en) * 2020-12-23 2021-04-16 广州橙行智动汽车科技有限公司 Voice interaction data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAO ZHOU等: "Commonsense Knowledge Aware Conversation Generation with Graph Attention", 《PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE》, pages 4623 - 4629 *
JANE: "对话清华大学周昊,详解IJCAI杰出论文及其背后的故事", Retrieved from the Internet <URL:https://developer.aliyun.com/article/623360> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512704A (en) * 2022-11-09 2022-12-23 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN115512704B (en) * 2022-11-09 2023-08-29 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN108829235B (en) Voice data processing method and electronic device supporting the same
US10692504B2 (en) User profiling for voice input processing
US10418027B2 (en) Electronic device and method for controlling the same
KR102261552B1 (en) Providing Method For Voice Command and Electronic Device supporting the same
CN109343819B (en) Display apparatus and method for controlling display apparatus in voice recognition system
EP3251115B1 (en) Updating language understanding classifier models for a digital personal assistant based on crowd-sourcing
CN111261144B (en) Voice recognition method, device, terminal and storage medium
CN109710727B (en) System and method for natural language processing
US9218052B2 (en) Framework for voice controlling applications
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
CN110085222B (en) Interactive apparatus and method for supporting voice conversation service
US10250935B2 (en) Electronic apparatus controlled by a user's voice and control method thereof
US20160372110A1 (en) Adapting voice input processing based on voice input characteristics
US11468881B2 (en) Method and system for semantic intelligent task learning and adaptive execution
US20160139877A1 (en) Voice-controlled display device and method of voice control of display device
CN110164421B (en) Voice decoding method, device and storage medium
WO2016042814A1 (en) Interaction apparatus and method
CN111949240A (en) Interaction method, storage medium, service program, and device
CN109192212B (en) Voice control method and device
CN103426429B (en) Sound control method and device
US11151995B2 (en) Electronic device for mapping an invoke word to a sequence of inputs for generating a personalized command
KR20210036527A (en) Electronic device for processing user utterance and method for operating thereof
CN113470649A (en) Voice interaction method and device
CN110858479A (en) Voice recognition model updating method and device, storage medium and electronic equipment
KR20110025510A (en) Electronic device and method of recognizing voice using the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination