CN110706699A - Method and system for realizing interaction task by adopting voice recognition mode

Method and system for realizing interaction task by adopting voice recognition mode

Info

Publication number
CN110706699A
CN110706699A (application CN201910921533.6A)
Authority
CN
China
Prior art keywords
language skill
information
language
module
recommendation table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910921533.6A
Other languages
Chinese (zh)
Inventor
魏涛
胡泊
吴秀娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center, Samsung Electronics Co Ltd filed Critical Samsung Electronics China R&D Center
Priority to CN201910921533.6A
Publication of CN110706699A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method and a system for realizing an interactive task by means of voice recognition. A terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene. When an interactive task is to be realized, the terminal directly obtains the current environment information to determine the context information, queries the language skill recommendation table according to the determined context information, and executes the interactive task according to that table. The embodiment of the invention thereby realizes interactive tasks by voice recognition simply and accurately.

Description

Method and system for realizing interaction task by adopting voice recognition mode
Technical Field
The invention relates to the technical field of computers, in particular to a method and a system for realizing an interactive task by adopting a voice recognition mode.
Background
Currently, terminals can provide various types of application service. When providing an application service, a terminal can offer the user voice recognition assistant software: the user interacts with the terminal by voice, the software recognizes the speech, and the terminal provides the corresponding application service. A terminal's speech recognition function can be realized in three forms: interactive task, knowledge question-and-answer, and chat. In the interactive task form, the terminal recognizes the user's direct voice input, so the interaction intent between the user and the terminal's application service is completed without multiple operations on the terminal's voice recognition interface.
However, the usage rate of the interactive-task voice recognition method provided by terminals is not very high, because operating a terminal's application services by voice has the following disadvantages: 1) voice is unsuitable for public occasions and raises privacy concerns; 2) when the terminal is in a noisy environment, the voice recognition effect is poor; 3) owing to factors such as inaccurate user pronunciation and the many different ways of expressing the same meaning, the accuracy of speech recognition and speech understanding still needs to improve; 4) some users are not accustomed to the speech recognition interface provided by the terminal.
Therefore, how to realize interactive tasks simply and accurately by voice recognition has become a problem to be solved urgently.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for implementing an interactive task by speech recognition, and the method can do so simply and accurately.
Embodiments of the present invention also provide a system for implementing an interactive task by speech recognition, and the system likewise can do so simply and accurately.
The embodiment of the invention is realized as follows:
a method for realizing interaction tasks by adopting a voice recognition mode comprises the following steps:
a terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene;
when an interactive task is to be realized, current environment information is obtained to determine the context information, the language skill recommendation table of the scene to which that context belongs is queried according to the determined context information, and the interactive task of the terminal is executed according to that language skill recommendation table.
The division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
The language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
Before the interactive task of the terminal is executed, the method further comprises the following steps:
selecting one item of language skill information from the language skill recommendation table, and executing the interactive task of the terminal according to the selected item.
Querying the language skill recommendation table of the scene further comprises: presenting the recommendation table of that scene as menu items with pull-down lists.
A system for realizing an interaction task by adopting a voice recognition mode comprises: a voice assistant plug-in module, a context sensing module, a language input information acquisition module, a language skill generation module, a language skill recommendation module, and a language skill display module, wherein,
the voice assistant plug-in module is used for acquiring voice information and performing voice recognition;
the context sensing module is used for acquiring current environment information;
the language input information acquisition module is used for recognizing speech in the current environment to obtain text and semantic information;
the language skill generation module is used for determining corresponding context information based on the obtained text and semantic information and generating language skill information;
the language skill recommendation module is used for dividing contexts into scenes according to the context information and generating a language skill recommendation table for each scene, wherein the language skill recommendation table comprises language skill information;
and the language skill display module is used for calling the context sensing module to obtain the current environment information and determine the context information, determining the corresponding scene according to the determined context information, obtaining the corresponding language skill recommendation table for that scene from the language skill recommendation module, and realizing and displaying the interactive task of the terminal according to that recommendation table.
The environment information comprises: the current user, time, place, interface, and/or IoT devices.
The division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
The language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
The language skill display module is further configured to display the corresponding language skill recommendation table by presenting the recommendation table of the scene as menu items with pull-down lists.
As seen above, the terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene; when an interactive task is to be realized, the terminal directly obtains the current environment information to determine the context information, queries the language skill recommendation table according to the determined context information, and executes the interactive task according to that table. Thus, when the interactive task of the terminal is executed through voice, personalized, context-aware language skill recommendation is performed by analyzing the history of the user's interactive tasks and the user's current usage habits, so that the user's intention is predicted and hit well.
Drawings
FIG. 1 is a flowchart of a method for implementing an interactive task by using speech recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an example of a method for implementing an interactive task by using speech recognition according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a system for implementing an interaction task by using a speech recognition method according to an embodiment of the present invention;
FIG. 4 is a display diagram of the language skill recommendation table on a mobile phone according to an embodiment of the present invention;
FIG. 5 is a display diagram of the language skill recommendation table on a television according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
The method comprises the steps that the terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene; when an interactive task is to be realized, the terminal directly obtains the current environment information to determine the context information, queries the language skill recommendation table according to the determined context information, and executes the interactive task of the terminal according to that table.
Therefore, when the interactive task of the terminal is executed through voice, personalized, context-aware language skill recommendation is performed by analyzing the history of the user's interactive tasks and the user's current usage habits, so that the user's intention is predicted and hit well.
Fig. 1 is a flowchart of a method for implementing an interactive task by using a speech recognition method according to an embodiment of the present invention, which includes the following specific steps:
101, the terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene;
and 102, when an interactive task is to be realized, current environment information is obtained to determine the context information, the language skill recommendation table of the scene to which that context belongs is queried according to the determined context information, and the interactive task of the terminal is executed according to that language skill recommendation table.
In the method, the division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
In the method, the language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
In the method, before the interactive task of the terminal is executed, the following is further included:
selecting one item of language skill information from the language skill recommendation table, and executing the interactive task of the terminal according to the selected item.
In the method, querying the language skill recommendation table of the scene further comprises: presenting the recommendation table of that scene as menu items with pull-down lists.
The method mainly comprises two key processes: one is the generation of the language skill recommendation table; the other is its display. The implementation of these two key processes is described below in conjunction with FIG. 2, which is a flowchart of an example of a method for implementing an interactive task by speech recognition according to an embodiment of the present invention.
The first key process, the generation of the language skill recommendation table, is described with reference to arrows 1-5 in the figure:
1. when the user wakes the voice assistant plug-in module, it is triggered and notifies the language input information acquisition module to start collecting information;
2-3. the language input information acquisition module acquires the current environment information through the context sensing module, then requests from the voice assistant plug-in module the text obtained by recognizing the user's speech and the semantic information obtained after natural language understanding;
4. the language input information acquisition module passes the acquired text, semantic information, and current environment information to the language skill generation module, which generates or updates the language skill base from this information and notifies the language skill recommendation module;
5. the language skill recommendation module starts under specific conditions (for example, when the language skill base has been updated or the system is idle), clusters a large number of different contexts into several scenes, and generates a corresponding language skill recommendation table for each scene.
The second key process, the display of the language skill recommendation table, is described with reference to arrows A-D in the figure:
A. when the user wakes the voice assistant plug-in module, it is triggered and notifies the language skill display module;
B-C. the language skill display module acquires the current environment information through the context sensing module, calls the language skill recommendation module with that context information, obtains the language skill recommendation table of the scene type to which the current environment belongs, and displays the table through a graphical interface;
D. when the user selects a recommended language skill, the language skill display module sends the text of that skill to the voice assistant plug-in module, which treats the text as the recognition result of the user's voice input and executes the interactive task.
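As an illustration of step D, the following is a minimal Python sketch of feeding a selected skill back through the recognition path; the class and method names are hypothetical assumptions, since the patent does not describe a concrete implementation. The point it shows is that the selected text re-enters the assistant as if it were recognized speech, so the normal command path executes unchanged.

```python
class VoiceAssistantPlugin:
    """Stand-in for the voice assistant plug-in module (e.g. Bixby)."""
    def on_recognition_result(self, text: str) -> None:
        # In the real module this would run NLU and execute the task.
        print(f"Executing interactive task for: {text!r}")

class LanguageSkillDisplayModule:
    def __init__(self, assistant: VoiceAssistantPlugin) -> None:
        self.assistant = assistant

    def on_skill_selected(self, skill_text: str) -> None:
        # Step D: the selected text is treated exactly like a speech
        # recognition result, so the normal execution path is reused.
        self.assistant.on_recognition_result(skill_text)

LanguageSkillDisplayModule(VoiceAssistantPlugin()).on_skill_selected("Turn on the TV")
```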
Fig. 3 is a schematic structural diagram of a system for implementing an interaction task by speech recognition according to an embodiment of the present invention. The system includes: a voice assistant plug-in module, a context sensing module, a language input information acquisition module, a language skill generation module, a language skill recommendation module, and a language skill display module, wherein,
the voice assistant plug-in module is used for acquiring voice information and performing voice recognition;
the context sensing module is used for acquiring current environment information;
the language input information acquisition module is used for recognizing speech in the current environment to obtain text and semantic information;
the language skill generation module is used for determining corresponding context information based on the obtained text and semantic information and generating language skill information;
the language skill recommendation module is used for dividing contexts into scenes according to the context information and generating a language skill recommendation table for each scene, wherein the language skill recommendation table comprises language skill information;
and the language skill display module is used for calling the context sensing module to obtain the current environment information and determine the context information, determining the corresponding scene according to the determined context information, obtaining the corresponding language skill recommendation table for that scene from the language skill recommendation module, and realizing and displaying the interactive task of the terminal according to that recommendation table.
In the system, the environment information comprises: the current user, time, place, interface, and/or IoT devices.
In the system, the division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
In the system, the language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
In the system, the language skill display module is further configured to display the corresponding language skill recommendation table by presenting the recommendation table of the scene as menu items with pull-down lists.
FIG. 4 shows the language skill recommendation table displayed on a mobile phone according to an embodiment of the present invention, and FIG. 5 shows it displayed on a television. In this embodiment, the voice assistant plug-in module is Bixby.
As shown in FIG. 5, each time Bixby is invoked, the language skill recommendation table displayed differs with the scene, and so do the recommended skills in it. For example, through learning, when a user turns on the television and wakes Bixby, a corresponding series of language skills is recommended. The specific process is as follows.
1) The voice assistant plug-in module realizes four functions through interaction with the user:
a) when the module is awakened by the user, it triggers the collection of language input information;
b) it supplies the text obtained after speech recognition and the semantic information obtained after natural language understanding, the semantic information comprising structured information such as intent and slots;
c) when the module is awakened by the user, it triggers the display of the language skill recommendation table;
d) when the user selects a recommended language skill, the text of that entry in the language skill recommendation table is sent to the module as the recognition result of the user's voice input.
2) The context sensing module calls an application programming interface (API) of the terminal to obtain the environment information of the device, such as the current IoT devices, the user role, the time, the location, the current application, and the interactive task currently to be executed.
3) The language input information acquisition module acquires the current environment information by calling the context sensing module, acquires the text and semantic information of the current voice input by calling the voice assistant plug-in module, and judges from the intent of the current language input whether it constitutes a valid interactive task in the current scene. If it is a valid interactive task in the current scene, the module calls the language skill generation module with the collected language input information; if not, it exits directly. The module itself does not store historical language input information; the format of the language input information, with examples, is shown in Table 1. So that different contexts can be clustered later, the feature values of a context are discretized; for example, the day is divided into time slots of 2 hours.
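To make the discretization concrete, here is a small Python sketch under stated assumptions: the feature names and the set of features are illustrative, not the patent's actual schema; only the 2-hour time bucketing is taken from the text.

```python
from datetime import datetime

def discretize_context(now: datetime, location: str, current_app: str) -> dict:
    """Bucket continuous context features so contexts can later be clustered."""
    return {
        "time_slot": now.hour // 2,    # the 2-hour units mentioned in the text
        "weekday": now.weekday() < 5,  # workday vs. weekend
        "location": location,          # already categorical
        "current_app": current_app,
    }

print(discretize_context(datetime(2019, 9, 27, 15, 30), "office", "TV app"))
# {'time_slot': 7, 'weekday': True, 'location': 'office', 'current_app': 'TV app'}
```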
[Table 1: format and examples of the language input information; rendered as an image in the original publication.]

4) The language skill generation module analyzes the language input information, and generates and maintains the language skill base, which includes the language skill recommendation table and a context table.
The format of the language skill recommendation table, with examples, is shown in Table 2:

[Table 2 is rendered as an image in the original publication.]
The format of the context table, with examples, is shown in Table 3:

[Table 3 is rendered as an image in the original publication.]
When new language input information is obtained, the language skill generation module queries the language skill recommendation table for the corresponding language skill according to the semantic information. If the language skill exists, its NL Skill Id is returned; if not, a language skill record is created from the current language input information and its NL Skill Id is returned. Then the context table is searched according to the context information of the current language input. If the same context does not exist, a context record is created. The NL Skill Id obtained above is then appended to the end of that context's NL Skill Id History field. These table lookups can be accelerated by creating hash indexes or the like.
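The lookup-or-create flow above can be sketched in Python as follows; the dicts play the role of the hash indexes the text mentions, and the (intent, slots) key and record layouts are assumptions, since the patent only names the NL Skill Id and NL Skill Id History fields.

```python
skill_index = {}    # (intent, sorted slots) -> NL Skill Id  (the "hash index")
skill_records = {}  # NL Skill Id -> language skill record
context_table = {}  # context key -> {"history": [NL Skill Id, ...]}

def record_language_input(intent, slots, text, context_key):
    key = (intent, tuple(sorted(slots.items())))
    nl_skill_id = skill_index.get(key)
    if nl_skill_id is None:             # no such skill yet: create a record
        nl_skill_id = len(skill_records) + 1
        skill_index[key] = nl_skill_id
        skill_records[nl_skill_id] = {"intent": intent, "slots": slots, "text": text}
    ctx = context_table.setdefault(context_key, {"history": []})
    ctx["history"].append(nl_skill_id)  # extend the NL Skill Id History field
    return nl_skill_id

print(record_language_input("PlayMusic", {"genre": "jazz"}, "play some jazz",
                            ("user1", 7, "home")))
```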
5) When the language skill base has been updated and the system is idle, the language skill recommendation module regenerates the language skill recommendation tables. First, a machine learning method is used to cluster a large number of different contexts into several representative scenes according to their usage of different language skills. Then, for each scene, the candidate language skills are sorted by usage rate to generate the corresponding recommendation table. In addition, the module maintains a list of common device language skills, which may be compiled manually by interaction experts or obtained as big-data statistics from the voice assistant plug-in module.
a) Users' common language skills differ across scenes (for example, working hours on non-weekend days, watching video at night, being on a business trip, commuting, or a child using the television). By clustering contexts according to their language skill usage, the scenes and the user's usage habits can be mined automatically. Each context is first vectorized according to its language skill usage: assuming 200 language skills exist, each context is represented as a 200-dimensional vector, in which the value of the i-th dimension is the usage rate, in that context, of the language skill whose NL Skill Id is i. The usage rate calculation considers the frequency of use and a time attenuation factor, so that the weight of a language skill that has not been used for a long time decreases. All the different contexts are then clustered, and contexts with similar language skill usage rates are grouped into one class. The similarity of language skill usage can be computed in many ways; one is the cosine of the angle between the vectors.
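A runnable sketch of this vectorization-and-clustering step follows, with assumptions made explicit: the exponential decay is one plausible reading of the "frequency and time attenuation factors", k-means over unit-normalized vectors stands in for cosine-based clustering, and scikit-learn is used purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

N_SKILLS = 200   # the example size given in the text
DECAY = 0.99     # per-day attenuation, so long-unused skills lose weight

def usage_vector(history):
    """history: (nl_skill_id, days_ago) pairs observed for one context."""
    v = np.zeros(N_SKILLS)
    for skill_id, days_ago in history:
        v[skill_id - 1] += DECAY ** days_ago
    norm = np.linalg.norm(v)
    # Unit-normalizing makes Euclidean k-means behave like cosine clustering.
    return v / norm if norm else v

all_context_histories = [           # toy data for four contexts
    [(1, 0), (1, 2), (7, 1)],
    [(1, 1), (7, 3)],
    [(42, 0), (42, 5)],
    [(42, 2), (99, 4)],
]
vectors = np.array([usage_vector(h) for h in all_context_histories])
scene_ids = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(scene_ids)  # contexts with similar skill usage share a scene id
```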
The format of the scene table, with examples, is shown in Table 4:

Context Id (see the context table)    Scenario Id
1                                     1 (e.g. working hours on non-weekend days)
2                                     1
3                                     2 (e.g. watching video at night)
4                                     3 (e.g. on a business trip)
5                                     4 (e.g. IoT devices online)
Table 4

b) In order to classify unseen context information (contexts with no history of language skill use) into a scene type, a context classification model is trained with a classification algorithm such as a decision tree, using the feature values of each context as input and the correspondences in the scene table as labels.
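The classification step can be sketched as follows; the decision tree follows the algorithm named in the text, while the feature names and the one-hot encoding via scikit-learn's DictVectorizer are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Discretized context features (inputs) with scene ids from clustering (labels).
train_contexts = [
    {"time_slot": 5,  "weekday": True,  "location": "office"},
    {"time_slot": 10, "weekday": False, "location": "home"},
    {"time_slot": 10, "weekday": True,  "location": "home"},
]
train_scene_ids = [1, 2, 2]  # correspondences as in the scene table (Table 4)

vec = DictVectorizer(sparse=False)  # one-hot encodes the categorical features
X = vec.fit_transform(train_contexts)
clf = DecisionTreeClassifier().fit(X, train_scene_ids)

# A context never seen before (no language skill history) is mapped to a scene:
unseen = {"time_slot": 10, "weekday": False, "location": "home"}
print(clf.predict(vec.transform([unseen])))  # -> [2]
```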
c) For each scene type, a corresponding language skill recommendation table is generated, sorted by the user's usage rate. The usage rate of a language skill in a given scene can be calculated in various ways; one is to accumulate the language skill usage vectors of all the contexts belonging to that scene.
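Continuing the earlier sketch, the per-scene recommendation list can be produced by accumulating the usage vectors of a scene's contexts and ranking skills, as the text describes; the top-k cutoff and the toy data are assumptions.

```python
import numpy as np

def scene_recommendations(vectors, scene_ids, scene, top_k=5):
    """Rank NL Skill Ids for one scene by accumulated usage."""
    members = vectors[np.asarray(scene_ids) == scene]
    total_usage = members.sum(axis=0)          # accumulate per-skill usage
    ranked = np.argsort(total_usage)[::-1]     # most used first
    return [int(i) + 1 for i in ranked[:top_k] if total_usage[i] > 0]

# Toy data: three contexts, five skills; contexts 0 and 1 belong to scene 1.
vectors = np.array([
    [0.9, 0.0, 0.4, 0.0, 0.0],
    [0.8, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
print(scene_recommendations(vectors, [1, 1, 2], scene=1))  # -> [1, 3, 2]
```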
The format of the per-scene language skill recommendation table, with examples, is shown in Table 5:

[Table 5 is rendered as an image in the original publication.]
6) The language skill display module calls the recommendation module's context classification model to find the corresponding scene type from the context information collected when the voice assistant plug-in module is awakened, then obtains the language skill recommendation table of that scene for display. The display interface of the recommendation table is integrated into the interface of the voice assistant plug-in module and is presented semi-transparently when the plug-in module is awakened; the concrete visual style is designed to match the graphical interface of the actual device so that the user can choose easily. If the user selects a recommended language skill, the interactive task is realized through a call to the voice assistant plug-in module. If the user chooses to use voice input directly, the interface hides the language skill recommendation table and the voice assistant plug-in module handles the input entirely. If a scene has many recommended items, then to improve usability, language skills with the same Intent but different Slot values can be merged into a single menu item: the default text is the language skill with the highest usage rate, and clicking the pull-down list next to the item selects a language skill with another Slot value (see the expanded list after clicking "Change Contact" in FIG. 4). If a scene has few recommended items, it is supplemented with the default common device language skills.
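The menu-merging rule at the end of step 6 can be sketched like this; the record fields are illustrative assumptions, and the input list is taken to be already sorted by usage rate, highest first.

```python
from collections import defaultdict

def build_menu(recommended_skills):
    """Collapse skills sharing an Intent into one menu item with a drop-down."""
    groups = defaultdict(list)
    for skill in recommended_skills:           # input sorted by usage rate
        groups[skill["intent"]].append(skill)
    return [
        {
            "default_text": variants[0]["text"],            # highest-usage variant
            "dropdown": [v["text"] for v in variants[1:]],  # other Slot values
        }
        for variants in groups.values()
    ]

print(build_menu([
    {"intent": "Call", "text": "Call Mom"},
    {"intent": "Call", "text": "Call Tom"},       # same Intent, different Slot
    {"intent": "PlayMusic", "text": "Play jazz"},
]))
```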
It can be seen that the embodiment of the invention organically combines the advantages of voice input and graphical interface input to perform personalized, context-aware language skill recommendation. Based on the user's individual language interaction history, different contexts are clustered into scenes by a machine learning method and a corresponding language skill recommendation table is generated for each scene. The approach dynamically learns the user's usage habits, adapts to changes in those habits, embeds seamlessly into the application scenario of the voice assistant plug-in module, and improves the usability, friendliness, and privacy of the terminal's interaction interface.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for realizing an interaction task by adopting a voice recognition mode, characterized by comprising the following steps:
a terminal recognizes speech in the current environment to obtain text and semantic information, determines corresponding context information based on the obtained text and semantic information, divides contexts into scenes based on the context information, and generates a language skill recommendation table for each scene;
when an interactive task is to be realized, current environment information is obtained to determine the context information, the language skill recommendation table of the scene to which that context belongs is queried according to the determined context information, and the interactive task of the terminal is executed according to that language skill recommendation table.
2. The method of claim 1, wherein the division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
3. The method of claim 1, wherein the language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
4. The method of claim 3, further comprising, before the interactive task of the terminal is executed:
selecting one item of language skill information from the language skill recommendation table, and executing the interactive task of the terminal according to the selected item.
5. The method of claim 1, wherein querying the language skill recommendation table of the scene further comprises: presenting the recommendation table of that scene as menu items with pull-down lists.
6. A system for realizing an interaction task by adopting a voice recognition mode, characterized by comprising: a voice assistant plug-in module, a context sensing module, a language input information acquisition module, a language skill generation module, a language skill recommendation module, and a language skill display module, wherein,
the voice assistant plug-in module is used for acquiring voice information and performing voice recognition;
the context sensing module is used for acquiring current environment information;
the language input information acquisition module is used for recognizing speech in the current environment to obtain text and semantic information;
the language skill generation module is used for determining corresponding context information based on the obtained text and semantic information and generating language skill information;
the language skill recommendation module is used for dividing contexts into scenes according to the context information and generating a language skill recommendation table for each scene, wherein the language skill recommendation table comprises language skill information;
and the language skill display module is used for calling the context sensing module to obtain the current environment information and determine the context information, determining the corresponding scene according to the determined context information, obtaining the corresponding language skill recommendation table for that scene from the language skill recommendation module, and realizing and displaying the interactive task of the terminal according to that recommendation table.
7. The system of claim 6, wherein the environment information comprises: the current user, time, place, interface, and/or IoT devices.
8. The system of claim 6, wherein the division of contexts into scenes based on the context information and the generation of the language skill recommendation table for each scene are performed using a machine learning method.
9. The system of claim 6, wherein the language skill recommendation table includes a plurality of items of language skill information, sorted by usage rate.
10. The system of claim 6, wherein the language skill display module is further configured to display the corresponding language skill recommendation table by presenting the recommendation table of the scene as menu items with pull-down lists.
CN201910921533.6A 2019-09-27 2019-09-27 Method and system for realizing interaction task by adopting voice recognition mode Pending CN110706699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910921533.6A CN110706699A (en) 2019-09-27 2019-09-27 Method and system for realizing interaction task by adopting voice recognition mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910921533.6A CN110706699A (en) 2019-09-27 2019-09-27 Method and system for realizing interaction task by adopting voice recognition mode

Publications (1)

Publication Number Publication Date
CN110706699A true CN110706699A (en) 2020-01-17

Family

ID=69196609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910921533.6A Pending CN110706699A (en) 2019-09-27 2019-09-27 Method and system for realizing interaction task by adopting voice recognition mode

Country Status (1)

Country Link
CN (1) CN110706699A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107146622A (en) * 2017-06-16 2017-09-08 合肥美的智能科技有限公司 Refrigerator, voice interactive system, method, computer equipment, readable storage medium storing program for executing
CN107797984A (en) * 2017-09-11 2018-03-13 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109522083A (en) * 2018-11-27 2019-03-26 四川长虹电器股份有限公司 A kind of intelligent page response interactive system and method
CN110188163A (en) * 2019-04-13 2019-08-30 上海策友信息科技有限公司 Data intelligence processing system based on natural language
CN110209793A (en) * 2019-06-18 2019-09-06 佰聆数据股份有限公司 A method of for intelligent recognition text semantic

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593568A (en) * 2021-06-30 2021-11-02 北京新氧科技有限公司 Method, system, apparatus, device and storage medium for converting speech into text
CN113593568B (en) * 2021-06-30 2024-06-07 北京新氧科技有限公司 Method, system, device, equipment and storage medium for converting voice into text


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117