CN100473095C - A method for implementing speech interaction application scene - Google Patents

A method for implementing speech interaction application scene Download PDF

Info

Publication number
CN100473095C
CN100473095C CNB2004100011197A CN200410001119A
Authority
CN
China
Prior art keywords
scene
scenes
voicexml
node
combining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100011197A
Other languages
Chinese (zh)
Other versions
CN1558655A (en)
Inventor
孙文彦
张继勇
诸光
任文捷
陈庭玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CNB2004100011197A priority Critical patent/CN100473095C/en
Publication of CN1558655A publication Critical patent/CN1558655A/en
Application granted granted Critical
Publication of CN100473095C publication Critical patent/CN100473095C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a method for realizing a voice interaction application, comprising the steps of: defining a plurality of scenes, each of which corresponds to a combination of tags in VoiceXML (Voice Extensible Markup Language) that realizes a predetermined function; combining at least one of the plurality of scenes as required; obtaining VoiceXML tags based on the combined scenes; and generating the corresponding VoiceXML file according to the VoiceXML grammar. The invention increases the flexibility of jump decisions.

Description

Method for realizing voice interaction application scene
Technical Field
The invention relates to a method for designing interaction scenes for voicexml-based telephone voice interaction applications, using a voice interaction flow structure that combines the traditional IVR tree structure with a mesh structure.
Background
With the continuous maturing of voice application technology and the growing demand for intelligent systems, voice interaction application systems keep appearing, and voice interaction is now widely used in banking, stocks, public information, enterprise call centers, and similar fields. The W3C has formulated voicexml, a standard xml language for voice applications, but most current voicexml-based voice application platforms only provide a tag editing function. Some editing interfaces target the requirements of a voice browser: the design process follows the conventional way of using a browser and does not consider the real-time requirement of telephone voice interaction. Moreover, because the interface is designed around individual tags, there is no interaction scene definition compatible with the traditional IVR tree, so the tools are hard for flow customization personnel to use.
At present, IVR voice interaction applications are widely used in banking, stocks, public information, enterprise call centers, and similar fields, and services such as telephone stock inquiry and telephone banking are becoming familiar. As voice application technology matures and the demand for application intelligence grows, automatic voice interaction based on speech recognition will gradually replace traditional IVR voice interaction, and the pure IVR tree-structured flow design of the traditional technology will no longer meet the requirements of automatic voice interaction applications.
The conventional IVR tree structure has several disadvantages. Because it uses a tree of multi-level menus throughout, the user must interact many times to complete a task, so calls are long. Because only IVR menus are used, users easily get lost in the multi-level menus, so the automatic completion rate of calls is low. Some functions simply cannot be realized this way, such as quickly finding and locating a name or address in a large amount of data, which a multi-level IVR menu cannot do.
Meanwhile, although a fully meshed interaction flow design is flexible, convenient, and allows free jumping, it has obvious defects. Because the flows are discrete, jumps between them cannot be constrained, which easily causes deadlock. The interaction flow is complex to modify; duplicated functions among flow nodes are hard to detect, and some nodes may never be reached by users during interaction, creating 'islands' of flow nodes. A relatively large interaction flow has poor visibility. In addition, for flow customizers familiar with the IVR tree structure, a fully meshed interaction flow is unwieldy.
Disclosure of Invention
The present invention aims to overcome the above shortcomings of the prior art; to this end, it provides a method for designing telephone voice interaction scenes. Each node of the traditional IVR tree is an interaction scene with the user and is classified by the telephone operation function it realizes; all tags defined in voicexml can be subsumed into interaction scenes, each tag becoming an attribute of a scene.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
A method for implementing a voice interaction application scene comprises the following steps:
defining a plurality of scenes, wherein each scene corresponds to a combination of tags in VoiceXML that realizes a predetermined function;
combining at least one of the plurality of scenes as needed;
acquiring a tag of VoiceXML based on the combined scene;
and generating a corresponding VoiceXML file according to the VoiceXML grammar.
Optionally, each of the plurality of scenes includes its associated tags and the content of a speech recognition grammar file.
Preferably, the scene comprises at least one of: a recognition scene, a recording scene, a switching scene, and an on-hook scene.
Optionally, combining at least one of the plurality of scenes comprises: adding scenes in an IVR tree combined with a mesh structure; and/or deleting scenes in an IVR tree combined with a mesh structure.
Preferably, combining at least one of the plurality of scenes comprises: checking scene validity.
Optionally, the scene validity check comprises: selecting a scene; finding its parent node scene; checking whether the parent scene contains a jump to the scene; if yes, continuing to check the next scene; otherwise, the scene is invalid and the check exits.
Preferably, said combining at least one of said plurality of scenes comprises: selecting attributes of the scene, and/or a prompt set, and/or an instruction set, and/or an action set according to user requirements; and assembling them according to the VoiceXML syntax.
Optionally, the step of generating a corresponding VoiceXML file according to the VoiceXML syntax includes: parsing the combined scene into VoiceXML tags, interpreting the user's action flow based on a VoiceXML tag library, and automatically generating the corresponding VoiceXML file.
With the present invention, a specific application appears on the interface as an IVR tree, while jump relations are described by the attributes of the scenes. This increases the flexibility of jump decisions and makes the system convenient to use.
Drawings
FIG. 1 is a schematic diagram of a voice interaction flow combining the conventional IVR tree structure with a mesh structure according to the present invention;
FIGS. 2, 3, and 4 respectively depict the adding, deleting, and validity checking processes of the interaction flow nodes;
FIG. 5 is the main interface of a voice interaction application editing environment in accordance with the present invention;
fig. 6, 7, and 8 are a prompt set interface, an instruction set interface, and an action set interface, respectively.
Detailed Description
In order that those skilled in the art will better understand the present invention, the following detailed description of the present invention is provided in conjunction with the accompanying drawings and embodiments.
In a telephone voice interaction flow, a start node and an end node are defined to represent the start and end of the flow. Child nodes are added under parent nodes, and the relationship between them is recorded by a 'parent-child' attribute. Jumps between nodes (including between parents and children) are represented by actions. Each node has an action set recording all of its actions, i.e. under which condition to jump to which node. The system automatically generates the jump from a child node back to its parent, namely the 'return' action.
Thus, the hierarchical structure of the IVR tree is recorded in the parent-child relationships, while the action sets represent mesh-like cross-level jumps. The application flow shown on the interface is an IVR tree; free jumps between nodes are realized by its internal attributes, so the flow is in essence a mesh structure. The end user can traverse the flow through multi-level key-press menus, or speak a voice instruction to jump directly to the corresponding node. The combination of the IVR tree structure and the mesh structure is thus realized.
In the invention, the node types of the voice interaction flow are: recognition nodes, transfer nodes, recording nodes, on-hook nodes, and user-defined JSP nodes. A recognition node represents one play-and-recognize interaction scene, a transfer node realizes call transfer, a recording node realizes recording, and an on-hook node hangs up the call.
The hierarchy among nodes is described by the parent-child relationship: one parent node can have several child nodes, each child node has exactly one parent node, and a child node must be created under its parent. The creation of the interaction flow is thus an IVR tree generation process.
Besides the parent attribute, each node has another important attribute, the action set, composed of several actions; each action records a jump between nodes that occurs when a certain condition is met, for example "condition: the main menu result is pre-sale; jump node: pre-sale". Every child node, except the end nodes (the on-hook and transfer nodes), has a jump back to its parent, created by the system by default and named "return". Voice commands and key commands form part of the action conditions; personalized user information, such as whether the user is registered, is optional content of an action condition.
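As an illustration only (the class layout and names below are assumptions, not the patent's implementation), the node model just described — a parent-child attribute recording the IVR tree plus an action set recording conditional jumps — could be sketched as:

```python
class Node:
    """One interaction node: tree position via 'parent', mesh jumps via actions."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.actions = []  # (condition, target node name) pairs
        if parent is not None:
            parent.children.append(self)
            # the system generates the default jump back to the parent: "return"
            self.actions.append(("return", parent.name))

    def add_action(self, condition, target_name):
        """Add a jump, taken when the condition is met; free jumps use this too."""
        self.actions.append((condition, target_name))


main_menu = Node("main menu")                   # start node
pre_sale = Node("pre-sale", parent=main_menu)   # child node in the IVR tree
main_menu.add_action("key 1", "pre-sale")       # parent-to-child jump
pre_sale.add_action("voice: main menu", "main menu")  # free (mesh) jump
```

The parent-child links reproduce the tree, while the action lists carry the mesh of free jumps, matching the two-layer structure described above.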
An example of a voice interaction flow built by the above rules is shown in fig. 1.
In the figure, the main menu is the first scene after the user enters, and is the start node of the application; manual agents 1-3 realize call transfer and are the end nodes. The child nodes of the main menu are pre-sale, after-sale, registration, and complaint; pre-sale has two child nodes, home and commercial; the home node is the parent of manual agent 1, the commercial node is the parent of manual agent 2, and registration is the parent of manual agent 3. As shown in the figure, a jump from parent to child occurs when a certain key is pressed; for example, pressing 2 at the main menu jumps to the after-sale node. Except for the 3 transfer nodes, every child node has a 'return' jump to its parent; the return is defined by the flow customizer and can be the '*' key or a voice instruction.
In addition to the parent-child relationships, the figure shows free jumps between nodes: the main menu node can jump directly to the home node, the home node can jump to the main menu, the commercial node can jump to after-sale, and the complaint node jumps to manual agent 3. Free jumps between nodes are entirely decided by the flow customizer and may cross levels or stay within the same level. Through these free jumps, the nodes in effect form a mesh structure.
The whole interaction flow is created by adding nodes; when creation finishes, the system must check node validity to ensure the hierarchical relationship of the application flow holds. Free jumps can be added at any time by editing a node's action set.
Fig. 2, fig. 3, and fig. 4 respectively describe the processes of adding, deleting, and validity-checking interaction flow nodes.
adding a node comprises the steps of: selecting a father node; adding child nodes; editing the attribute of the child node, including adding free jump from the child node to other nodes; adding a jump from a father node to a child node; add the child node to the parent node's return.
Deleting a node comprises the steps of: deleting all child nodes of the node; deleting all jumps to those child nodes; deleting the node itself; and deleting all jumps to the node.
Node validity checking: select a node; find its parent node; check whether the parent's action set contains a jump to the node; if so, continue with the next node; otherwise, the node is invalid and the check exits.
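A minimal sketch of this validity check, under the assumption (illustrative only, not the patent's data format) that nodes are stored as records carrying a parent name and an action set:

```python
def check_validity(nodes):
    """Return the names of invalid nodes: a non-root node is invalid if its
    parent's action set contains no jump to it."""
    invalid = []
    for node in nodes.values():
        parent = node["parent"]
        if parent is None:
            continue  # the start node has no parent to check
        parent_jumps = {target for _cond, target in nodes[parent]["actions"]}
        if node["name"] not in parent_jumps:
            invalid.append(node["name"])
    return invalid


nodes = {
    "main menu": {"name": "main menu", "parent": None,
                  "actions": [("key 1", "pre-sale")]},
    "pre-sale": {"name": "pre-sale", "parent": "main menu", "actions": []},
    "after-sale": {"name": "after-sale", "parent": "main menu", "actions": []},
}
# "after-sale" is invalid here: the main menu's action set has no jump to it
```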
In the invention, to realize the voice interaction application scene method, interaction scenes are divided, according to the different telephone voice operations, into four types: recognition scenes, recording scenes, switching scenes, and on-hook scenes. According to the relation between the specific meaning of each voicexml tag and the interaction scenes, related tags are classified into a scene and serve as its attributes; in addition, the specific content of the speech recognition grammar file also serves as a scene attribute. On this basis, different graphical interfaces are designed for different scenes and become the tools with which users edit scene attributes. For example, since the grammar file is an attribute of the recognition scene, the voice interaction application editing environment provides a graphical interface for the user to edit this attribute.
The interaction scenes correspond to the flow nodes of the IVR tree and mesh structure; the scenes are organized according to the dialog flow design, and a specific application appears on the interface as an IVR tree. The attributes of the scenes describe the jump relations between the nodes.
In the invention, except for the on-hook scene, which the system sets by default, the scenes are created by flow customizers. The three created scene types share two common attributes, the node name and the parent node name, which describe the parent-child hierarchy of the IVR tree; the detailed scene descriptions below will not repeat these two attributes.
1. Recognition scene:
The functions are as follows: one interaction with a user is described, and the recognition scene is functionally divided and comprises two types of sub-scenes: playing the sub-scene and playing and identifying the sub-scene. The sub-scene is played to describe the system playing prompt, and certain action is carried out according to the current conditions after the playing is finished. The sub-scene description is played and identified by the system playing prompt words, the user input is waited, and after the user voice or key input, the system performs certain 'action' according to the current conditions. The difference between the two is that the latter involves the process of speech recognition. The "current condition" includes the recognition result of the current scene or the previous scene, the current value of the global variable in the system, and the like. The flow customizing personnel can freely select according to the needs of the personnel.
The attributes are as follows: the main attributes of the recognition scene comprise three categories of a prompt set, an instruction set and an action set.
Prompt set: the prompts to be played in the recognition scene; each prompt consists of two parts, a type and content. The prompt types are: wav file, TTS (text-to-speech) text, variable, and database query result. A variable prompt is formed by selecting a system global variable and represents its current value. A database query result prompt is associated with the current database query action. The user can select prompts of various types, and the selected prompts form the prompt set in the order chosen.
For example:
Type - Content
wav file - confirmation.wav
TTS text - "The product you need is"
variable - product name
If the current value of the variable 'product name' is 'home computer', the application plays the prompt as confirmation.wav followed by "The product you need is home computer".
Instruction set: describes the recognition grammar file used by the current interaction scene; each instruction corresponds to one or more grammars in the grammar file. An instruction consists of: an instruction name, voice commands, pinyin, and key commands. The instruction name is the main content used to judge the recognition result; the same instruction name can have different voice and key commands, but one voice command or key command can correspond to only one instruction name. After the user enters a voice command, the system automatically generates a pinyin list for the user to choose from.
For example: the instruction name is: home, voice command is: for home use, the key command is 1
The instruction name is: home, voice command is: household computer
Since the voice command "home", "home computer" and the key command "1" all correspond to the same instruction "home", the recognition of the three will return the same recognition result "home".
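This many-to-one mapping from commands to an instruction name can be sketched as a simple lookup table (illustrative names, not the patent's data format):

```python
# Each voice or key command maps to exactly one instruction name, while one
# instruction name may own several commands.
COMMANDS = {
    "home": "home",           # voice command
    "home computer": "home",  # second voice command, same instruction
    "1": "home",              # key command
    "commercial": "commercial",
    "2": "commercial",
}


def recognize(user_input):
    """Return the instruction name for a voice or key command, or None."""
    return COMMANDS.get(user_input)
```

Whichever of the three inputs arrives, the downstream action set only ever sees the single instruction name "home".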
Action set: describes a series of operations performed after the current scene finishes playing or recognition completes. Each action consists of action conditions and a specific operation. A condition means that the recognition result of the current or a previous scene, or a global variable, meets a certain criterion. In the specific operation, a prompt and variable assignments are optional, while the name of a jump node is mandatory: the execution of every recognition scene must end in a jump to some scene, so that the interaction continues and never 'stalls'. The terms are defined as above.
If the selected condition source is the current node, the condition content can also be 'no user input' or 'recognition rejected', and the operations for these conditions must likewise be set.
For example:
Action 1: condition: main menu = home
prompt: TTS text: "You chose the home computer"
variable assignment: var1 = home
jump node: home
This describes that when the current recognition result is "home", the system plays "You chose the home computer", assigns the value "home" to the variable var1, and finally jumps to the home scene.
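A hedged sketch of how such an action might be evaluated at run time (the record layout below is an assumption for illustration, not the patent's format):

```python
ACTIONS = [
    {"condition": ("main menu", "home"),   # (scene, recognition result)
     "prompt": "You chose the home computer",
     "assign": {"var1": "home"},
     "jump": "home"},
]


def run_actions(scene, result, variables):
    """Find the action whose condition matches the recognition result,
    apply its variable assignments, and return (prompt, jump target)."""
    for action in ACTIONS:
        if action["condition"] == (scene, result):
            variables.update(action["assign"])
            return action["prompt"], action["jump"]
    return None, None  # no match: the flow designer must avoid this 'stall'


variables = {}
prompt, jump = run_actions("main menu", "home", variables)
```

The mandatory jump target is what keeps the interaction moving: every matched action names the next scene.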
Correspondence to voicexml:
A recognition scene corresponds to the <field> and <block> tags of voicexml. The instruction set corresponds to the <grammar> tag, with added interface functions for writing and compiling grammar files. The prompt corresponds to the <prompt> tag. The action set corresponds to the <filled> and <catch> tags; the handling of the no-input and recognition-rejection events has been folded into the action set.
Combining the above descriptions, a simple voicexml example is shown below:
<form id="test">
  <field name="mainmenu">
    <grammar src="mainmenu.gram"/>
    <prompt>For home use, press 1; for commercial use, press 2</prompt>
    <catch event="noinput" count="1">
      <goto next="#mainmenu"/>
    </catch>
    <catch event="nomatch" count="1">
      <goto next="#mainmenu"/>
    </catch>
    <filled>
      <if cond="mainmenu == 'home'">
        <prompt>Home use.</prompt>
        <goto next="#home"/>
      </if>
    </filled>
  </field>
</form>
The voicexml above can equivalently be expressed by the following recognition scene definition:
Scene name: main menu
Instruction set: instruction name: home; voice command: home
Prompt set: TTS text: "For home use, press 1; for commercial use, press 2"
Action set: action 1: condition: main menu = home; jump node: home
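The translation from such a scene definition into voicexml tags could be sketched as follows. This is a simplified illustration (string templating, illustrative field names), not the patent's actual generator, and it covers only prompts and jump actions:

```python
def scene_to_voicexml(scene):
    """Render a recognition scene (name, prompt set, action set) as a
    voicexml <field>, roughly mirroring the example above."""
    lines = [f'<field name="{scene["name"]}">']
    for text in scene["prompts"]:
        lines.append(f"  <prompt>{text}</prompt>")
    lines.append("  <filled>")
    for value, target in scene["actions"]:
        # each action becomes a conditional jump on the recognition result
        lines.append(f'    <if cond="{scene["name"]} == \'{value}\'">')
        lines.append(f'      <goto next="#{target}"/>')
        lines.append("    </if>")
    lines.append("  </filled>")
    lines.append("</field>")
    return "\n".join(lines)


vxml = scene_to_voicexml({
    "name": "mainmenu",
    "prompts": ["For home use, press 1; for commercial use, press 2"],
    "actions": [("home", "home")],
})
```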
2. Recording scene:
the functions are as follows: and describing the playing prompt words in the recording scene, recording, and directly jumping to a certain interactive scene after recording.
The attributes are as follows: the recording scene is composed of a prompt language set and a skip node. The definition of the prompt is the same as above.
Corresponding voicexml: the recording scene corresponds to the < record > tag of voicexml.
3. Switching scenes:
the functions are as follows: the switching scene describes the operation of directly switching the telephone after the prompt is played, and is an end node of the interactive process.
The attributes are as follows: the switching scene is composed of a prompt language set and a switching telephone number.
Corresponding voicexml: the transit node corresponds to the < transfer > tag in voicexml.
4. On-hook scene:
the functions are as follows: the scenario that the telephone voice interaction application is actively hung up is described, and the scenario is a technical node of an interaction process.
The attributes are as follows: there are no special attributes.
Corresponding voicexml: corresponding to the < exit > tag in voicexml.
In addition, a global variable has name and value attributes; the definition and assignment of a global variable correspond to the <var> and <assign> tags of voicexml.
FIG. 5 shows the main interface of a voice interaction application editing environment designed according to the method of the present invention. Fig. 6, 7, and 8 are a prompt set interface, an instruction set interface, and an action set interface, respectively.
While the present invention has been described with respect to the embodiments, those skilled in the art will appreciate that there are numerous variations and permutations of the present invention without departing from the spirit of the invention, and it is intended that the appended claims cover such variations and modifications as fall within the true spirit of the invention.

Claims (10)

1. A method for implementing a voice interaction application scene, comprising the following steps:
defining a plurality of scenes, wherein each scene corresponds to a combination of tags in the Voice Extensible Markup Language (VoiceXML) that realizes a predetermined function;
combining at least one of the plurality of scenes according to requirements to obtain a combined scene;
acquiring a tag of VoiceXML based on the combined scene;
and generating a corresponding VoiceXML file according to the VoiceXML grammar.
2. The method of claim 1, wherein each of the plurality of scenes includes its associated tags and the content of a speech recognition grammar file.
3. The method of claim 2, wherein the scene comprises at least one of: a recognition scene, a recording scene, a switching scene, and an on-hook scene.
4. The method of claim 2 or 3, wherein combining at least one of the plurality of scenes comprises: adding scenes in an interactive voice response (IVR) tree combined with a mesh structure; and/or deleting scenes in an IVR tree combined with a mesh structure.
5. The method of claim 4, wherein combining at least one of the plurality of scenes comprises: checking scene validity.
6. The method of claim 5, wherein the scene validity check comprises: selecting a scene; finding its parent node scene; checking whether the parent scene contains a jump to the scene; if yes, continuing to check the next scene; otherwise, the scene is invalid and the check exits.
7. The method of claim 3, wherein said combining at least one of said plurality of scenes comprises: selecting attributes of the scene, and/or a prompt set, and/or an instruction set, and/or an action set according to user requirements; and assembling them according to the VoiceXML syntax.
8. The method of claim 7, wherein said combining at least one of said plurality of scenes comprises: combining play sub-scenes and play-and-recognize sub-scenes.
9. The method of claim 1, wherein the defining a plurality of scenes comprises: different graphical interfaces are defined for different scenes to facilitate human-computer interaction.
10. The method of claim 3, wherein generating a corresponding VoiceXML file according to the VoiceXML grammar comprises: parsing the combined scene into VoiceXML tags, interpreting the user's action flow based on a VoiceXML tag library, and automatically generating the corresponding VoiceXML file.
CNB2004100011197A 2004-01-20 2004-01-20 A method for implementing speech interaction application scene Expired - Fee Related CN100473095C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100011197A CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100011197A CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Publications (2)

Publication Number Publication Date
CN1558655A CN1558655A (en) 2004-12-29
CN100473095C true CN100473095C (en) 2009-03-25

Family

ID=34350569

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100011197A Expired - Fee Related CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Country Status (1)

Country Link
CN (1) CN100473095C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101527755B (en) * 2009-03-30 2011-07-13 中兴通讯股份有限公司 Voice interactive method based on VoiceXML movable termination and movable termination
CN101609673B (en) * 2009-07-09 2012-08-29 交通银行股份有限公司 User voice processing method based on telephone bank and server
CN102149059B (en) * 2010-02-09 2014-08-13 中兴通讯股份有限公司 Method and device for realizing call transfer via VXML (voice extensible markup language)
US8717915B2 (en) * 2010-05-25 2014-05-06 Microsoft Corporation Process-integrated tree view control for interactive voice response design
CN102830949A (en) * 2011-06-14 2012-12-19 镇江佳得信息技术有限公司 Method for realizing system voice navigation based on memory system (MS) Speech
CN102323920A (en) * 2011-06-24 2012-01-18 华南理工大学 Text message editing modification method
CN103078995A (en) * 2012-12-18 2013-05-01 苏州思必驰信息科技有限公司 Customizable individualized response method and system used in mobile terminal
CN104410637B (en) * 2014-11-28 2018-01-05 科大讯飞股份有限公司 The development system and method for interactive voice answering IVR visible process
CN105161095B (en) * 2015-07-29 2017-03-22 百度在线网络技术(北京)有限公司 Method and device for picture composition of speech recognition syntax tree

Also Published As

Publication number Publication date
CN1558655A (en) 2004-12-29

Similar Documents

Publication Publication Date Title
US8275384B2 (en) Social recommender system for generating dialogues based on similar prior dialogues from a group of users
KR940002325B1 (en) Method and apparatus for generating computer controlled interactive voice services
US8473488B2 (en) Voice operated, matrix-connected, artificially intelligent address book system
US7389213B2 (en) Dialogue flow interpreter development tool
JP4460305B2 (en) Operation method of spoken dialogue system
CN104410637B (en) The development system and method for interactive voice answering IVR visible process
CN101242452B (en) Method and system for automatic generation and provision of sound document
CN101138228A (en) Customisation of voicexml application
EP1598810A2 (en) System for conducting a dialogue
US6122345A (en) System and method for developing and processing automatic response unit (ARU) services
US20050165607A1 (en) System and method to disambiguate and clarify user intention in a spoken dialog system
CN101729694A (en) Method for allocating and running realization process of automatic service and system thereof
GB2407682A (en) Automated speech-enabled application creation
CN100473095C (en) A method for implementing speech interaction application scene
CN102263863A (en) Process-integrated tree view control for interactive voice response design
CN105376433A (en) Voice automatic revisit device, system and method
US8005202B2 (en) Automatic generation of a callflow statistics application for speech systems
CN102300007A (en) Flattening menu system for call center based on voice identification
CN101631262A (en) VoiceXML business integrated development system and realizing method thereof
CN110019716A (en) More wheel answering methods, terminal device and storage medium
CA2427512C (en) Dialogue flow interpreter development tool
CN115148212A (en) Voice interaction method, intelligent device and system
CN109408815A (en) Dictionary management method and system for voice dialogue platform
US20040217986A1 (en) Enhanced graphical development environment for controlling mixed initiative applications
CN112487170B (en) Man-machine interaction dialogue robot system facing scene configuration

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090325

Termination date: 20210120

CF01 Termination of patent right due to non-payment of annual fee