CN100473095C - A method for implementing speech interaction application scene - Google Patents

A method for implementing speech interaction application scene Download PDF

Info

Publication number
CN100473095C
CN100473095C CNB2004100011197A CN200410001119A
Authority
CN
China
Prior art keywords
scene
scenes
voicexml
node
combining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100011197A
Other languages
Chinese (zh)
Other versions
CN1558655A (en)
Inventor
孙文彦
张继勇
诸光
任文捷
陈庭玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CNB2004100011197A priority Critical patent/CN100473095C/en
Publication of CN1558655A publication Critical patent/CN1558655A/en
Application granted granted Critical
Publication of CN100473095C publication Critical patent/CN100473095C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a method for realizing a voice interaction application, comprising the steps of: defining a plurality of scenes, each of which corresponds to a combination of tags in VoiceXML (Voice Extensible Markup Language) that realizes a predetermined function; combining at least one of the plurality of scenes as required; obtaining VoiceXML tags based on the combined scenes; and generating the corresponding VoiceXML file according to the VoiceXML grammar. The invention increases the flexibility of jump decisions.

Description

Method for realizing voice interaction application scene
Technical Field
The invention relates to a method for designing interaction scenes for voicexml-based telephone voice interaction applications, using a voice interaction flow structure that combines the traditional IVR tree structure with a mesh structure.
Background
With the continuous maturing of voice application technology and the growing demand for intelligent systems, voice interaction application systems keep appearing, and voice interaction is now widely used in banking, stocks, public information, enterprise call centers, and similar fields. The W3C has formulated voicexml, a standard xml language for voice applications, but most current voicexml-based voice application platforms only provide a tag editing function. Some editing interfaces target the requirements of a voice browser: the design process follows the conventional way of using a browser and does not consider the real-time requirement of telephone voice interaction. Moreover, because the interface is designed around individual tags, there is no interaction scene definition compatible with the traditional IVR tree, so the tools are hard for flow customization personnel to use.
At present, IVR voice interaction applications are widely used in banking, stocks, public information, enterprise call centers, and similar fields, and services such as telephone stock inquiry and telephone banking are becoming familiar. As voice application technology matures and the demand for application intelligence grows, automatic voice interaction based on speech recognition will gradually replace traditional IVR voice interaction, and the pure IVR tree-structured flow design of the traditional technology will no longer meet the requirements of automatic voice interaction applications.
The conventional IVR tree structure has several disadvantages. Because it uses a tree of multi-level menus throughout, the user must interact many times to complete a task, so calls are long. Because only IVR menus are used, users easily get lost in the multi-level menus, so the automatic completion rate of calls is low. Some functions simply cannot be realized this way, such as quickly finding and locating a name or address in a large amount of data, which a multi-level IVR menu cannot do.
Meanwhile, although a fully meshed interaction flow design is flexible, convenient, and allows free jumping, it has obvious defects. Because the flows are discrete, jumps between them cannot be constrained, which easily causes deadlock. The interaction flow is complex to modify; duplicated functions among flow nodes are hard to detect, and some nodes may never be reached by users during interaction, creating 'islands' of flow nodes. A relatively large interaction flow has poor visibility. In addition, for flow customizers familiar with the IVR tree structure, a fully meshed interaction flow is unwieldy.
Disclosure of Invention
The present invention aims to overcome the above shortcomings of the prior art; to this end, it provides a method for designing telephone voice interaction scenes. Each node of the traditional IVR tree is an interaction scene with the user and is classified by the telephone operation function it realizes; all tags defined in voicexml can be subsumed into interaction scenes, each tag becoming an attribute of a scene.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
A method for implementing a voice interaction application scene comprises the following steps:
defining a plurality of scenes, wherein each scene corresponds to a combination of tags in VoiceXML that realizes a predetermined function;
combining at least one of the plurality of scenes as needed;
acquiring a tag of VoiceXML based on the combined scene;
and generating a corresponding VoiceXML file according to the VoiceXML grammar.
Optionally, each of the plurality of scenes includes its associated tags and the content of a speech recognition grammar file.
Preferably, the scene comprises at least one of: a recognition scene, a recording scene, a switching scene, and an on-hook scene.
Optionally, combining at least one of the plurality of scenes comprises: adding scenes in an IVR tree combined with a mesh structure; and/or deleting scenes in an IVR tree combined with a mesh structure.
Preferably, combining at least one of the plurality of scenes comprises: checking scene validity.
Optionally, the scene validity check comprises: selecting a scene; finding its parent node scene; checking whether the parent scene contains a jump to the scene; if yes, continuing to check the next scene; otherwise, the scene is invalid and the check exits.
Preferably, said combining at least one of said plurality of scenes comprises: selecting attributes of the scene, and/or a prompt set, and/or an instruction set, and/or an action set according to user requirements; and assembling them according to the VoiceXML syntax.
Optionally, the step of generating a corresponding VoiceXML file according to the VoiceXML syntax includes: parsing the combined scene into VoiceXML tags, interpreting the user's action flow based on a VoiceXML tag library, and automatically generating the corresponding VoiceXML file.
With the present invention, a specific application appears on the interface as an IVR tree, while jump relations are described by the attributes of the scenes. This increases the flexibility of jump decisions and makes the system convenient to use.
Drawings
FIG. 1 is a schematic diagram of a voice interaction flow combining the conventional IVR tree structure with a mesh structure according to the present invention;
FIGS. 2, 3, and 4 respectively depict the adding, deleting, and validity checking processes of the interaction flow nodes;
FIG. 5 is the main interface of a voice interaction application editing environment in accordance with the present invention;
fig. 6, 7, and 8 are a prompt set interface, an instruction set interface, and an action set interface, respectively.
Detailed Description
In order that those skilled in the art will better understand the present invention, the following detailed description of the present invention is provided in conjunction with the accompanying drawings and embodiments.
In a telephone voice interaction flow, a start node and an end node are defined to represent the start and end of the flow. Child nodes are added under parent nodes, and the relationship between them is recorded by a 'parent-child' attribute. Jumps between nodes (including between parents and children) are represented by actions. Each node has an action set recording all of its actions, i.e. under which condition to jump to which node. The system automatically generates the jump from a child node back to its parent, namely the 'return' action.
Thus, the hierarchical structure of the IVR tree is recorded in the parent-child relationships, while the action sets represent mesh-like cross-level jumps. The application flow shown on the interface is an IVR tree; free jumps between nodes are realized by its internal attributes, so the flow is in essence a mesh structure. The end user can traverse the flow through multi-level key-press menus, or speak a voice instruction to jump directly to the corresponding node. The combination of the IVR tree structure and the mesh structure is thus realized.
In the invention, the node types of the voice interaction flow are: recognition nodes, transfer nodes, recording nodes, on-hook nodes, and user-defined JSP nodes. A recognition node represents one play-and-recognize interaction scene, a transfer node realizes call transfer, a recording node realizes recording, and an on-hook node hangs up the call.
The hierarchy among nodes is described by the parent-child relationship: one parent node can have several child nodes, each child node has exactly one parent node, and a child node must be created under its parent. The creation of the interaction flow is thus an IVR tree generation process.
Besides the parent attribute, each node has another important attribute, the action set, composed of several actions; each action records a jump between nodes that occurs when a certain condition is met, for example "condition: the main menu result is pre-sale; jump node: pre-sale". Every child node, except the end nodes (the on-hook and transfer nodes), has a jump back to its parent, created by the system by default and named "return". Voice commands and key commands form part of the action conditions; personalized user information, such as whether the user is registered, is optional content of an action condition.
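As an illustration only (the class layout and names below are assumptions, not the patent's implementation), the node model just described — a parent-child attribute recording the IVR tree plus an action set recording conditional jumps — could be sketched as:

```python
class Node:
    """One interaction node: tree position via 'parent', mesh jumps via actions."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.actions = []  # (condition, target node name) pairs
        if parent is not None:
            parent.children.append(self)
            # the system generates the default jump back to the parent: "return"
            self.actions.append(("return", parent.name))

    def add_action(self, condition, target_name):
        """Add a jump, taken when the condition is met; free jumps use this too."""
        self.actions.append((condition, target_name))


main_menu = Node("main menu")                   # start node
pre_sale = Node("pre-sale", parent=main_menu)   # child node in the IVR tree
main_menu.add_action("key 1", "pre-sale")       # parent-to-child jump
pre_sale.add_action("voice: main menu", "main menu")  # free (mesh) jump
```

The parent-child links reproduce the tree, while the action lists carry the mesh of free jumps, matching the two-layer structure described above.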
An example of a voice interaction flow built by the above rules is shown in fig. 1.
In the figure, the main menu is the first scene after the user enters, and is the start node of the application; manual agents 1-3 realize call transfer and are the end nodes. The child nodes of the main menu are pre-sale, after-sale, registration, and complaint; pre-sale has two child nodes, home and commercial; the home node is the parent of manual agent 1, the commercial node is the parent of manual agent 2, and registration is the parent of manual agent 3. As shown in the figure, a jump from parent to child occurs when a certain key is pressed; for example, pressing 2 at the main menu jumps to the after-sale node. Except for the 3 transfer nodes, every child node has a 'return' jump to its parent; the return is defined by the flow customizer and can be the '*' key or a voice instruction.
In addition to the parent-child relationships, the figure shows free jumps between nodes: the main menu node can jump directly to the home node, the home node can jump to the main menu, the commercial node can jump to after-sale, and the complaint node jumps to manual agent 3. Free jumps between nodes are entirely decided by the flow customizer and may cross levels or stay within the same level. Through these free jumps, the nodes in effect form a mesh structure.
The whole interaction flow is created by adding nodes; when creation finishes, the system must check node validity to ensure the hierarchical relationship of the application flow holds. Free jumps can be added at any time by editing a node's action set.
Fig. 2, fig. 3, and fig. 4 respectively describe the processes of adding, deleting, and validity-checking interaction flow nodes.
adding a node comprises the steps of: selecting a father node; adding child nodes; editing the attribute of the child node, including adding free jump from the child node to other nodes; adding a jump from a father node to a child node; add the child node to the parent node's return.
Deleting a node comprises the steps of: deleting all child nodes of the node; deleting all jumps to those child nodes; deleting the node itself; and deleting all jumps to the node.
Node validity checking: select a node; find its parent node; check whether the parent's action set contains a jump to the node; if so, continue with the next node; otherwise, the node is invalid and the check exits.
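A minimal sketch of this validity check, under the assumption (illustrative only, not the patent's data format) that nodes are stored as records carrying a parent name and an action set:

```python
def check_validity(nodes):
    """Return the names of invalid nodes: a non-root node is invalid if its
    parent's action set contains no jump to it."""
    invalid = []
    for node in nodes.values():
        parent = node["parent"]
        if parent is None:
            continue  # the start node has no parent to check
        parent_jumps = {target for _cond, target in nodes[parent]["actions"]}
        if node["name"] not in parent_jumps:
            invalid.append(node["name"])
    return invalid


nodes = {
    "main menu": {"name": "main menu", "parent": None,
                  "actions": [("key 1", "pre-sale")]},
    "pre-sale": {"name": "pre-sale", "parent": "main menu", "actions": []},
    "after-sale": {"name": "after-sale", "parent": "main menu", "actions": []},
}
# "after-sale" is invalid here: the main menu's action set has no jump to it
```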
In the invention, to realize the voice interaction application scene method, interaction scenes are divided, according to the different telephone voice operations, into four types: recognition scenes, recording scenes, switching scenes, and on-hook scenes. According to the relation between the specific meaning of each voicexml tag and the interaction scenes, related tags are classified into a scene and serve as its attributes; in addition, the specific content of the speech recognition grammar file also serves as a scene attribute. On this basis, different graphical interfaces are designed for different scenes and become the tools with which users edit scene attributes. For example, since the grammar file is an attribute of the recognition scene, the voice interaction application editing environment provides a graphical interface for the user to edit this attribute.
The interaction scenes correspond to the flow nodes of the IVR tree and mesh structure; the scenes are organized according to the dialog flow design, and a specific application appears on the interface as an IVR tree. The attributes of the scenes describe the jump relations between the nodes.
In the invention, except for the on-hook scene, which the system sets by default, the scenes are created by flow customizers. The three created scene types share two common attributes, the node name and the parent node name, which describe the parent-child hierarchy of the IVR tree; the detailed scene descriptions below will not repeat these two attributes.
1. Recognition scene:
The functions are as follows: one interaction with a user is described, and the recognition scene is functionally divided and comprises two types of sub-scenes: playing the sub-scene and playing and identifying the sub-scene. The sub-scene is played to describe the system playing prompt, and certain action is carried out according to the current conditions after the playing is finished. The sub-scene description is played and identified by the system playing prompt words, the user input is waited, and after the user voice or key input, the system performs certain 'action' according to the current conditions. The difference between the two is that the latter involves the process of speech recognition. The "current condition" includes the recognition result of the current scene or the previous scene, the current value of the global variable in the system, and the like. The flow customizing personnel can freely select according to the needs of the personnel.
The attributes are as follows: the main attributes of the recognition scene comprise three categories of a prompt set, an instruction set and an action set.
Prompt set: the prompts to be played in the recognition scene; each prompt consists of two parts, a type and content. The prompt types are: wav file, TTS (text-to-speech) text, variable, and database query result. A variable prompt is formed by selecting a system global variable and represents its current value. A database query result prompt is associated with the current database query action. The user can select prompts of various types, and the selected prompts form the prompt set in the order chosen.
For example:
Type - Content
wav file - confirmation.wav
TTS text - "The product you need is"
variable - product name
If the current value of the variable 'product name' is 'home computer', the application plays the prompt as confirmation.wav followed by "The product you need is home computer".
Instruction set: describes the recognition grammar file used by the current interaction scene; each instruction corresponds to one or more grammars in the grammar file. An instruction consists of: an instruction name, voice commands, pinyin, and key commands. The instruction name is the main content used to judge the recognition result; the same instruction name can have different voice and key commands, but one voice command or key command can correspond to only one instruction name. After the user enters a voice command, the system automatically generates a pinyin list for the user to choose from.
For example: the instruction name is: home, voice command is: for home use, the key command is 1
The instruction name is: home, voice command is: household computer
Since the voice command "home", "home computer" and the key command "1" all correspond to the same instruction "home", the recognition of the three will return the same recognition result "home".
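This many-to-one mapping from commands to an instruction name can be sketched as a simple lookup table (illustrative names, not the patent's data format):

```python
# Each voice or key command maps to exactly one instruction name, while one
# instruction name may own several commands.
COMMANDS = {
    "home": "home",           # voice command
    "home computer": "home",  # second voice command, same instruction
    "1": "home",              # key command
    "commercial": "commercial",
    "2": "commercial",
}


def recognize(user_input):
    """Return the instruction name for a voice or key command, or None."""
    return COMMANDS.get(user_input)
```

Whichever of the three inputs arrives, the downstream action set only ever sees the single instruction name "home".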
Action set: describes a series of operations performed after the current scene finishes playing or recognition completes. Each action consists of action conditions and a specific operation. A condition means that the recognition result of the current or a previous scene, or a global variable, meets a certain criterion. In the specific operation, a prompt and variable assignments are optional, while the name of a jump node is mandatory: the execution of every recognition scene must end in a jump to some scene, so that the interaction continues and never 'stalls'. The terms are defined as above.
If the selected condition source is the current node, the condition content can also be 'no user input' or 'recognition rejected', and the operations for these conditions must likewise be set.
For example:
Action 1: condition: main menu = home
prompt: TTS text: "You chose the home computer"
variable assignment: var1 = home
jump node: home
This describes that when the current recognition result is "home", the system plays "You chose the home computer", assigns the value "home" to the variable var1, and finally jumps to the home scene.
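A hedged sketch of how such an action might be evaluated at run time (the record layout below is an assumption for illustration, not the patent's format):

```python
ACTIONS = [
    {"condition": ("main menu", "home"),   # (scene, recognition result)
     "prompt": "You chose the home computer",
     "assign": {"var1": "home"},
     "jump": "home"},
]


def run_actions(scene, result, variables):
    """Find the action whose condition matches the recognition result,
    apply its variable assignments, and return (prompt, jump target)."""
    for action in ACTIONS:
        if action["condition"] == (scene, result):
            variables.update(action["assign"])
            return action["prompt"], action["jump"]
    return None, None  # no match: the flow designer must avoid this 'stall'


variables = {}
prompt, jump = run_actions("main menu", "home", variables)
```

The mandatory jump target is what keeps the interaction moving: every matched action names the next scene.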
Correspondence to voicexml:
A recognition scene corresponds to the <field> and <block> tags of voicexml. The instruction set corresponds to the <grammar> tag, with added interface functions for writing and compiling grammar files. The prompt corresponds to the <prompt> tag. The action set corresponds to the <filled> and <catch> tags; the handling of the no-input and recognition-rejection events has been folded into the action set.
Combining the above descriptions, a simple voicexml example is shown below:
<form id="test">
  <field name="mainmenu">
    <grammar src="mainmenu.gram"/>
    <prompt>For home use, press 1; for commercial use, press 2</prompt>
    <catch event="noinput" count="1">
      <goto next="#mainmenu"/>
    </catch>
    <catch event="nomatch" count="1">
      <goto next="#mainmenu"/>
    </catch>
    <filled>
      <if cond="mainmenu == 'home'">
        <prompt>Home use.</prompt>
        <goto next="#home"/>
      </if>
    </filled>
  </field>
</form>
The voicexml above can equivalently be expressed by the following recognition scene definition:
Scene name: main menu
Instruction set: instruction name: home; voice command: home
Prompt set: TTS text: "For home use, press 1; for commercial use, press 2"
Action set: action 1: condition: main menu = home; jump node: home
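The translation from such a scene definition into voicexml tags could be sketched as follows. This is a simplified illustration (string templating, illustrative field names), not the patent's actual generator, and it covers only prompts and jump actions:

```python
def scene_to_voicexml(scene):
    """Render a recognition scene (name, prompt set, action set) as a
    voicexml <field>, roughly mirroring the example above."""
    lines = [f'<field name="{scene["name"]}">']
    for text in scene["prompts"]:
        lines.append(f"  <prompt>{text}</prompt>")
    lines.append("  <filled>")
    for value, target in scene["actions"]:
        # each action becomes a conditional jump on the recognition result
        lines.append(f'    <if cond="{scene["name"]} == \'{value}\'">')
        lines.append(f'      <goto next="#{target}"/>')
        lines.append("    </if>")
    lines.append("  </filled>")
    lines.append("</field>")
    return "\n".join(lines)


vxml = scene_to_voicexml({
    "name": "mainmenu",
    "prompts": ["For home use, press 1; for commercial use, press 2"],
    "actions": [("home", "home")],
})
```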
2. Recording scene:
the functions are as follows: and describing the playing prompt words in the recording scene, recording, and directly jumping to a certain interactive scene after recording.
The attributes are as follows: the recording scene is composed of a prompt language set and a skip node. The definition of the prompt is the same as above.
Corresponding voicexml: the recording scene corresponds to the < record > tag of voicexml.
3. Switching scenes:
the functions are as follows: the switching scene describes the operation of directly switching the telephone after the prompt is played, and is an end node of the interactive process.
The attributes are as follows: the switching scene is composed of a prompt language set and a switching telephone number.
Corresponding voicexml: the transit node corresponds to the < transfer > tag in voicexml.
4. On-hook scene:
the functions are as follows: the scenario that the telephone voice interaction application is actively hung up is described, and the scenario is a technical node of an interaction process.
The attributes are as follows: there are no special attributes.
Corresponding voicexml: corresponding to the < exit > tag in voicexml.
In addition, a global variable has name and value attributes; the definition and assignment of a global variable correspond to the <var> and <assign> tags of voicexml.
FIG. 5 shows the main interface of a voice interaction application editing environment designed according to the method of the present invention. Fig. 6, 7, and 8 are a prompt set interface, an instruction set interface, and an action set interface, respectively.
While the present invention has been described with respect to the embodiments, those skilled in the art will appreciate that there are numerous variations and permutations of the present invention without departing from the spirit of the invention, and it is intended that the appended claims cover such variations and modifications as fall within the true spirit of the invention.

Claims (10)

1. A method for implementing a voice interaction application scene, comprising the following steps:
defining a plurality of scenes, wherein each scene corresponds to a combination of tags in the Voice Extensible Markup Language (VoiceXML) that realizes a predetermined function;
combining at least one of the plurality of scenes according to requirements to obtain a combined scene;
acquiring a tag of VoiceXML based on the combined scene;
and generating a corresponding VoiceXML file according to the VoiceXML grammar.
2. The method of claim 1, wherein each of the plurality of scenes includes its associated tags and the content of a speech recognition grammar file.
3. The method of claim 2, wherein the scene comprises at least one of: a recognition scene, a recording scene, a switching scene, and an on-hook scene.
4. The method of claim 2 or 3, wherein combining at least one of the plurality of scenes comprises: adding scenes in an interactive voice response (IVR) tree combined with a mesh structure; and/or deleting scenes in an IVR tree combined with a mesh structure.
5. The method of claim 4, wherein combining at least one of the plurality of scenes comprises: checking scene validity.
6. The method of claim 5, wherein the scene validity check comprises: selecting a scene; finding its parent node scene; checking whether the parent scene contains a jump to the scene; if yes, continuing to check the next scene; otherwise, the scene is invalid and the check exits.
7. The method of claim 3, wherein said combining at least one of said plurality of scenes comprises: selecting attributes of the scene, and/or a prompt set, and/or an instruction set, and/or an action set according to user requirements; and assembling them according to the VoiceXML syntax.
8. The method of claim 7, wherein said combining at least one of said plurality of scenes comprises: combining play sub-scenes and play-and-recognize sub-scenes.
9. The method of claim 1, wherein the defining a plurality of scenes comprises: different graphical interfaces are defined for different scenes to facilitate human-computer interaction.
10. The method of claim 3, wherein generating a corresponding VoiceXML file according to the VoiceXML grammar comprises: parsing the combined scene into VoiceXML tags, interpreting the user's action flow based on a VoiceXML tag library, and automatically generating the corresponding VoiceXML file.
CNB2004100011197A 2004-01-20 2004-01-20 A method for implementing speech interaction application scene Expired - Fee Related CN100473095C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100011197A CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100011197A CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Publications (2)

Publication Number Publication Date
CN1558655A CN1558655A (en) 2004-12-29
CN100473095C true CN100473095C (en) 2009-03-25

Family

ID=34350569

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100011197A Expired - Fee Related CN100473095C (en) 2004-01-20 2004-01-20 A method for implementing speech interaction application scene

Country Status (1)

Country Link
CN (1) CN100473095C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101527755B (en) * 2009-03-30 2011-07-13 中兴通讯股份有限公司 Voice interactive method based on VoiceXML movable termination and movable termination
CN101609673B (en) * 2009-07-09 2012-08-29 交通银行股份有限公司 User voice processing method based on telephone bank and server
CN102149059B (en) * 2010-02-09 2014-08-13 中兴通讯股份有限公司 Method and device for realizing call transfer via VXML (voice extensible markup language)
US8717915B2 (en) * 2010-05-25 2014-05-06 Microsoft Corporation Process-integrated tree view control for interactive voice response design
CN102830949A (en) * 2011-06-14 2012-12-19 镇江佳得信息技术有限公司 Method for realizing system voice navigation based on memory system (MS) Speech
CN102323920A (en) * 2011-06-24 2012-01-18 华南理工大学 Text message editing modification method
CN103078995A (en) * 2012-12-18 2013-05-01 苏州思必驰信息科技有限公司 Customizable individualized response method and system used in mobile terminal
CN104410637B (en) * 2014-11-28 2018-01-05 科大讯飞股份有限公司 The development system and method for interactive voice answering IVR visible process
CN105161095B (en) * 2015-07-29 2017-03-22 百度在线网络技术(北京)有限公司 Method and device for picture composition of speech recognition syntax tree

Also Published As

Publication number Publication date
CN1558655A (en) 2004-12-29

Similar Documents

Publication Publication Date Title
US8275384B2 (en) Social recommender system for generating dialogues based on similar prior dialogues from a group of users
KR940002325B1 (en) Method and apparatus for generating computer controlled interactive voice services
US8473488B2 (en) Voice operated, matrix-connected, artificially intelligent address book system
US7389213B2 (en) Dialogue flow interpreter development tool
JP4460305B2 (en) Operation method of spoken dialogue system
CN104410637B (en) The development system and method for interactive voice answering IVR visible process
CN101242452B (en) Method and system for automatic generation and provision of sound document
CN101138228A (en) Customisation of voicexml application
EP1598810A2 (en) System for conducting a dialogue
US6122345A (en) System and method for developing and processing automatic response unit (ARU) services
US20050165607A1 (en) System and method to disambiguate and clarify user intention in a spoken dialog system
CN101729694A (en) Method for allocating and running realization process of automatic service and system thereof
GB2407682A (en) Automated speech-enabled application creation
CN100473095C (en) A method for implementing speech interaction application scene
CN102263863A (en) Process-integrated tree view control for interactive voice response design
CN105376433A (en) Voice automatic revisit device, system and method
US8005202B2 (en) Automatic generation of a callflow statistics application for speech systems
CN102300007A (en) Flattening menu system for call center based on voice identification
CN101631262A (en) VoiceXML business integrated development system and realizing method thereof
CN110019716A (en) More wheel answering methods, terminal device and storage medium
CA2427512C (en) Dialogue flow interpreter development tool
CN115148212A (en) Voice interaction method, intelligent device and system
CN109408815A (en) Dictionary management method and system for voice dialogue platform
US20040217986A1 (en) Enhanced graphical development environment for controlling mixed initiative applications
CN112487170B (en) Man-machine interaction dialogue robot system facing scene configuration

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090325

Termination date: 20210120

CF01 Termination of patent right due to non-payment of annual fee