CN113555015A - Voice interaction method, voice interaction device, electronic device and storage medium - Google Patents


Info

Publication number
CN113555015A
Authority
CN
China
Prior art keywords
voice
skill
server
voice request
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010329337.2A
Other languages
Chinese (zh)
Inventor
傅迪
徐春霞
陈晨
钱露
陈振涛
杨晓彬
张黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd and Shanghai Xiaodu Technology Co Ltd
Priority to CN202010329337.2A
Publication of CN113555015A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 — Speech to text systems
    • G10L 2015/223 — Execution procedure of a spoken command
    • G10L 2015/225 — Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice interaction method, a voice interaction device, an electronic device and a storage medium, relating to the field of intelligent voice interaction. The method comprises the following steps: sending a second voice request to a second server under the condition that the voice interaction device provides a first skill, the second voice request comprising a voice request for requesting a second skill; and receiving a second control instruction, the second control instruction being generated and fed back by the second server according to the second voice request. According to the embodiments of the application, the switching operation of the voice interaction device among different skills can be simplified, and the user experience improved.

Description

Voice interaction method, voice interaction device, electronic device and storage medium
Technical Field
The application relates to the field of artificial intelligence, in particular to the field of intelligent voice interaction.
Background
Voice interaction devices have entered more and more households, and the prior art supports voice interaction for some skills, such as querying the weather or the time. Some voice interaction devices, especially those with a screen, can also have application software installed, and the cloud server corresponding to the application provides the corresponding skill for the voice interaction device. In a scenario where the voice interaction device has such a skill active, a user who wants to use another skill must first exit the application corresponding to the current skill and then issue a voice request for the other skill to the voice interaction device.
For example, a voice interaction device has a shopping application installed, which is currently open and providing the shopping skill. If the user now wants to check the weather, the user must first exit the shopping skill and then issue a voice request for checking the weather to the voice interaction device, whereupon the cloud server of the voice interaction device itself provides the weather-checking skill.
Therefore, in the prior art, switching the voice interaction device between some skills is cumbersome, which affects the user experience.
Disclosure of Invention
Embodiments of the present application provide a voice interaction method, a voice interaction device, an electronic device, and a storage medium, so as to solve one or more technical problems in the prior art.
In a first aspect, the present application provides a voice interaction method, including:
sending a second voice request to a second server under the condition that the voice interaction device provides the first skill; the second voice request comprises a voice request for requesting a second skill;
receiving a second control instruction; and the second control instruction is generated and fed back by the second server according to the second voice request.
With the voice interaction method provided by the embodiments of the application, switching of the voice interaction device among different skills can be controlled by voice, which simplifies the switching operation and improves the user experience.
In a second aspect, the present application provides a voice interaction method, including:
under the condition that the voice interaction equipment provides the first skill, receiving a second voice request sent by the voice interaction equipment; the second voice request comprises a voice request for requesting a second skill;
generating a second control instruction according to the second voice request;
and feeding back the second control instruction to the voice interaction equipment.
In a third aspect, the present application provides a voice interaction device, including:
the request sending module is used for sending a second voice request to the second server under the condition of providing the first skill; the second voice request comprises a voice request for requesting a second skill;
the receiving module is used for receiving a second control instruction; and the second control instruction is generated and fed back by the second server according to the second voice request.
In a fourth aspect, the present application provides a server, comprising:
the request receiving module is used for receiving a second voice request sent by the voice interaction equipment under the condition that the voice interaction equipment provides a first skill; the second voice request comprises a voice request for requesting a second skill;
the instruction generating module is used for generating a second control instruction according to the second voice request;
and the instruction sending module is used for feeding back the second control instruction to the voice interaction equipment.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram of an application system of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a first flowchart of an implementation of a voice interaction method according to an embodiment of the present application;
FIG. 3 is a second flowchart of an implementation of a voice interaction method according to an embodiment of the present application;
FIG. 4 is a third flowchart of an implementation of a voice interaction method according to an embodiment of the present application;
FIG. 5 is a first schematic diagram illustrating information transmission in a voice interaction method according to an embodiment of the present application;
FIG. 6 is a second schematic diagram illustrating information transmission in a voice interaction method according to an embodiment of the present application;
FIG. 7 is a fourth flowchart of an implementation of a voice interaction method according to an embodiment of the present application;
FIG. 8 is a fifth flowchart of an implementation of a voice interaction method according to an embodiment of the present application;
FIG. 9 is a first schematic structural diagram of a voice interaction device according to an embodiment of the present application;
FIG. 10 is a second schematic structural diagram of a voice interaction device according to an embodiment of the present application;
FIG. 11 is a first schematic structural diagram of a server according to an embodiment of the present application;
FIG. 12 is a second schematic structural diagram of a server according to an embodiment of the present application;
FIG. 13 is a block diagram of an electronic device for implementing a voice interaction method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the application provides a voice interaction method, which can be applied to a voice interaction device, in particular to a voice interaction device with a screen.
In order to implement the voice interaction method of the embodiments of the present application, an embodiment of the present application provides a voice interaction system, which includes a voice interaction device and two cloud servers (hereinafter referred to as servers). The voice interaction device has an application installed that can provide a corresponding skill (hereinafter referred to as the first skill), for example a shopping application that provides a shopping skill to the user. Of the two cloud servers, the first server is the server corresponding to the application and supports the first skill; the second server is the server of the voice interaction device itself and provides other skills (hereinafter referred to as second skills) for the user, such as playing video, playing audio, and checking the weather.
Fig. 1 is a schematic diagram of an application system of a voice interaction method according to an embodiment of the present application. As shown in fig. 1, the system includes a voice interaction device, a first server, and a second server. The second server analyzes the voice request sent by the voice interaction device and provides the second skill; when the analysis result of the voice request corresponds to the first skill, the second server returns the analysis result to the voice interaction device, and the voice interaction device then requests the first skill from the first server. The following embodiments describe in detail the method as applied to each device/server.
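The routing behavior of the second server described above can be sketched in Python as follows. This is a minimal illustrative sketch, not the patent's implementation: the names (`parse_voice`, `SECOND_SKILLS`, `handle_voice_request`) and the toy keyword matching are assumptions for illustration only.

```python
# Illustrative sketch of the second server's routing decision.
# All names and the toy keyword "parser" are assumptions, not from the patent.

SECOND_SKILLS = {"weather", "music", "video"}  # skills the second server serves itself

def parse_voice(text: str) -> dict:
    """Stand-in for speech recognition plus intent parsing."""
    for skill in SECOND_SKILLS:
        if skill in text:
            return {"skill": skill, "query": text}
    return {"skill": "shopping", "query": text}  # assume an app-provided first skill

def handle_voice_request(text: str) -> dict:
    """Second-server routing: serve second skills directly; for a first
    skill, return the analysis result for the device to forward to the
    first (application) server."""
    result = parse_voice(text)
    if result["skill"] in SECOND_SKILLS:
        return {"type": "control_instruction", "skill": result["skill"]}
    return {"type": "analysis_result", "analysis_result": result}
```

Under this sketch, a second-skill request is answered with a control instruction directly, while a first-skill request yields only the analysis result, matching the two paths shown in fig. 1.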
The embodiment of the application provides a voice interaction method, which can be applied to a voice interaction device, in particular to a voice interaction device with a screen. Fig. 2 is a first flowchart of an implementation of a voice interaction method according to an embodiment of the present application, including the following steps:
S201: sending a second voice request to a second server under the condition that the voice interaction device provides the first skill; the second voice request comprises a voice request for requesting a second skill;
S202: receiving a second control instruction; the second control instruction is generated and fed back by the second server according to the second voice request.
Optionally, on the server side, the second server parses the second voice request. When it determines from the analysis result that the second voice request is for requesting a second skill, the second server generates a corresponding second control instruction according to the analysis result, so as to provide the second skill.
For example, the first skill is a shopping skill and the second skill is a music-playing skill. In a scenario where the voice interaction device has the shopping application open and is providing the shopping skill, the user issues a voice request such as "play some music", and the voice interaction device sends the voice request to the second server. After parsing the voice request and finding that it is for requesting the music-playing skill, the second server generates a control instruction corresponding to the music-playing skill and issues the control instruction, together with the related broadcast speech, to the voice interaction device.
As another example, the first skill is a shopping skill and the second skill is a video-playing skill. In the scenario where the voice interaction device has the shopping application open and is providing the shopping skill, the user issues a voice request such as "play an award-winning movie", and the voice interaction device sends the voice request to the second server. After parsing the voice request and finding that it is for requesting the video-playing skill, the second server generates a corresponding control instruction and issues it, together with the related broadcast speech, to the voice interaction device.
Therefore, with the voice interaction approach provided by the embodiments of the present application, switching among different skills can be performed by voice control, without manual operation by the user. The switching operation is thus simplified, and the user experience is improved.
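Steps S201 and S202 on the device side might look like the following sketch. The class and field names are hypothetical, and the second server is modeled as a plain callable for illustration; the patent does not specify this interface.

```python
# Device-side sketch of S201/S202; all names are hypothetical illustrations.

class VoiceInteractionDevice:
    def __init__(self, second_server):
        self.second_server = second_server  # callable standing in for the network
        self.current_skill = "shopping"     # a first skill is currently active

    def on_voice_request(self, text: str) -> dict:
        # S201: send the voice request to the second server even while
        # the first skill is being provided.
        reply = self.second_server(text)
        # S202: receive the second control instruction and switch skills
        # without requiring the user to exit the current application.
        if reply.get("type") == "control_instruction":
            self.current_skill = reply["skill"]
        return reply

# Usage with a stub server that always answers with a music instruction:
device = VoiceInteractionDevice(
    lambda t: {"type": "control_instruction", "skill": "music"})
device.on_voice_request("play some music")
```

The point of the sketch is that the device never has to tear down the first skill itself; the skill change rides entirely on the instruction fed back by the second server.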
Fig. 3 is a second flowchart of an implementation of a voice interaction method according to an embodiment of the present application. In some embodiments, as shown in fig. 3, after step S202 the method further includes:
S303: executing the second control instruction to provide the second skill.
In some embodiments, as shown in fig. 4, the voice interaction method according to the embodiment of the present application further includes:
S401: under the condition that the voice interaction equipment provides the first skill, sending a first voice request to a second server; the first voice request comprises a voice request for the first skill;
S402: receiving an analysis result for the first voice request; the analysis result is generated and fed back by the second server according to the first voice request;
S403: sending the analysis result to a first server, wherein the first server is a server supporting the first skill;
S404: receiving a first control instruction corresponding to the analysis result; the first control instruction is generated and fed back by the first server according to the analysis result.
Therefore, in this way, control of the skill currently provided by the voice interaction device can also be achieved.
Optionally, the first skill includes a shopping skill, and may also include other skills provided by the corresponding application. The embodiments of the application can thus realize voice-controlled switching from the shopping skill to other skills.
Fig. 5 is a first schematic diagram illustrating information transmission in a voice interaction method according to an embodiment of the present application. As shown in fig. 5, the method comprises the following steps:
S501: the voice interaction equipment currently provides the first skill and sends a second voice request to a second server; the second voice request is for requesting a second skill.
S502: and the second server analyzes the second voice request, determines that the second voice request corresponds to a second skill, and generates a second control instruction corresponding to the second skill according to an analysis result.
S503: and the second server sends a second control instruction to the voice interaction equipment.
Fig. 6 is a second schematic diagram illustrating information transmission in a voice interaction method according to an embodiment of the present application. As shown in fig. 6, the method comprises the following steps:
S601: the voice interaction equipment currently provides a first skill and sends a first voice request to a second server; the first voice request is for requesting a first skill.
S602: the second server analyzes the first voice request to obtain an analysis result; and determining a first skill corresponding to the first voice request according to the analysis result.
S603: and the second server sends the analysis result to the voice interaction equipment.
S604: and the voice interaction equipment sends the analysis result to the first server.
S605: and the first server generates a first control instruction corresponding to the first skill according to the analysis result.
S606: and the first server sends the first control instruction to the voice interaction equipment.
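The fig. 6 sequence (S601 through S606) can be condensed into the following end-to-end sketch. Both server functions and every field name here are hypothetical stand-ins chosen for illustration; the patent specifies only the message order, not the payloads.

```python
# End-to-end sketch of the first-skill flow (S601-S606); all names hypothetical.

def second_server_parse(text: str) -> dict:
    """S602: the second server parses the first voice request and determines
    that it corresponds to the first (application) skill."""
    return {"skill": "shopping", "query": text}

def first_server_instruction(analysis_result: dict) -> dict:
    """S605: the first (application) server generates a control instruction
    from the analysis result."""
    return {"type": "control_instruction",
            "action": "search",
            "query": analysis_result["query"]}

def device_first_skill_flow(text: str) -> dict:
    """Device side: S601 send the request, S603 receive the analysis result,
    S604 forward it to the first server, S606 receive the instruction."""
    analysis_result = second_server_parse(text)       # S601-S603
    return first_server_instruction(analysis_result)  # S604-S606
```

Note the division of labor: the second server does all the parsing, while the first server only turns a ready-made analysis result into an application-specific instruction.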
The embodiment of the application also provides a voice interaction method, which can be applied to the second server. Fig. 7 is a fourth flowchart of an implementation of a voice interaction method according to an embodiment of the present application, including the following steps:
S701: under the condition that the voice interaction equipment provides the first skill, receiving a second voice request sent by the voice interaction equipment; the second voice request comprises a voice request for requesting a second skill;
S702: generating a second control instruction according to the second voice request;
S703: and feeding back the second control instruction to the voice interaction equipment.
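On the second server, the S701 to S703 handling could be sketched as below. Modeling the feedback as the handler's return value, and every name in the sketch, are assumptions for illustration rather than the patent's interface.

```python
# Sketch of the second server's S701-S703 handling; names are illustrative.

def on_device_request(voice_request: dict) -> dict:
    """S701: receive the second voice request sent while a first skill is
    active; S702: generate a second control instruction from it; S703: feed
    the instruction back (modeled here as the return value)."""
    skill = voice_request.get("requested_skill", "music")
    return {"type": "control_instruction",
            "skill": skill,
            "speech": f"Switching to {skill}."}  # the related broadcast speech
```

A real deployment would return the instruction over the device's network session rather than as a function result, but the three-step shape is the same.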
In some embodiments, as shown in fig. 8, the voice interaction method according to the embodiment of the present application further includes:
S801: under the condition that the voice interaction equipment provides the first skill, receiving a first voice request sent by the voice interaction equipment; the first voice request comprises a voice request for the first skill;
S802: generating a corresponding analysis result according to the first voice request;
S803: and feeding back the analysis result to the voice interaction equipment.
The embodiment of the application also provides voice interaction equipment. Fig. 9 is a first schematic structural diagram of a voice interaction device according to an embodiment of the present application, including:
a request sending module 901, configured to send a second voice request to a second server when the voice interaction device provides the first skill; the second voice request comprises a voice request for requesting a second skill;
a receiving module 902, configured to receive a second control instruction; and the second control instruction is generated and fed back by the second server according to the second voice request.
Fig. 10 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application. As shown in fig. 10, in some embodiments, the voice interaction device further includes:
an executing module 1003, configured to execute the second control instruction to provide the second skill.
As shown in fig. 10, in some embodiments, the voice interaction device further includes: a parsing result transmitting module 1004;
the request sending module 901 is further configured to send a first voice request to the second server when the voice interaction device provides the first skill; the first voice request comprises a voice request for the first skill;
the receiving module 902 is further configured to receive a parsing result for the first voice request; the second server generates and feeds back the analysis result according to the first voice request;
a parsing result sending module 1004, configured to send the parsing result to a first server, where the first server is a server supporting the first skill;
the receiving module 902 is further configured to receive a first control instruction corresponding to the parsing result; and the first control instruction is generated and fed back by the first server according to the analysis result.
Optionally, the first skill comprises a shopping skill.
The functions of each module in each voice interaction device in the embodiment of the present application may refer to the corresponding description in the above method, and are not described herein again.
The embodiment of the present application further provides a server, which may be the second server. Fig. 11 is a first schematic structural diagram of a server according to an embodiment of the present application, including:
a request receiving module 1101, configured to receive a second voice request sent by a voice interaction device when the voice interaction device provides a first skill; the second voice request comprises a voice request for requesting a second skill;
an instruction generating module 1102, configured to generate a second control instruction according to the second voice request;
an instruction sending module 1103, configured to feed back the second control instruction to the voice interaction device.
As shown in fig. 12, in some embodiments, the server further includes: an analysis result generation module 1204 and an analysis result feedback module 1205;
the request receiving module 1101 is further configured to receive a first voice request sent by a voice interaction device when the voice interaction device provides a first skill; the first voice request comprises a voice request for the first skill;
an analysis result generation module 1204, configured to generate a corresponding analysis result according to the first voice request;
the analysis result feedback module 1205 is configured to feed back the analysis result to the voice interaction device.
The functions of each module in each server in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 13 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 13, the electronic apparatus includes: one or more processors 1310, a memory 1320, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). One processor 1310 is illustrated in fig. 13.
The memory 1320 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so as to cause the at least one processor to perform the voice interaction method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the voice interaction method provided herein.
Memory 1320, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., request sending module 901 and receiving module 902 shown in fig. 9) corresponding to the method of voice interaction in the embodiments of the present application. The processor 1310 executes various functional applications of the server and data processing, i.e., a method of implementing voice interaction in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 1320.
The memory 1320 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice-interactive electronic device, and the like. Further, the memory 1320 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1320 optionally includes memory located remotely from the processor 1310, and such remote memory may be connected to the voice-interactive electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of voice interaction may further comprise: an input device 1330 and an output device 1340. The processor 1310, the memory 1320, the input device 1330, and the output device 1340 may be connected by a bus or other means, such as by a bus in FIG. 13.
The input device 1330 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the voice-interactive electronic apparatus, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, etc. The output devices 1340 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method of voice interaction, comprising:
sending a second voice request to a second server while a voice interaction device is providing a first skill, wherein the second voice request comprises a voice request for a second skill; and
receiving a second control instruction, wherein the second control instruction is generated and fed back by the second server according to the second voice request.
2. The method of claim 1, further comprising:
executing the second control instruction to provide the second skill.
3. The method of claim 1 or 2, further comprising:
sending a first voice request to the second server while the voice interaction device is providing the first skill, wherein the first voice request comprises a voice request for the first skill;
receiving a parsing result for the first voice request, wherein the parsing result is generated and fed back by the second server according to the first voice request;
sending the parsing result to a first server, wherein the first server is a server supporting the first skill; and
receiving a first control instruction corresponding to the parsing result, wherein the first control instruction is generated and fed back by the first server according to the parsing result.
4. The method of claim 1 or 2, wherein the first skill comprises a shopping-type skill.
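Purely as an illustration, the device-side flow of claims 1 to 4 can be read as the following sketch. All class, function, and field names here are hypothetical and do not appear in the patent; the routing logic is inferred from the claim language:

```python
from dataclasses import dataclass

# Illustrative sketch of the device-side flow of claims 1-4.
# All names (VoiceRequest, SecondServer, etc.) are hypothetical.

@dataclass
class VoiceRequest:
    text: str
    target: str  # "first_skill" or "second_skill"

class SecondServer:
    """Parses every request; serves the second skill directly (claims 1 and 3)."""
    def parse(self, request):
        return {"intent": request.target, "slots": request.text.split()}

    def handle_second_request(self, request):
        # Claim 1: generate and feed back a second control instruction.
        return {"action": "provide_second_skill", "payload": request.text}

class FirstServer:
    """Supports the first skill; consumes parsing results (claim 3)."""
    def handle_parsed(self, parsed):
        return {"action": "provide_first_skill", "payload": parsed["slots"]}

class VoiceInteractionDevice:
    def __init__(self, first_server, second_server):
        self.first_server = first_server
        self.second_server = second_server

    def handle(self, request):
        # The device is assumed to be providing the first skill
        # (e.g., a shopping-type skill, claim 4) when a request arrives.
        if request.target == "second_skill":
            # Claims 1-2: the second server answers directly with a
            # control instruction, which the device then executes.
            return self.second_server.handle_second_request(request)
        # Claim 3: the second server only parses the request; the
        # parsing result is forwarded to the first server, which
        # generates the first control instruction.
        parsed = self.second_server.parse(request)
        return self.first_server.handle_parsed(parsed)

device = VoiceInteractionDevice(FirstServer(), SecondServer())
first = device.handle(VoiceRequest("add milk to cart", "first_skill"))
second = device.handle(VoiceRequest("play some music", "second_skill"))
```

The point of the split is that the second server sits in front of all traffic: it either resolves a second-skill request itself or hands a first-skill request onward as a parsing result.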
5. A method of voice interaction, comprising:
receiving a second voice request sent by a voice interaction device while the voice interaction device is providing a first skill, wherein the second voice request comprises a voice request for a second skill;
generating a second control instruction according to the second voice request; and
feeding back the second control instruction to the voice interaction device.
6. The method of claim 5, further comprising:
receiving a first voice request sent by the voice interaction device while the voice interaction device is providing the first skill, wherein the first voice request comprises a voice request for the first skill;
generating a corresponding parsing result according to the first voice request; and
feeding back the parsing result to the voice interaction device.
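Seen from the server side, claims 5 and 6 amount to a simple dispatch: while the device is providing the first skill, a second-skill request yields a control instruction, whereas a first-skill request yields only a parsing result. A minimal sketch under that reading, with hypothetical function names:

```python
# Hypothetical sketch of the second server's dispatch in claims 5-6.

def generate_control_instruction(text):
    # Claim 5: for a second-skill request, generate a second control
    # instruction and feed it back to the voice interaction device.
    return {"kind": "control_instruction",
            "action": "provide_second_skill",
            "utterance": text}

def generate_parsing_result(text):
    # Claim 6: for a first-skill request, generate a parsing result and
    # feed it back; the device forwards it to the first server.
    return {"kind": "parsing_result",
            "intent": "first_skill",
            "tokens": text.split()}

def handle_voice_request(text, target):
    # The device is assumed to be providing the first skill here.
    if target == "second_skill":
        return generate_control_instruction(text)
    return generate_parsing_result(text)
```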
7. A voice interaction device, comprising:
a request sending module configured to send a second voice request to a second server while the voice interaction device is providing a first skill, wherein the second voice request comprises a voice request for a second skill; and
a receiving module configured to receive a second control instruction, wherein the second control instruction is generated and fed back by the second server according to the second voice request.
8. The apparatus of claim 7, further comprising:
an execution module configured to execute the second control instruction to provide the second skill.
9. The apparatus of claim 7 or 8, further comprising a parsing result sending module, wherein:
the request sending module is further configured to send a first voice request to the second server while the voice interaction device is providing the first skill, wherein the first voice request comprises a voice request for the first skill;
the receiving module is further configured to receive a parsing result for the first voice request, wherein the parsing result is generated and fed back by the second server according to the first voice request;
the parsing result sending module is configured to send the parsing result to a first server, wherein the first server is a server supporting the first skill; and
the receiving module is further configured to receive a first control instruction corresponding to the parsing result, wherein the first control instruction is generated and fed back by the first server according to the parsing result.
10. The apparatus of claim 7 or 8, wherein the first skill comprises a shopping-type skill.
11. A server, comprising:
a request receiving module configured to receive a second voice request sent by a voice interaction device while the voice interaction device is providing a first skill, wherein the second voice request comprises a voice request for a second skill;
an instruction generating module configured to generate a second control instruction according to the second voice request; and
an instruction sending module configured to feed back the second control instruction to the voice interaction device.
12. The server of claim 11, further comprising a parsing result generation module and a parsing result feedback module, wherein:
the request receiving module is further configured to receive a first voice request sent by the voice interaction device while the voice interaction device is providing the first skill, wherein the first voice request comprises a voice request for the first skill;
the parsing result generation module is configured to generate a corresponding parsing result according to the first voice request; and
the parsing result feedback module is configured to feed back the parsing result to the voice interaction device.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
CN202010329337.2A 2020-04-23 2020-04-23 Voice interaction method, voice interaction device, electronic device and storage medium Pending CN113555015A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010329337.2A CN113555015A (en) 2020-04-23 2020-04-23 Voice interaction method, voice interaction device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113555015A 2021-10-26

Family

ID=78101171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010329337.2A Pending CN113555015A (en) 2020-04-23 2020-04-23 Voice interaction method, voice interaction device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113555015A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545208A (en) * 2018-11-19 2019-03-29 Gree Electric Appliances Inc of Zhuhai Voice control method and device, storage medium and electronic device
CN110111787A (en) * 2019-04-30 2019-08-09 Huawei Technologies Co Ltd Semantic parsing method and server
CN110473537A (en) * 2019-08-22 2019-11-19 Baidu Online Network Technology Beijing Co Ltd Voice skill control method, apparatus, device and storage medium
CN110517691A (en) * 2019-08-30 2019-11-29 AISpeech Co Ltd Voice transparent transmission method and system for a voice dialogue platform
CN110674338A (en) * 2019-09-27 2020-01-10 Baidu Online Network Technology Beijing Co Ltd Voice skill recommendation method, device, equipment and storage medium
CN110718221A (en) * 2019-10-08 2020-01-21 Baidu Online Network Technology Beijing Co Ltd Voice skill control method, voice device, client and server
CN110797022A (en) * 2019-09-06 2020-02-14 Tencent Technology Shenzhen Co Ltd Application control method and device, terminal and server

Similar Documents

Publication Publication Date Title
CN110633419B (en) Information pushing method and device
CN112533041A (en) Video playing method and device, electronic equipment and readable storage medium
CN110865855A (en) Applet processing method and related device
JP7262498B2 (en) Device interaction method, device, device, system and medium
CN111866567B (en) Multimedia playing method, device, equipment and storage medium
CN111724785A (en) Voice control method, device and storage medium for small program
CN111090691B (en) Data processing method and device, electronic equipment and storage medium
CN112235417B (en) Method and device for sending debugging instruction
JP2022003510A (en) Live distribution message transmission method and device, electronic apparatus, storage medium, and computer program
CN110765075A (en) Storage method and equipment of automatic driving data
CN110659330A (en) Data processing method, device and storage medium
CN112346612A (en) Page display method and device
CN112084395A (en) Search method, search device, electronic device, and storage medium
CN111782229A (en) Applet starting method and device and electronic equipment
CN111610972A (en) Page generation method, device, equipment and storage medium
CN111339462A (en) Component rendering method, device, server, terminal and medium
CN111246305A (en) Video preview method, device, equipment and storage medium
KR20210038278A (en) Speech control method and apparatus, electronic device, and readable storage medium
JP6921277B2 (en) Target detection and tracking methods, devices, electronic devices, storage media, and programs
CN110674338B (en) Voice skill recommendation method, device, equipment and storage medium
CN113555015A (en) Voice interaction method, voice interaction device, electronic device and storage medium
CN110675188A (en) Method and device for acquiring feedback information
CN111723343B (en) Interactive control method and device of electronic equipment and electronic equipment
CN113727165A (en) Video live broadcast method and device, electronic equipment and storage medium
CN112379954A (en) Data processing method, device and equipment of application program and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination