CN108766429B - Voice interaction method and device

Voice interaction method and device

Info

Publication number
CN108766429B
Authority
CN
China
Prior art keywords
voice
target word
information
speech
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810568760.0A
Other languages
Chinese (zh)
Other versions
CN108766429A (en)
Inventor
路华
黄世维
黄硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810568760.0A
Publication of CN108766429A
Application granted
Publication of CN108766429B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 2013/021: Overlap-add techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose a voice interaction method and device. One embodiment of the method comprises: extracting first voice information containing a target word speech segment; superimposing a prompt tone at the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed, where the prompt tone signals that the content currently being broadcast is a target word; in response to collecting second voice information fed back by the user, matching the second voice information against the target word; and in response to determining that the second voice information matches the target word, outputting by voice third voice information associated with the target word. This embodiment improves the efficiency of voice interaction.

Description

Voice interaction method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice interaction method and device.
Background
With the development of computer technology, voice interaction products have become increasingly diverse. In a voice-only interaction product, the user's expression is not constrained by a graphical interface and enjoys a very high degree of freedom, so the user's answers generally need to be constrained. In a voice-only interaction environment, it is therefore important to inform the user of those constraints efficiently and at low cost.
In one existing approach, the user is given a corresponding prompt through a graphical interface and learns the available voice instructions after reading a description or tutorial. In another existing approach, the available voice instructions are announced to the user by voice output.
Disclosure of Invention
The embodiment of the application provides a voice interaction method and device.
In a first aspect, an embodiment of the present application provides a voice interaction method, the method comprising: extracting first voice information containing a target word speech segment; superimposing a prompt tone at the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed, where the prompt tone signals that the content currently being broadcast is a target word; in response to collecting second voice information fed back by the user, matching the second voice information against the target word; and in response to determining that the second voice information matches the target word, outputting by voice third voice information associated with the target word.
In some embodiments, superimposing a prompt tone at the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed includes: superimposing an impulse-type prompt tone at the start of the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed, where the prompt tone ends before the target word speech segment ends.
In some embodiments, superimposing a prompt tone at the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed includes: superimposing a continuous prompt tone at the start of the target word speech segment and outputting by voice the first voice information with the prompt tone superimposed, where the prompt tone ends when the target word speech segment ends.
In some embodiments, outputting by voice third voice information associated with the target word in response to determining that the second voice information matches the target word includes: in response to determining that the second voice information matches the target word, determining the type of the first voice information, determining third voice information associated with the target word based on that type, and outputting the third voice information by voice.
In some embodiments, determining third voice information associated with the target word based on the type of the first voice information and outputting the third voice information by voice includes: generating an information search request containing the target word in response to determining that the type of the first voice information is the news broadcast type; sending the information search request to a server and receiving the search result returned by the server; and outputting by voice, as the third voice information, the voice information corresponding to the search result.
In some embodiments, determining third voice information associated with the target word based on the type of the first voice information and outputting the third voice information by voice includes: generating a service query request containing the target word in response to determining that the type of the first voice information is the service query type; sending the service query request to a server and receiving the query result returned by the server; and outputting by voice, as the third voice information, the voice information corresponding to the query result.
In some embodiments, determining third voice information associated with the target word based on the type of the first voice information and outputting the third voice information by voice includes: generating, in response to determining that the type of the first voice information is the information confirmation type, a jump instruction instructing a jump to a preset next voice message, and determining the next voice message as the third voice information.
In some embodiments, the volume of the prompt tone is less than the volume of the target word speech segment.
In a second aspect, an embodiment of the present application provides a voice interaction apparatus, comprising: an extraction unit configured to extract first voice information containing a target word speech segment; a first output unit configured to superimpose a prompt tone at the target word speech segment and output by voice the first voice information with the prompt tone superimposed, where the prompt tone signals that the content currently being broadcast is a target word; a matching unit configured to match, in response to collecting second voice information fed back by the user, the second voice information against the target word; and a second output unit configured to output by voice, in response to determining that the second voice information matches the target word, third voice information associated with the target word.
In some embodiments, the first output unit is further configured to: superimpose an impulse-type prompt tone at the start of the target word speech segment and output by voice the first voice information with the prompt tone superimposed, where the prompt tone ends before the target word speech segment ends.
In some embodiments, the first output unit is further configured to: superimpose a continuous prompt tone at the start of the target word speech segment and output by voice the first voice information with the prompt tone superimposed, where the prompt tone ends when the target word speech segment ends.
In some embodiments, the matching unit is further configured to: in response to determining that the second voice information matches the target word, determine the type of the first voice information, determine third voice information associated with the target word based on that type, and output the third voice information by voice.
In some embodiments, the matching unit comprises: a first generation module configured to generate an information search request containing the target word in response to determining that the type of the first voice information is the news broadcast type; a first sending module configured to send the information search request to a server and receive the search result returned by the server; and a first output module configured to output by voice, as the third voice information, the voice information corresponding to the search result.
In some embodiments, the matching unit comprises: a second generation module configured to generate a service query request containing the target word in response to determining that the type of the first voice information is the service query type; a second sending module configured to send the service query request to a server and receive the query result returned by the server; and a second output module configured to output by voice, as the third voice information, the voice information corresponding to the query result.
In some embodiments, the matching unit comprises: a third generation module configured to generate, in response to determining that the type of the first voice information is the information confirmation type, a jump instruction instructing a jump to a preset next voice message, and to determine the next voice message as the third voice information.
In some embodiments, the volume of the prompt tone is less than the volume of the target word speech segment.
In a third aspect, an embodiment of the present application provides a terminal device, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any embodiment of the voice interaction method.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method as in any embodiment of the voice interaction method.
According to the voice interaction method and device provided by the embodiments of the present application, first voice information containing a target word speech segment is extracted, a prompt tone is superimposed at the target word speech segment, and the first voice information with the prompt tone superimposed is output by voice; then, when second voice information fed back by the user is collected, the third voice information to be output by voice is determined based on matching the second voice information against the target word; finally, the third voice information is output by voice. There is thus no need to tell the user the available voice instructions through a graphical interface or a spoken announcement, and the user need not spend extra time reading or listening to instructions and tutorials: superimposing prompt tones prompts the user which voice instructions can be given, which improves the efficiency of voice interaction.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a voice interaction method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a voice interaction method according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a voice interaction method according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of a voice interaction device according to the present application;
fig. 6 is a schematic structural diagram of a computer system suitable for implementing a terminal device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the voice interaction method or voice interaction apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a voice interaction application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices that support voice interaction, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
The server 105 may be a server providing various services, such as a background server providing support for voice interaction type applications installed on the terminal devices 101, 102, 103. The background server can analyze and process the received data such as the information search request, the service inquiry request and the like, and feed back the processing result to the terminal equipment.
The server 105 may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
It should be noted that the voice interaction method provided in the embodiment of the present application is generally executed by the terminal devices 101, 102, and 103, and accordingly, the voice interaction apparatus is generally disposed in the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a voice interaction method according to the present application is shown. The voice interaction method comprises the following steps:
Step 201: extract first voice information containing a target word speech segment.
In the present embodiment, the execution subject of the voice interaction method (e.g., the terminal devices 101, 102, 103 shown in fig. 1) may extract first voice information to be output by voice. The first voice information may include a target word speech segment.
The target word speech segment is the speech segment formed by the speech converted from the target word in the first voice information. The target word may be a word used to generate an instruction (e.g., an information search instruction, a service query instruction, or a jump instruction). As an example, in the first voice information "Arsenal defeats Chelsea 3:0", both "Arsenal" and "Chelsea" may be target words, and the speech segments corresponding to "Arsenal" and "Chelsea" are target word speech segments. After the user answers "Arsenal" by voice, the execution subject may generate an information search instruction containing the character string "Arsenal"; after the user answers "Chelsea" by voice, it may generate an information search instruction containing the character string "Chelsea".
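To make the relationship between the first voice information and its target word segments concrete, one possible representation is sketched below in Python. This is a minimal sketch only; the patent does not prescribe a data structure, and every name in it is an illustrative assumption.

```python
# Hypothetical representation of first voice information together with
# its target word speech segments; none of these names come from the patent.
from dataclasses import dataclass, field

@dataclass
class TargetWordSegment:
    word: str        # the target word, e.g. "Arsenal"
    start_ms: int    # where its speech segment starts in the audio
    end_ms: int      # where its speech segment ends

@dataclass
class FirstVoiceInfo:
    text: str                      # e.g. "Arsenal defeats Chelsea 3:0"
    audio: bytes                   # synthesized speech (PCM)
    segments: list[TargetWordSegment] = field(default_factory=list)
```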
Step 202: superimpose a prompt tone at the target word speech segment, and output by voice the first voice information with the prompt tone superimposed.
In this embodiment, the execution subject may superimpose a prompt tone at the target word speech segment and output the first voice information with the prompt tone superimposed. The prompt tone signals that the content currently being broadcast is a target word. As an example, the prompt tone may be an impulse-type tone (e.g., a "ding" or "dong") whose volume gradually decreases over time, or a continuous tone whose volume stays at the same level from its start to its end. Note that the prompt tone may differ from the target word speech segment in volume (e.g., quieter than the segment), timbre (e.g., a tone listeners perceive as softer), and so on, so as to reduce interference with the user. The sketch below illustrates both tone styles.
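As a concrete illustration of the two tone styles, the following is a minimal Python sketch assuming 16 kHz mono float32 audio; the frequencies and envelope shape are assumptions, not values given in this description.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate

def impulse_tone(duration_s: float, freq_hz: float = 880.0) -> np.ndarray:
    """Impulse-style 'ding': volume gradually decreases over time."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    envelope = np.exp(-6.0 * t / duration_s)  # decaying volume
    return envelope * np.sin(2.0 * np.pi * freq_hz * t)

def continuous_tone(duration_s: float, freq_hz: float = 440.0) -> np.ndarray:
    """Continuous tone: volume held at the same level from start to end."""
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return np.sin(2.0 * np.pi * freq_hz * t)
```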
Here, superimposing the prompt tone "at" the target word speech segment may mean starting it at the start of the segment, or starting it a preset interval before the segment begins (e.g., 0.1 or 0.2 seconds earlier). The prompt tone may end when the target word speech segment ends, or before it ends. The mixing step might look like the sketch below.
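Continuing the sketch above, the mixing step could be written as follows; the gain value is an assumption chosen to keep the prompt tone quieter than the target word speech, in line with the volume remark above.

```python
import numpy as np

def superimpose(speech: np.ndarray, tone: np.ndarray,
                segment_start: int, lead_samples: int = 0,
                gain: float = 0.2) -> np.ndarray:
    """Mix the prompt tone into the speech starting at (or a little
    before) the target word segment; gain keeps it quieter than speech."""
    out = speech.copy()
    start = max(segment_start - lead_samples, 0)
    end = min(start + len(tone), len(out))
    out[start:end] += gain * tone[: end - start]
    return np.clip(out, -1.0, 1.0)  # guard against clipping after mixing
```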
In some optional implementations of this embodiment, the execution subject may superimpose an impulse-type prompt tone at the start of the target word speech segment and output the first voice information with the prompt tone superimposed, where the prompt tone ends before the target word speech segment ends.
In some optional implementations of this embodiment, the execution subject may superimpose a continuous prompt tone at the start of the target word speech segment and output the first voice information with the prompt tone superimposed, where the prompt tone ends when the target word speech segment ends.
Step 203: in response to collecting second voice information fed back by the user, match the second voice information against the target word.
In this embodiment, the execution subject may match the second voice information against the target word in response to collecting second voice information fed back by the user. This may specifically be performed according to the following steps:
First, after outputting by voice the first voice information with the prompt tone superimposed, the execution subject may use its microphone to collect the voice signal within a preset time period.
Second, the execution subject may process the collected voice signal to obtain voice information and use it as the second voice information fed back by the user. The collected signal may be processed in various ways. As an example, the voice signal may first be high-pass filtered to eliminate (or attenuate) interfering components. Various echo cancellation methods may then be applied to the filtered signal to remove the echo of the device's own output. Finally, automatic gain control may be applied to amplify the echo-cancelled signal, yielding the voice information used as the second voice information fed back by the user.
Third, the execution subject may recognize the second voice information using a pre-trained acoustic model to obtain a speech recognition result (e.g., the character string corresponding to the second voice information). Here, the acoustic model may be obtained by supervised training, via machine learning methods, on training samples composed of a large amount of voice information. Various models may be used, such as Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs), or a combination of several models.
Fourth, the execution subject may match the speech recognition result against the target word. As an example, it may determine whether the recognition result coincides with the target word: if they are consistent, the second voice information can be determined to match the target word; otherwise, a mismatch may be determined. As another example, the execution subject may determine whether the recognition result contains the target word: if so, the second voice information can be determined to match the target word; otherwise, a mismatch may be determined.
Note that the electronic device may also determine whether the second voice information matches the target word in other ways, for example by comparing the second voice information with the target word speech segment to determine whether the two are similar; if so, the second voice information can be determined to match the target word. A minimal sketch of the capture-and-match flow is given below.
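The capture-and-match flow might be sketched in Python as follows, assuming 16 kHz mono float32 PCM. The high-pass filter uses scipy; the echo canceller and gain control are deliberately simplified stand-ins, and recognize() is a placeholder for the pre-trained acoustic model, so every name here is illustrative rather than taken from the patent.

```python
import numpy as np
from scipy.signal import butter, lfilter

RATE = 16000  # assumed sample rate

def high_pass(signal: np.ndarray, cutoff_hz: float = 100.0) -> np.ndarray:
    """Attenuate low-frequency interfering components in the captured signal."""
    b, a = butter(4, cutoff_hz / (RATE / 2), btype="highpass")
    return lfilter(b, a, signal)

def cancel_echo(mic: np.ndarray, playback: np.ndarray) -> np.ndarray:
    """Naive echo removal: subtract a scaled copy of the device's own
    output. Real echo cancellers use adaptive filters."""
    n = min(len(mic), len(playback))
    out = mic.copy()
    out[:n] -= 0.5 * playback[:n]
    return out

def auto_gain(signal: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Amplify the cleaned signal to a usable level."""
    rms = max(float(np.sqrt(np.mean(signal ** 2))), 1e-8)
    return signal * (target_rms / rms)

def recognize(signal: np.ndarray) -> str:
    """Placeholder for the pre-trained acoustic model (HMM/RNN/DNN)."""
    raise NotImplementedError

def second_voice_matches(mic: np.ndarray, playback: np.ndarray,
                         target_word: str) -> bool:
    """Clean the captured signal, decode it, then match either by exact
    equality or by containment, as described above."""
    cleaned = auto_gain(cancel_echo(high_pass(mic), playback))
    result = recognize(cleaned).strip()
    return result == target_word or target_word in result
```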
Step 204: in response to determining that the second voice information matches the target word, output by voice third voice information associated with the target word.
In this embodiment, the execution subject may, in response to determining that the second voice information matches the target word, output by voice the third voice information associated with the target word. Here, a preset instruction associated with the target word (for example, an information search instruction instructing a search with the target word as the query) may be executed first, and the execution result (for example, the information found) determined as the third voice information associated with the target word.
Note that, in response to determining that the speech recognition result does not match the target word, preset voice information prompting the user to resend the voice information may be determined as the third voice information.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the voice interaction method according to the present embodiment. In the application scenario of fig. 3, a user holds the terminal device 301 and performs voice interaction with the terminal device 301.
First, the terminal device 301 extracts the first voice information "Arsenal defeats Chelsea 3:0", which contains the target word speech segments "Arsenal" and "Chelsea", and superimposes a prompt tone at each of the two segments. It then outputs by voice the first voice information with the prompt tones superimposed. After hearing the first voice information broadcast by the terminal device 301, the user knows that questions can be asked about "Arsenal" and "Chelsea" and utters the second voice information "Arsenal". The terminal device 301 then searches for the target word "Arsenal", converts the retrieved introduction to Arsenal into speech, and broadcasts it as the third voice information.
According to the method provided by the above embodiment of the present application, first voice information containing a target word speech segment is extracted, a prompt tone is superimposed at the target word speech segment, and the first voice information with the prompt tone superimposed is output by voice; then, when second voice information fed back by the user is collected, the third voice information to be output by voice is determined based on matching the second voice information against the target word; finally, the third voice information is output by voice. There is thus no need to tell the user the available voice instructions through a graphical interface or a spoken announcement, and the user need not spend extra time reading or listening to instructions and tutorials: superimposing prompt tones prompts the user which voice instructions can be given, improving the efficiency and flexibility of voice interaction.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a voice interaction method is shown. The process 400 of the voice interaction method includes the following steps:
Step 401: extract first voice information containing a target word speech segment.
In this embodiment, the execution subject of the voice interaction method (e.g., the terminal devices 101, 102, 103 shown in fig. 1) may extract first voice information containing a target word speech segment. The target word speech segment is the speech segment formed by the speech converted from the target word in the first voice information. The target word may be a word used to generate an instruction (e.g., an information search instruction, a service query instruction, or a jump instruction).
Step 402: superimpose a continuous prompt tone at the start of the target word speech segment, and output by voice the first voice information with the prompt tone superimposed.
In this embodiment, the execution subject may superimpose a continuous prompt tone at the start of the target word speech segment and output by voice the first voice information with the prompt tone superimposed, where the prompt tone ends when the target word speech segment ends. Here, the volume of the prompt tone may be lower than the volume of the target word speech segment.
Step 403: in response to collecting second voice information fed back by the user, match the second voice information against the target word.
In this embodiment, after outputting by voice the first voice information with the prompt tone superimposed, the execution subject may use its microphone to collect the voice signal within a preset time period, process the collected signal to obtain voice information, and use it as the second voice information fed back by the user. The second voice information may then be recognized with the pre-trained acoustic model to obtain a speech recognition result, which is finally matched against the target word. As an example, it may be determined whether the recognition result coincides with the target word: if they are consistent, the second voice information can be determined to match the target word; otherwise, a mismatch may be determined.
Step 404: in response to determining that the second voice information matches the target word, determine the type of the first voice information.
In this embodiment, the type of the first voice information is determined in response to determining that the second voice information matches the target word. The type may include, but is not limited to, the news broadcast type, the service query type, the information confirmation type, and so on, and different types of voice information may correspond to different third voice information associated with the target word.
Step 405: determine third voice information associated with the target word based on the type of the first voice information, and output the third voice information by voice.
In this embodiment, the execution subject may determine third voice information associated with the target word based on the type of the first voice information and output it by voice. Different types of voice information may correspond to different third voice information associated with the target word, as the three optional implementations below illustrate; a consolidated sketch follows them.
In some optional implementations of this embodiment, in response to determining that the type of the first voice information is the news broadcast type, the execution subject may first generate an information search request containing the target word, send it to a server (e.g., the server 105 shown in fig. 1), and receive the search result returned by the server. The voice information corresponding to the search result may then be output by voice as the third voice information. As an example, the execution subject outputs by voice the first voice information "Arsenal defeats Chelsea 3:0", in which "Arsenal" and "Chelsea" are both target word speech segments with prompt tones superimposed. In response to determining that the target word "Arsenal" is contained in the second voice information replied by the user, the execution subject may send an information search request containing the target word "Arsenal" to the server to search for information related to "Arsenal" (for example, an introduction to Arsenal), convert the search result into speech, and output it by voice. The converted speech is the third voice information.
In some optional implementations of this embodiment, in response to determining that the type of the first voice information is the service query type, the execution subject may first generate a service query request containing the target word, send it to a server, and receive the query result returned by the server. The voice information corresponding to the query result is then output by voice as the third voice information. As an example, the execution subject outputs by voice the first voice information "You can query your balance and other information", in which "balance" is a target word speech segment with a prompt tone superimposed. In response to determining that the target word speech segment "balance" is contained in the second voice information replied by the user, the execution subject may send a service query request containing the target word "balance" to the server to query the balance, convert the query result into speech, and output it by voice. The converted speech is the third voice information.
In some optional implementations of this embodiment, in response to determining that the type of the first voice information is the information confirmation type, the execution subject may generate a jump instruction instructing a jump to a preset next voice message and determine that next voice message as the third voice information. As an example, the execution subject outputs by voice the first voice information "Your destination is conference room No. 5; please confirm", in which "destination" and "confirm" are both target word speech segments with prompt tones superimposed. In response to determining that the target word speech segment "confirm" is contained in the second voice information replied by the user, the execution subject may generate a jump instruction instructing a jump to the preset next voice message "Starting navigation for you now" and determine that message as the third voice information. In response to determining that the target word speech segment "destination" is contained in the second voice information replied by the user, it may instead generate a jump instruction instructing a jump to the preset next voice message "Please re-enter the destination" and determine that message as the third voice information. Note that voice information of the information confirmation type need not contain the word "confirm". For example, "Hello, we offer Chinese meals and hamburgers; the drinks are cola and orange juice" may also be of the information confirmation type. In this example, "Chinese meals", "hamburgers", "cola", and "orange juice" are all target word speech segments with prompt tones superimposed. In response to determining that any of these target word speech segments (for example, "Chinese meals") is contained in the second voice information replied by the user, the execution subject may generate a jump instruction to jump to the preset next voice message corresponding to "Chinese meals", such as "For Chinese meals we have steamed buns, dumplings, and rice; please choose", and determine that message as the third voice information.
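Putting the three optional implementations above together, the dispatch on the type of the first voice information might look like the following sketch. The type names, the stand-in server calls, and the preset-message table are illustrative assumptions; the patent does not fix these interfaces.

```python
from typing import Callable

def search_news(word: str) -> str:
    """Stand-in for sending an information search request to the server."""
    return f"Here is what I found about {word}."

def query_service(word: str) -> str:
    """Stand-in for sending a service query request to the server."""
    return f"Your {word} is ..."

# Preset next voice messages for the information confirmation type.
PRESET_NEXT = {
    "confirm": "Starting navigation for you now.",
    "destination": "Please re-enter the destination.",
}

HANDLERS: dict[str, Callable[[str], str]] = {
    "news_broadcast": search_news,
    "service_query": query_service,
    "info_confirmation": lambda word: PRESET_NEXT.get(word, "OK."),
}

def third_voice_info(first_info_type: str, target_word: str) -> str:
    """Choose the third voice information once the second voice
    information has matched the target word."""
    handler = HANDLERS.get(
        first_info_type,
        lambda _: "Sorry, please say that again.",  # fallback prompt
    )
    return handler(target_word)  # text to be synthesized and played
```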
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the voice interaction method in this embodiment highlights the step of determining the third voice information to be output by voice for the different types of the first voice information. The scheme described in this embodiment therefore prompts the user which voice instructions can be given by superimposing a prompt tone at the target word speech segment, improving the efficiency and flexibility of voice interaction. The scheme also supports multiple rounds of voice interaction without requiring the user to read a description or tutorial during the interaction and without broadcasting rules for issuing instructions, which further improves the flexibility and efficiency of voice interaction.
With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of a voice interaction apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 5, the voice interaction apparatus 500 of this embodiment includes: an extraction unit 501 configured to extract first voice information containing a target word speech segment; a first output unit 502 configured to superimpose a prompt tone at the target word speech segment and output by voice the first voice information with the prompt tone superimposed, where the prompt tone signals that the content currently being broadcast is a target word; a matching unit 503 configured to match, in response to collecting second voice information fed back by the user, the second voice information against the target word; and a second output unit 504 configured to output by voice, in response to determining that the second voice information matches the target word, third voice information associated with the target word.
In some embodiments, the first output unit 502 may be further configured to superimpose an impulse-type prompt tone at the start of the target word speech segment and output the first voice information with the prompt tone superimposed, where the prompt tone ends before the target word speech segment ends.
In some embodiments, the first output unit 502 may be further configured to superimpose a continuous prompt tone at the start of the target word speech segment and output the first voice information with the prompt tone superimposed, where the prompt tone ends when the target word speech segment ends.
In some embodiments, the matching unit 503 may be further configured to determine a type of the first voice information in response to determining that the second voice information matches the target word, determine third voice information associated with the target word based on the type of the first voice information, and output the third voice information by voice.
In some embodiments, the matching unit 503 may include a first generating module, a first sending module, and a first outputting module (not shown in the figure). The first generating module may be configured to generate an information search request including the target word in response to determining that the type of the first voice information is a news broadcast type. The first sending module may be configured to send the information search request to a server, and receive a search result returned by the server. The first output module may be configured to output the third voice information as a voice by taking the voice information corresponding to the search result as the third voice information.
In some embodiments, the matching unit 503 may include a second generating module, a second sending module, and a second outputting module (not shown in the figure). The second generating module may be configured to generate a service query request including the target word in response to determining that the type of the first voice information is a service query class. The second sending module may be configured to send the service query request to a server, and receive a query result returned by the server. The second output module may be configured to output the third voice information as voice information corresponding to the query result.
In some embodiments, the matching unit 503 may include a third generating module (not shown in the figure). The third generating module may be configured to generate a skip instruction for instructing to skip to a preset next voice message in response to determining that the type of the first voice message is the information confirmation type, and determine the next voice message as the third voice message.
In some embodiments, the volume of the prompt tone is less than the volume of the target word speech segment.
In the apparatus provided by the above embodiment of the present application, the extraction unit 501 extracts first voice information containing a target word speech segment; the first output unit 502 then superimposes a prompt tone at the target word speech segment and outputs by voice the first voice information with the prompt tone superimposed; the matching unit 503 then matches the second voice information against the target word when second voice information fed back by the user is collected; and if they match, the second output unit 504 outputs by voice the third voice information associated with the target word. There is thus no need to tell the user the available voice instructions through a graphical interface or a spoken announcement, and the user need not spend extra time reading or listening to instructions and tutorials: superimposing prompt tones prompts the user which voice instructions can be given, improving the efficiency and flexibility of voice interaction.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a terminal device of an embodiment of the present application. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a touch screen, a touch panel, and the like; an output portion 607 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 608; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a semiconductor memory or the like is mounted on the drive 610 as necessary, so that the computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an extraction unit, a first output unit, a matching unit, and a second output unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the extraction unit may also be described as a "unit that extracts first speech information containing a target word speech fragment".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: extracting first voice information containing a target word voice segment; superposing a prompt tone at the target word sound fragment, and outputting first voice information superposed with the prompt tone by voice, wherein the prompt tone is used for prompting the currently broadcasted content as a target word; responding to the collected second voice information fed back by the user, and matching the second voice information with the target word; responsive to determining that the second speech information matches the target word, speech outputting third speech information associated with the target word.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (18)

1. A voice interaction method, comprising:
extracting first voice information containing a target word voice segment;
superposing a prompt tone at the target word sound fragment, and outputting first voice information obtained after superposing the prompt tone in a voice mode, wherein the prompt tone is used for prompting that the currently broadcasted content is a target word, and the acoustic characteristic of the prompt tone is different from that of the target word voice;
responding to second voice information fed back by a user, and matching the second voice information with the target word;
responsive to determining that the second speech information matches the target word, speech outputting third speech information associated with the target word.
2. The voice interaction method according to claim 1, wherein the superimposing an alert tone at the target word sound segment, and the voice outputting the first voice information after superimposing the alert tone includes:
and superposing an impulse type prompt tone at the start of the target word voice segment, and outputting the first voice information superposed with the prompt tone in a voice mode, wherein the prompt tone is ended before the end of the target word voice segment.
3. The voice interaction method according to claim 1, wherein the superimposing an alert tone at the target word sound segment, and the voice outputting the first voice information after superimposing the alert tone includes:
and superposing a continuous prompt tone at the beginning of the target word voice segment, and outputting the first voice information superposed with the prompt tone in a voice mode, wherein the prompt tone is ended when the target word voice segment is ended.
4. The voice interaction method of claim 1, wherein said responsive to determining that the second voice information matches the target word, voice outputting third voice information associated with the target word comprises:
in response to determining that the second speech information matches the target word, determining a type of the first speech information, determining third speech information associated with the target word based on the type of the first speech information, the third speech information being speech output.
5. The voice interaction method of claim 4, wherein the determining third voice information associated with the target word based on the type of the first voice information, the voice outputting the third voice information comprises:
generating an information search request containing the target word in response to determining that the type of the first voice information is a news broadcast type;
sending the information search request to a server, and receiving a search result returned by the server;
and taking the voice information corresponding to the search result as third voice information, and outputting the third voice information in a voice mode.
6. The voice interaction method of claim 4, wherein the determining third voice information associated with the target word based on the type of the first voice information, the voice outputting the third voice information comprises:
responding to the type of the first voice information determined to be a service query type, and generating a service query request containing the target word;
sending the service query request to a server, and receiving a query result returned by the server;
and taking the voice information corresponding to the query result as third voice information, and outputting the third voice information in a voice mode.
7. The voice interaction method of claim 4, wherein the determining third voice information associated with the target word based on the type of the first voice information, the voice outputting the third voice information comprises:
and in response to the fact that the type of the first voice message is determined to be the message confirmation type, generating a jump instruction for instructing to jump to a preset next voice message, and determining the next voice message as a third voice message.
8. The voice interaction method of one of claims 1 to 7, wherein a volume of the alert tone is less than a volume of the target word voice segment.
9. A voice interaction device, comprising:
an extraction unit configured to extract first voice information including a target word voice segment;
a first output unit configured to superimpose a prompt tone at the target word tone segment, and output a first voice message after superimposing the prompt tone in a voice mode, wherein the prompt tone is used for prompting that the currently broadcasted content is a target word, and the acoustic characteristic of the prompt tone is different from the acoustic characteristic of the target word voice;
the matching unit is configured to respond to the collection of second voice information fed back by a user and match the second voice information with the target word;
a second output unit configured to, in response to determining that the second speech information matches the target word, speech output third speech information associated with the target word.
10. The voice interaction apparatus of claim 9, wherein the first output unit is further configured to:
and superposing an impulse type prompt tone at the start of the target word voice segment, and outputting the first voice information superposed with the prompt tone in a voice mode, wherein the prompt tone is ended before the end of the target word voice segment.
11. The voice interaction apparatus of claim 9, wherein the first output unit is further configured to:
superimpose a continuous prompt tone at the start of the target word voice segment and output the first voice information with the prompt tone superimposed in a voice mode, wherein the prompt tone ends when the target word voice segment ends.
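Claims 10 and 11 differ only in how long the superimposed tone lasts relative to the target word voice segment. A numpy sketch covering both variants, assuming the tone and the first voice information share a sample rate and that seg_start/seg_end index valid samples:

```python
import numpy as np


def superimpose_prompt(first_voice: np.ndarray, prompt: np.ndarray,
                       seg_start: int, seg_end: int,
                       continuous: bool) -> np.ndarray:
    """Mix a prompt tone onto the target word voice segment.

    continuous=False -> claim 10: an impulse-type tone that starts with
    the segment and ends before the segment does.
    continuous=True  -> claim 11: a tone spanning the whole segment.
    """
    out = first_voice.astype(float)  # work on a float copy
    seg_len = seg_end - seg_start
    if continuous:
        # Nearest-neighbor resample so the tone covers the full segment.
        idx = np.linspace(0, len(prompt) - 1, seg_len).astype(int)
        tone = prompt[idx]
    else:
        # Truncate so the impulse ends strictly before the segment ends.
        tone = prompt[: max(1, min(len(prompt), seg_len - 1))]
    out[seg_start:seg_start + len(tone)] += tone
    return out
```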
12. The voice interaction device of claim 9, wherein the matching unit is further configured to:
in response to determining that the second voice information matches the target word, determine a type of the first voice information, determine third voice information associated with the target word based on the type of the first voice information, and output the third voice information in a voice mode.
13. The voice interaction apparatus of claim 12, wherein the matching unit comprises:
a first generation module configured to generate an information search request including the target word in response to determining that the type of the first voice information is a news broadcast type;
a first sending module configured to send the information search request to a server and receive a search result returned by the server;
a first output module configured to take the voice information corresponding to the search result as the third voice information and output the third voice information in a voice mode.
14. The voice interaction apparatus of claim 12, wherein the matching unit comprises:
a second generation module configured to generate a service query request including the target word in response to determining that the type of the first voice information is a service query type;
a second sending module configured to send the service query request to a server and receive a query result returned by the server;
a second output module configured to take the voice information corresponding to the query result as the third voice information and output the third voice information in a voice mode.
15. The voice interaction apparatus of claim 12, wherein the matching unit comprises:
a third generation module configured to, in response to determining that the type of the first voice information is an information confirmation type, generate a jump instruction for instructing a jump to preset next voice information and determine the next voice information as the third voice information.
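The jump of claims 7 and 15 can be modeled as a cursor advancing over a preset sequence of voice information; a minimal sketch (the playlist abstraction is an assumption, not part of the claims):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoicePlaylist:
    """Preset sequence of voice information; the jump instruction of
    claims 7 and 15 advances to the next entry."""
    messages: List[str]
    cursor: int = 0

    def jump_to_next(self) -> str:
        """Execute the jump instruction; returns the third voice information."""
        if self.cursor + 1 >= len(self.messages):
            raise IndexError("no preset next voice information")
        self.cursor += 1
        return self.messages[self.cursor]
```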
16. The voice interaction device of one of claims 9-15, wherein a volume of the prompt tone is less than a volume of the target word voice segment.
17. A terminal device, comprising:
one or more processors;
a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
18. A computer-readable medium on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1-8.
CN201810568760.0A 2018-06-05 2018-06-05 Voice interaction method and device Active CN108766429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810568760.0A CN108766429B (en) 2018-06-05 2018-06-05 Voice interaction method and device

Publications (2)

Publication Number Publication Date
CN108766429A 2018-11-06
CN108766429B 2020-08-21

Family

ID=63999018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810568760.0A Active CN108766429B (en) 2018-06-05 2018-06-05 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN108766429B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058916A * 2019-04-23 2019-07-26 Shenzhen Skyworth Digital Technology Co., Ltd. Voice function jump method, device, equipment and computer storage medium
CN110265016A * 2019-06-25 2019-09-20 Baidu Online Network Technology (Beijing) Co., Ltd. Voice interaction method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003219332A (en) * 2002-01-23 2003-07-31 Canon Inc Program reservation apparatus and method, and program
CN1934848A * 2004-03-18 2007-03-21 Sony Corporation Method and apparatus for voice interactive messaging
CN202759531U * 2012-06-13 2013-02-27 Qingdao Haier Electronics Co., Ltd. Television set volume control system
CN102810316A * 2012-06-29 2012-12-05 Yulong Computer Telecommunication Scientific (Shenzhen) Co., Ltd. Method for adding background voice during conversation and communication terminal
CN103292437A * 2013-06-17 2013-09-11 Guangdong Midea Refrigeration Equipment Co., Ltd. Voice interactive air conditioner and control method thereof
CN106055605A * 2016-05-25 2016-10-26 Shenzhen Tongban Interactive Culture Industry Development Co., Ltd. Voice interaction control method and apparatus thereof
CN107707828A * 2017-09-26 2018-02-16 Vivo Mobile Communication Co., Ltd. Video processing method and mobile terminal
CN107895578A * 2017-11-15 2018-04-10 Baidu Online Network Technology (Beijing) Co., Ltd. Voice interaction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Intelligent voice interaction technology and its standardization; Hu Yu, Yan Jun; Information Technology & Standardization; 2015-04-10; pp. 14-17 *

Also Published As

Publication number Publication date
CN108766429A 2018-11-06

Similar Documents

Publication Publication Date Title
US10832686B2 (en) Method and apparatus for pushing information
CN107393541B (en) Information verification method and device
US10708423B2 (en) Method and apparatus for processing voice information to determine emotion based on volume and pacing of the voice
CN107863108B (en) Information output method and device
JP6633153B2 (en) Method and apparatus for extracting information
CN109325091B (en) Method, device, equipment and medium for updating attribute information of interest points
CN108924218B (en) Method and device for pushing information
US11127399B2 (en) Method and apparatus for pushing information
CN107481715B (en) Method and apparatus for generating information
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111739553A (en) Conference sound acquisition method, conference recording method, conference record presentation method and device
CN110956955B (en) Voice interaction method and device
CN112364144B (en) Interaction method, device, equipment and computer readable medium
CN108766429B (en) Voice interaction method and device
CN107680584B (en) Method and device for segmenting audio
CN110223694B (en) Voice processing method, system and device
CN110138654B (en) Method and apparatus for processing speech
CN110232920A Speech processing method and device
CN114244793A (en) Information processing method, device, equipment and storage medium
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
JP7182584B2 Method for outputting information on parsing anomalies in speech comprehension
CN113299285A (en) Device control method, device, electronic device and computer-readable storage medium
CN107608718B (en) Information processing method and device
CN110634478A (en) Method and apparatus for processing speech signal
CN111556096B (en) Information pushing method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant