CN111429899A - Speech response processing method, device, equipment and medium based on artificial intelligence - Google Patents


Info

Publication number
CN111429899A
Authority
CN
China
Prior art keywords
voice
target
response
playing
intention
Prior art date
Legal status
Pending
Application number
CN202010122179.3A
Other languages
Chinese (zh)
Inventor
吕林澧
叶松
孙建波
Current Assignee
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010122179.3A priority Critical patent/CN111429899A/en
Publication of CN111429899A publication Critical patent/CN111429899A/en
Priority to PCT/CN2021/070450 priority patent/WO2021169615A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice response processing method, apparatus, device, and medium based on artificial intelligence. The method comprises the following steps: acquiring a to-be-processed voice stream collected in real time by a voice recording module; performing sentence integrity analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; executing a first processing process and a second processing process in parallel, controlling a voice playing module to play a target tone word recording based on the first processing process, and recognizing the to-be-analyzed voice stream based on the second processing process to obtain a target response voice; and monitoring in real time the playing state of the target tone word recording played by the voice playing module, and controlling the voice playing module to play the target response voice if the playing state is play-finished. The method enables the intelligent interaction device to respond in real time during human-computer interaction, shortening the response time and improving the response effect of voice interaction.

Description

Speech response processing method, device, equipment and medium based on artificial intelligence
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech response processing method, apparatus, device, and medium based on artificial intelligence.
Background
An intelligent interactive device with a voice interaction function can collect and recognize real-time speech from a user, and respond based on the recognition result of the real-time speech to achieve the purpose of human-machine interaction. However, recognizing and analyzing a voice stream takes time, so a pause occurs between the moment the user stops speaking and the moment the device responds.
Disclosure of Invention
Embodiments of the invention provide a voice response processing method, apparatus, device, and medium based on artificial intelligence, to solve the problem that the pause response time of voice interaction between current intelligent interaction devices and users is too long.
An artificial intelligence based voice response processing method, comprising:
acquiring a voice stream to be processed, which is acquired by a voice recording module in real time;
performing sentence integrity analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream;
executing a first processing process and a second processing process in parallel, calling the first processing process to control a voice playing module to play a target tone word recording, and calling the second processing process to recognize the to-be-analyzed voice stream to obtain a target response voice;
and monitoring in real time the playing state of the voice playing module playing the target tone word recording, and controlling the voice playing module to play the target response voice if the playing state is play-finished.
An artificial intelligence based voice response processing apparatus comprising:
the voice stream to be processed acquisition module is used for acquiring the voice stream to be processed which is acquired by the voice recording module in real time;
the to-be-analyzed voice stream acquisition module is used for performing sentence integrity analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream;
the playing-and-analysis parallel processing module is used for executing a first processing process and a second processing process in parallel, calling the first processing process to control the voice playing module to play a target tone word recording, and calling the second processing process to recognize the to-be-analyzed voice stream to obtain a target response voice;
and the response voice real-time playing module is used for monitoring in real time the playing state of the target tone word recording played by the voice playing module, and controlling the voice playing module to play the target response voice if the playing state is play-finished.
An intelligent interaction device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the artificial intelligence based voice response processing method when executing the computer program.
A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the artificial intelligence based voice response processing method described above.
According to the artificial intelligence based voice response processing method, apparatus, device, and medium, sentence integrity analysis is first performed on the to-be-processed voice stream collected in real time during voice interaction to determine the to-be-analyzed voice stream, which improves the accuracy and timeliness of subsequent recognition and analysis. The target tone word recording is played while the to-be-analyzed voice stream is being recognized, and the target response voice is played after the target tone word recording finishes. Because recognition of the to-be-analyzed voice stream and playback of the target tone word recording proceed simultaneously, the target tone word recording fills the pause response time of the to-be-analyzed voice stream, playback of the recording and of the target response voice is linked naturally, and the response time and response effect of voice interaction are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram of an application environment of an artificial intelligence based voice response processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for artificial intelligence based speech response processing according to an embodiment of the present invention;
FIG. 3 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 4 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 5 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 6 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 7 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 8 is another flow chart of a method for artificial intelligence based speech response processing in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of an artificial intelligence based speech response processing apparatus according to an embodiment of the present invention;
FIG. 10 is a diagram of an intelligent interaction device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The artificial intelligence based voice response processing method provided by the embodiment of the invention can be applied to independently arranged intelligent interaction equipment and can also be applied to an application environment as shown in fig. 1.
As an example, when the artificial intelligence based voice response processing method is applied to a standalone intelligent interaction device, the device is provided with a processor together with a voice recording module and a voice playing module connected to the processor, and the method can be executed on the processor. During voice interaction between a user and the device, each pause in the interaction then has a short response time, so the user perceives no delay during voice interaction and the experience is better.
As another example, the artificial intelligence based voice response processing method is applied to an artificial intelligence based voice response processing system comprising an intelligent interaction device and a server, as shown in fig. 1. The intelligent interaction device communicates with the server through a network and is provided with a voice recording module and a voice playing module, while the method is executed on the server, so that each pause in the voice interaction between the user and the device has a short response time, the user perceives no delay, and the experience is better. The server may be implemented as a standalone server or as a server cluster consisting of multiple servers. The intelligent interaction device may be a robot capable of human-computer interaction.
In one embodiment, as shown in fig. 2, an artificial intelligence based voice response processing method is provided. Taking as an example its application to the processor of a standalone intelligent interaction device, or to a server connected to the intelligent interaction device, the method includes the following steps:
S201: acquiring the to-be-processed voice stream collected in real time by the voice recording module.
The voice recording module is a module capable of realizing a recording function. As an example, the voice recording module may be a recording chip integrated on the smart interactive device or the client for implementing a recording function.
The voice stream to be processed is the voice stream which is acquired by the voice recording module in real time and needs to be subjected to subsequent identification processing. As an example, the processor of the intelligent interactive device or the server connected to the intelligent interactive device may acquire a to-be-processed voice stream formed by the voice recording module in a real-time user speaking process, where the to-be-processed voice stream is a voice stream that the user wants to interact with the intelligent interactive device and is used for reflecting the intention of the user.
S202: and carrying out statement integrity analysis on the voice stream to be processed to obtain the voice stream to be analyzed.
The to-be-analyzed voice stream is a voice stream, determined from the to-be-processed voice stream, that reflects that the user has finished a complete utterance. Performing sentence integrity analysis on the to-be-processed voice stream means dividing it by complete sentences, so that each to-be-analyzed voice stream completely and accurately reflects the user's intention.
As an example, the processor of the intelligent interaction device, or the server connected to it, may intercept, from the to-be-processed voice stream recorded in real time by the voice recording module, a voice stream reflecting that the user has finished a section of speech, and take it as the to-be-analyzed voice stream. This stream is then recognized and analyzed to determine the user intention it reflects, and a response is made based on that intention, achieving the purpose of human-computer interaction. It can be understood that intercepting such a stream from the to-be-processed voice stream guarantees the accuracy and timeliness of subsequent recognition and analysis, avoiding the low accuracy and timeliness that would result from splitting one section of the user's speech into several segments processed separately.
S203: and executing a first processing process and a second processing process in parallel, calling the first processing process to control the voice playing module to play the target language word record, calling the second processing process to identify the voice stream to be analyzed, and acquiring the target response voice.
The target tone word recording is the tone word recording that needs to be played this time. A tone word recording is a recording, recorded in advance, that corresponds to tone words (modal particles or filler sounds) and is related to the flow of conversation.
The target response voice is the voice that responds according to the user intention determined by recognizing and analyzing the to-be-analyzed voice stream. For example, if the user intention corresponding to the speech content of the to-be-analyzed voice stream is "I want to know the profitability of product A", the target response voice is "the profitability of product A is ……", so that the user intention in the to-be-analyzed voice stream is answered intelligently instead of manually, saving labor cost.
The first processing process is a process created on the processor of the intelligent interaction device or of the server and used to control the voice playing module. The second processing process is a process created on the processor of the intelligent interaction device or of the server and used to recognize the to-be-analyzed voice stream.
As an example, after obtaining the to-be-analyzed voice stream, the processor of the intelligent interaction device, or the server connected to it, creates (or calls pre-created) first and second processing processes and executes them in parallel: the first processing process controls the voice playing module to play the target tone word recording, while the second processing process recognizes the to-be-analyzed voice stream to obtain the target response voice. Playback of the target tone word recording and recognition of the to-be-analyzed voice stream thus proceed in parallel, so the recording is played within the pause response time of the recognition and analysis. The intelligent interaction device therefore responds promptly, avoiding the poor user experience caused by an overlong pause response time, and playing the tone word recording makes the human-computer interaction feel more conversational, improving the user experience. The pause response time is understood as the processing time required to recognize and analyze the to-be-analyzed voice stream and to determine and play the target response voice. For example, if the pause response time for a certain to-be-analyzed voice stream is 3 s, and the processor or server starts playing a 2 s target tone word recording within 1 s after the voice stream is obtained, the perceived pause of the intelligent interaction device is shortened to within 1 s, so the user does not feel a response delay, which helps improve the user experience.
S204: and monitoring the playing state of the target language word record played by the voice playing module in real time, and controlling the voice playing module to play the target response voice if the playing state is the end of playing.
As an example, the target tone word recording can be understood as a recording played while the to-be-analyzed voice stream is being recognized and analyzed. Generally, the playing duration of the target tone word recording falls within the pause response time corresponding to the to-be-analyzed voice stream, so that the intelligent interaction device can play the target response voice immediately after the voice playing module finishes playing the target tone word recording, thereby responding in time to the user intention determined from the to-be-analyzed voice stream.
It can be understood that, after the processor of the intelligent interaction device or the server connected to it obtains the to-be-analyzed voice stream and controls the voice playing module to play the target tone word recording based on the first processing process, a state monitoring tool is invoked to monitor in real time the playing state of the voice playing module, where the playing state includes playing-in-progress and play-finished. When the playing state of the target tone word recording is play-finished, the first processing process can be called to control the voice playing module to play the target response voice corresponding to the to-be-analyzed voice stream, so that the target response voice follows the target tone word recording naturally and an overlong pause response time does not impair the user experience. The state monitoring tool is a preset tool for monitoring the playing state of the voice playing module.
In the artificial intelligence based voice response processing method provided by this embodiment, sentence integrity analysis is performed on the to-be-processed voice stream collected in real time during voice interaction to determine the to-be-analyzed voice stream, which helps improve the accuracy and timeliness of subsequent recognition and analysis. By executing the first and second processing processes in parallel, recognition of the to-be-analyzed voice stream and playback of the target tone word recording proceed simultaneously, and the recording is played within the pause response time of the analysis, shortening the response time and improving the response effect of voice interaction. When real-time monitoring shows that the playing state of the target tone word recording is play-finished, the voice playing module is controlled to play the target response voice, so that playback of the recording and of the target response voice is linked naturally, improving the response effect of voice interaction.
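Steps S203 and S204 amount to a play-while-recognizing pattern: one process plays a filler ("tone word") recording while another recognizes the user's speech, and the response voice is played only once the filler has finished. A minimal Python sketch of that pattern is shown below, with two threads standing in for the two processing processes; the play_filler, recognize, and play callables are hypothetical stand-ins, not interfaces from the patent.

```python
import threading
import queue
import time


def respond(voice_stream, play_filler, recognize, play):
    """Play a filler ("tone word") recording while recognizing the user's
    speech in parallel; play the answer once the filler has finished."""
    result = queue.Queue(maxsize=1)

    # First processing process: play the pre-recorded tone word recording.
    player = threading.Thread(target=play_filler)
    # Second processing process: recognize the stream, produce the response.
    recognizer = threading.Thread(
        target=lambda: result.put(recognize(voice_stream)))

    player.start()
    recognizer.start()

    # Monitor the playing state: wait until the filler playback ends,
    # then play the target response voice.
    player.join()            # playing state == play-finished
    response = result.get()  # blocks if recognition is still running
    play(response)
    return response
```

Here `player.join()` stands in for the state monitoring tool: it returns exactly when the playing state becomes play-finished, after which the target response voice is played.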
In an embodiment, as shown in fig. 3, step S202, namely, performing statement integrity analysis on the voice stream to be processed to obtain the voice stream to be analyzed, specifically includes the following steps:
S301: monitoring the to-be-processed voice stream using a voice activity detection algorithm to obtain voice pause points and corresponding pause durations.
A Voice Activity Detection (VAD) algorithm detects whether the current audio signal contains a speech signal, i.e., it evaluates the input signal and distinguishes speech from various background noise signals.
A voice pause point is a position in the to-be-processed voice stream where speech pauses, identified by the VAD algorithm, i.e., a position where the user stops speaking. The pause duration corresponding to a voice pause point is the time difference between the start time and the end time of the pause recognized by the VAD algorithm.
As an example, the intelligent interactive device may perform silence monitoring on the voice stream to be processed by using a voice activation detection algorithm to determine a corresponding voice pause point in the voice stream to be processed and a pause duration corresponding to any voice pause point when the user stops speaking, so as to analyze whether the user has finished speaking a sentence based on the pause duration corresponding to the voice pause point, thereby performing sentence integrity analysis.
S302: and determining the voice pause point with the pause duration larger than a preset duration threshold as a target pause point.
The preset duration threshold is a preset time threshold used to judge whether a pause means the user has finished a sentence. A target pause point is a pause position, determined by analyzing the to-be-processed voice stream, at which the user has finished a sentence.
As an example, the intelligent interaction device compares a pause duration corresponding to any voice pause point with a preset duration threshold; if the pause duration is greater than the preset duration threshold, determining that the user has already spoken a sentence, and determining the voice pause point corresponding to the pause duration as a target pause point; if the pause duration is not greater than the preset duration threshold, the user is determined not to finish a sentence, the voice pause point at the moment is a short pause in the speaking process of the user, and therefore the voice pause point corresponding to the pause duration is not determined as the target pause point.
S303: and obtaining the voice stream to be analyzed based on the two adjacent target stop points.
Specifically, after at least two target pause points are determined from the to-be-processed voice stream collected in real time, the intelligent interaction device determines the voice stream between two adjacent target pause points as the to-be-analyzed voice stream, so that each to-be-analyzed voice stream reflects a complete sentence the user wants to express. This improves the accuracy and timeliness of subsequent recognition and analysis: the silence between target pause points does not need to be recognized, which guarantees timeliness, and because each to-be-analyzed voice stream reflects a complete sentence, the accuracy of subsequent recognition and response is higher.
As an example, the intelligent interaction device treats the start time of recording the to-be-processed voice stream as the initial target pause point; it then determines the next target pause point after the initial one as the ending target pause point, and determines a to-be-analyzed voice stream based on the initial and ending target pause points; finally, it updates the ending target pause point to be the new initial target pause point and repeats the step of determining the next target pause point as the ending target pause point and extracting a to-be-analyzed voice stream. In this way, multiple to-be-analyzed voice streams are divided from the to-be-processed voice stream in real time, ensuring that each to-be-analyzed voice stream is determined promptly, which helps improve the accuracy and timeliness of its subsequent recognition and analysis.
In the artificial intelligence based voice response processing method provided by this embodiment, a VAD algorithm is first used to detect the voice pause points and corresponding pause durations in the to-be-processed voice stream collected in real time, ensuring the objectivity of the process. Voice pause points whose pause duration exceeds the preset duration threshold are determined as target pause points, avoiding the inaccurate recognition that would result from splitting the speech at pause points whose duration does not exceed the threshold. The to-be-analyzed voice stream is then determined from two adjacent target pause points, so that it reflects the complete sentence the user wants to express, improving the accuracy and timeliness of subsequent recognition and analysis.
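The pause-point logic of steps S301 to S303 can be illustrated with a toy energy-threshold VAD. This is only a sketch: a production system would use a real VAD, and the frame energies, frame length, and 500 ms threshold below are illustrative assumptions, not values from the patent.

```python
def find_pause_points(frames, energy_threshold, frame_ms=20):
    """Toy VAD: a frame is silent if its energy is below the threshold.
    Returns (pause_start_index, pause_duration_ms) for each silent run."""
    pauses = []
    start = None
    # A loud sentinel frame closes any silent run at the end of the stream.
    for i, energy in enumerate(frames + [energy_threshold + 1]):
        silent = energy < energy_threshold
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            pauses.append((start, (i - start) * frame_ms))
            start = None
    return pauses


def segment(frames, energy_threshold, min_pause_ms=500, frame_ms=20):
    """Split the stream at pause points whose duration exceeds the preset
    threshold (the target pause points), yielding the frame index ranges
    of each to-be-analyzed voice stream."""
    # Target pause points: pauses long enough to mean a finished sentence.
    targets = [(s, s + d // frame_ms)
               for s, d in find_pause_points(frames, energy_threshold, frame_ms)
               if d > min_pause_ms]
    segments = []
    prev_end = 0  # recording start acts as the initial target pause point
    for s, e in targets:
        if s > prev_end:
            segments.append((prev_end, s))
        prev_end = e
    return segments
```

Note that speech after the last target pause point is not yet emitted, matching the description above: a to-be-analyzed voice stream is only cut out once the pause that closes it has been observed.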
In an embodiment, as shown in fig. 4, the step S203 of calling the first processing process to control the voice playing module to play the target tone word recording specifically includes the following steps, executed by calling the first processing process:
S401: acquiring the voice duration corresponding to the to-be-analyzed voice stream.
Specifically, the intelligent interaction device may call the first processing process, determine the two adjacent target pause points corresponding to the to-be-analyzed voice stream, and obtain the voice duration corresponding to the voice stream from these two pause points. As an example, the intelligent interaction device determines the to-be-analyzed voice stream based on two adjacent target pause points; specifically, the voice stream from the end time of the previous target pause point to the start time of the next target pause point in the to-be-processed voice stream is the to-be-analyzed voice stream, so the time difference between these two moments can be taken as the voice duration corresponding to the to-be-analyzed voice stream. The voice duration can thus be determined simply from the start and end times of two adjacent target pause points, which keeps the computation simple and improves the efficiency of subsequent processing.
S402: and based on the voice duration query system database, determining a target language word record based on the original language word record matched with the voice duration, and controlling the voice playing module to play the target language word record.
The system database is a database arranged on, or connected to, the intelligent interaction device and used for storing the data involved in the voice interaction process. An original tone word recording is a pre-recorded recording of tone words, played when the intelligent interaction device interacts with the user. The target tone word recording is one of the original tone word recordings, specifically the one matched with the voice duration corresponding to the to-be-analyzed voice stream.
As an example, original tone word recordings with different playing durations may be pre-recorded in the system database. After the voice duration corresponding to the voice stream to be analyzed is acquired, the processing duration required to recognize and analyze the voice stream to be analyzed is estimated based on that voice duration; then the original tone word recording whose playing duration matches the estimated processing duration is selected from the system database as the target tone word recording, and the voice playing module is controlled to play it. For example, a duration comparison table storing the correspondence between the voice duration of a voice stream to be analyzed and its estimated processing duration may be pre-stored in the system database, so that the estimated processing duration can later be determined quickly by a table lookup. A playing duration "matches" the estimated processing duration when the time difference between them is minimal or falls within a preset error range, so that the target response voice links and plays more naturally after the target tone word recording, played in real time during the pause response time of the recognition and analysis process, has finished, which improves the efficiency of response processing.
Further, when at least two original tone word recordings whose playing durations match the estimated processing duration are selected from the system database, that is, when the time differences between the playing durations of at least two original tone word recordings and the estimated processing duration fall within the preset error range, one of them is selected at random as the target tone word recording, or one that differs from the target tone word recording selected last time is selected as the target tone word recording.
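The matching and tie-breaking logic of S402 can be sketched as follows; the data layout, the tolerance value, and all names are illustrative assumptions, not part of this disclosure:

```python
import random

def pick_tone_word_recording(recordings, estimated, tolerance=0.3, last=None):
    """Pick a tone-word recording whose playing duration best matches the
    estimated processing duration.

    `recordings` maps recording id -> playing duration in seconds;
    `last` is the id of the recording played last time, if any.
    """
    # Candidates whose duration differs from the estimate by at most `tolerance`.
    candidates = [rid for rid, dur in recordings.items()
                  if abs(dur - estimated) <= tolerance]
    if not candidates:
        # Fall back to the single closest recording.
        return min(recordings, key=lambda rid: abs(recordings[rid] - estimated))
    # Prefer a recording different from the one played last time.
    fresh = [rid for rid in candidates if rid != last]
    return random.choice(fresh or candidates)
```
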
In the artificial intelligence based voice response processing method provided by this embodiment, based on two adjacent target stop points corresponding to the voice stream to be analyzed, the voice duration corresponding to the voice stream to be analyzed can be quickly determined, so that the acquisition process is simple and convenient, and the efficiency is high; and then, the target tone word recording is determined based on the original tone word recording matched with the voice duration, so that the target response voice is linked and played more naturally after the target tone word recording is played, and the efficiency of response processing is improved.
In an embodiment, as shown in fig. 5, invoking the second processing process in step S203 to recognize the voice stream to be analyzed and acquire the target response voice specifically includes the following steps:
S501: and performing voice recognition on the voice stream to be analyzed to obtain a corresponding text to be analyzed.
The text to be analyzed refers to the text content determined after voice recognition is performed on the voice stream to be analyzed. In this embodiment, the process of performing voice recognition on the voice stream to be analyzed to obtain the text to be analyzed corresponding to the voice stream to be analyzed may be understood as a process of converting a voice signal of the voice stream to be analyzed into text information that can be subsequently recognized.
As an example, the intelligent interaction device may use an ASR (Automatic Speech Recognition) technology or a pre-trained static decoding network capable of implementing Speech-to-text conversion to perform Speech Recognition on a Speech stream to be analyzed, so as to quickly obtain a text to be analyzed corresponding to the Speech stream to be analyzed, so as to perform semantic analysis subsequently.
S502: and performing semantic analysis on the text to be analyzed to obtain a target intention corresponding to the text to be analyzed.
The target intention is the user intention determined after semantic analysis is carried out on the text to be analyzed. In this embodiment, the process of performing semantic analysis on the text to be analyzed to obtain the target intention corresponding to the text to be analyzed may be understood as a process of analyzing the user intention from the text information of the text to be analyzed by using an artificial intelligence technology, which is equivalent to a process of separating the user intention from the user utterance by the human brain.
As an example, the intelligent interactive device may perform semantic analysis on the text to be analyzed by using NLP (Natural Language Processing) technology or a semantic analysis model constructed in advance based on a neural network, so as to accurately and quickly determine the target intention from the text to be analyzed.
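As a toy stand-in for the semantic-analysis step, a keyword lookup can map recognized text to a target intention; a production system would use an NLP model or neural-network semantic analyzer as described above, and all names here are illustrative assumptions:

```python
def classify_intent(text, intent_keywords):
    """Minimal keyword-based intent classifier.

    `intent_keywords` maps an intention name to a list of trigger phrases;
    the first intention whose phrase appears in the text wins.
    """
    for intent, keywords in intent_keywords.items():
        if any(kw in text for kw in keywords):
            return intent
    return "unknown"
```
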
S503: and querying a system database based on the target intention to obtain a target response dialog corresponding to the target intention.
The target response dialog is the dialog with which the intelligent interaction device responds to the analyzed target intention; it exists in text form and is the reply to the target intention identified from the text to be analyzed. For example, if the target intention identified from the text to be analyzed is "the rate of return of product A", the corresponding target response dialog is "the rate of return of product A is … …"; if the target intention is "what amount of money will I repay this month", the corresponding target response dialog is "the amount you will repay this month is … …"; and so on.
As an example, after determining the target intention corresponding to the text to be analyzed, the intelligent interaction device queries the system database based on the target intention and either obtains the target response dialog corresponding to the target intention directly from the system database, or obtains the response content corresponding to the target intention from the system database and forms the target response dialog from that content.
S504: and acquiring target response voice based on the target response speech technology.
Wherein the target response voice is the voice corresponding to the target response dialog. It can be understood that the target response voice is the voice that needs to be played in real time after the pause response time corresponding to the voice stream to be analyzed when the intelligent interaction device performs human-computer interaction with the user; specifically, it is the voice responding to the target intention recognized in the voice stream to be analyzed.
As an example, determining the target response voice based on the target response dialog may be implemented by querying the system database for a pre-recorded target response voice corresponding to the target response dialog, which makes acquisition of the target response voice faster; alternatively, text-to-speech conversion may be performed on the target response dialog to obtain the corresponding target response voice, which ensures its real-time performance. The text-to-speech technology here is a technology for converting text content into speech content, such as TTS speech synthesis.
In the artificial intelligence-based voice response processing method provided by this embodiment, the target intention of the voice stream to be analyzed can be quickly determined by performing voice recognition and semantic analysis on the voice stream; then a target response dialog and the corresponding target response voice are determined based on the target intention, so that the voice stream to be analyzed, acquired and intercepted in real time by the voice recording module, is recognized, analyzed and responded to, realizing intelligent interaction. The intelligent interaction device can therefore be widely applied in scenarios that need to respond to questions that would otherwise require manual handling, such as devices placed in public places to facilitate user consultation, saving labor cost.
In an embodiment, as shown in fig. 6, step S503, that is, querying the system database based on the target intention to obtain the target response dialog corresponding to the target intention, specifically includes the following steps:
S601: and determining the intention type based on the target intention.
Wherein the intention type is the type to which the target intention belongs. As an example, intention types can be divided into general intentions and special intentions. A general intention is an intention to query general information, that is, information unrelated to any specific user, for example, an intention to query the rate of return of product A. A special intention is an intention to query special information, that is, information related to a specific user, for example, an intention to query the loan amount and repayment due date of user 1.
S602: and if the intention type is the general intention, inquiring a general dialect database based on the target intention, and acquiring a target response dialect corresponding to the target intention.
The general dialog database is a database dedicated to storing general response dialogs and is a sub-database of the system database. A general response dialog is a preset dialog used to reply to a general question.
As an example, when the target intention identified from the text to be analyzed is a general intention, the user wants to query general information unrelated to specific user information, and the corresponding general response dialogs for such information may be stored in the general dialog database. The intelligent interaction terminal can therefore query the general dialog database based on the target intention and use the general response dialog corresponding to the target intention as the target response dialog, so that the target response dialog is obtained efficiently.
S603: if the intention type is the special intention, inquiring the special information database based on the target intention to obtain an intention inquiry result, and obtaining a target response dialect corresponding to the target intention based on a dialect template corresponding to the special intention and the intention inquiry result.
The special information database is a database dedicated to storing user-specific information and is a sub-database of the system database. The user-specific information stores information related to the user, such as the user's account balance or loan amount. The dialog template corresponding to a special intention is a preset template used to form the reply to that special intention. For example, for "I want to know my monthly repayment information", the corresponding dialog template is "your monthly repayment amount is … …, and the repayment date is … …", and so on.
As an example, when the target intention identified from the text to be analyzed is a special intention, the user wants to query special information related to specific user information, and such information is generally stored in the special information database. The intelligent interaction device can therefore query the special information database based on the target intention to quickly obtain the intention query result corresponding to the special intention, and then fill the intention query result into the dialog template corresponding to the special intention to obtain the target response dialog corresponding to the target intention, guaranteeing the real-time performance of target response dialog acquisition.
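Steps S601 to S603 can be sketched together as below; the sample dialogs, templates, and names are illustrative assumptions rather than contents of any actual system database:

```python
# Illustrative stand-ins for the general dialog database and the
# dialog templates keyed by special intention.
GENERAL_DIALOGS = {"query_yield": "The rate of return of product A is 3.5%."}
TEMPLATES = {"query_repayment":
             "Your repayment amount this month is {amount}, due on {date}."}

def build_response(intent, intent_type, user_db=None):
    """A general intention is answered straight from the general dialog
    database; a special intention fills the per-user intention query result
    into the corresponding dialog template."""
    if intent_type == "general":
        return GENERAL_DIALOGS[intent]
    result = user_db[intent]          # intention query result for this user
    return TEMPLATES[intent].format(**result)
```
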
In the artificial intelligence based voice response processing method provided by this embodiment, the corresponding target response dialog is determined according to the intention type of the target intention identified from the text to be analyzed, using the processing mode for a general intention or a special intention respectively, so as to ensure both the acquisition efficiency and the real-time performance of the target response dialog.
In an embodiment, as shown in fig. 7, step S504, namely obtaining the target response voice based on the target response dialog, specifically includes the following steps:
S701: and if the intention type is the general intention, querying the system database based on the target response dialog, and determining the general response voice recording corresponding to the target response dialog as the target response voice.
As an example, when the target intention identified from the text to be analyzed is a general intention, the general response dialog corresponding to the target intention in the general dialog database may be determined as the target response dialog, so that the target response dialog is obtained more efficiently. To further improve the acquisition efficiency of the response voice, a general response recording corresponding to each general response dialog can be pre-recorded and stored in the system database; when a general response dialog is determined as the target response dialog, its pre-recorded general response recording can be directly determined as the target response voice, improving the acquisition efficiency of the target response voice.
S702: and if the intention type is the special intention, performing voice synthesis on the target response dialogue to obtain target response voice.
As an example, when the target intention identified from the text to be analyzed is a special intention, the target response dialog determined based on the target intention is the text content formed by filling the intention query result corresponding to the target intention into the dialog template. In this case no target response voice corresponding to the target response dialog exists in the system database, so text-to-speech conversion is performed on the target response dialog to acquire the corresponding target response voice, ensuring its real-time performance. The text-to-speech technology here is a technology for converting text content into speech content, such as TTS speech synthesis.
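The dispatch in S701 and S702 (prerecorded audio for a general response, on-the-fly synthesis for a special one) can be sketched as below; the callback-based shape and all names are illustrative assumptions:

```python
def response_voice(intent_type, dialog, prerecorded, synthesize):
    """Return the audio for a response dialog.

    `prerecorded` maps dialog text -> stored audio bytes (the pre-recorded
    general response recordings); `synthesize` is a callable standing in
    for a TTS engine, used when no recording exists.
    """
    if intent_type == "general" and dialog in prerecorded:
        return prerecorded[dialog]
    return synthesize(dialog)
```
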
In the artificial intelligence based voice response processing method provided by this embodiment, according to the intention type of the target intention identified from the text to be analyzed: when the intention type is the general intention, the general response recording can be used directly as the target response voice to improve its acquisition efficiency; when the intention type is the special intention, text-to-speech conversion is performed on the target response dialog to obtain the target response voice and improve its real-time performance.
In an embodiment, as shown in fig. 8, step S204, namely monitoring in real time the playing state of the voice playing module playing the target tone word recording and, if the playing state is the end of playing, controlling the voice playing module to play the target response voice, specifically includes the following steps:
s801: and the real-time monitoring voice playing module plays the playing state of the target language word record, and if the playing state is the end of playing, whether the target response voice can be acquired within a preset time period is judged.
The target tone word recording is played through the voice playing module during the pause response time from acquiring the voice stream to be analyzed to playing the target response voice, so that the transition is natural and the poor user experience caused by an overlong pause response time is avoided.
Therefore, the intelligent interaction device can monitor, in real time and with a preset state monitoring tool, the playing state of the voice playing module playing the target tone word recording. If the playing state is the end of playing, it judges whether the target response voice can be acquired within the preset time period and performs subsequent processing according to the judgment result. If the playing state is that playing has not finished, it continues waiting until the playing state is monitored as the end of playing, and then judges whether the target response voice can be acquired within the preset time period.
S802: and if the target response voice can be acquired within the preset time period, playing the target response voice in real time.
As an example, if the intelligent interaction device can obtain the target response voice within the preset time period, it plays the target response voice in real time once obtained, realizing real-time switching from playing the target tone word recording to playing the target response voice, so that the device responds in time and an overlong pause response time does not affect the user experience.
S803: and if the target response voice cannot be acquired within the preset time period, executing an emergency treatment mechanism.
The emergency processing mechanism is a preset processing mechanism used when the target response voice cannot be acquired within the preset time period. As an example, if the intelligent interaction device cannot acquire the target response voice within the preset time period, the tone word playing count may be acquired. If the tone word playing count is smaller than a preset count threshold, the next tone word recording is played at random, so that the intelligent interaction device responds in time before the target response voice is played and the user is not left waiting with no response. If the tone word playing count is not smaller than the preset count threshold, a fault prompt voice is played, so that the user knows in time whether the intelligent interaction device has failed and avoids waiting indefinitely. The tone word playing count is the number of tone word recordings that have been played so far. The fault prompt voice is a preset voice for prompting that the device has failed, and may correspond to the specific fault cause preventing acquisition of the target response voice.
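The monitoring-and-fallback flow of S801 to S803, including the emergency mechanism's play-count threshold, can be sketched as follows; the callback signatures and all names are illustrative assumptions:

```python
def respond_after_filler(fetch_response, play, timeout_s, plays_so_far,
                         max_filler_plays, next_filler, fault_prompt):
    """After the tone-word recording finishes, try to fetch the target
    response voice within `timeout_s` seconds; on timeout, either play
    another filler recording (while under the play-count threshold) or
    play a fault prompt.

    `fetch_response(timeout=...)` returns audio bytes or None on timeout;
    `play` sends audio to the voice playing module.
    """
    voice = fetch_response(timeout=timeout_s)
    if voice is not None:
        play(voice)
        return "answered"
    if plays_so_far < max_filler_plays:
        play(next_filler)
        return "filler"
    play(fault_prompt)
    return "fault"
```
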
In the artificial intelligence based voice response processing method provided by this embodiment, the target response voice or the voice corresponding to the emergency processing mechanism is played respectively according to the determination result of whether the target response voice can be obtained within the preset time period after the target language word recording is played, so as to implement timely response to the voice stream to be analyzed formed by the user speaking, and improve the response efficiency.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an artificial intelligence based voice response processing apparatus is provided, and the artificial intelligence based voice response processing apparatus is in one-to-one correspondence with the artificial intelligence based voice response processing method in the above embodiment. As shown in fig. 9, the artificial intelligence based voice response processing apparatus includes a to-be-processed voice stream obtaining module 901, a to-be-analyzed voice stream obtaining module 902, a playing and analyzing parallel processing module 903, and a response voice real-time playing module 904. The functional modules are explained in detail as follows:
a to-be-processed voice stream obtaining module 901, configured to obtain a to-be-processed voice stream collected by the voice recording module in real time.
The to-be-analyzed voice stream obtaining module 902 is configured to perform statement integrity analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream.
The play analysis parallel processing module 903 is configured to execute a first processing process and a second processing process in parallel, call the first processing process to control the voice play module to play the target voice word record, and call the second processing process to identify the voice stream to be analyzed, so as to obtain the target response voice.
And the response voice real-time playing module 904 is configured to monitor a playing state of the voice playing module playing the target voice word record in real time, and if the playing state is the end of playing, control the voice playing module to play the target response voice.
Preferably, the to-be-analyzed voice stream obtaining module 902 includes a pause duration obtaining unit, a target pause point determining unit, and a to-be-analyzed voice stream obtaining unit.
And the pause duration acquisition unit is used for monitoring the voice stream to be processed by adopting a voice activation detection algorithm and acquiring a voice pause point and corresponding pause duration.
And the target stop point determining unit is used for determining the voice stop point with the stop duration larger than a preset duration threshold as the target stop point.
And the voice stream to be analyzed acquiring unit is used for acquiring the voice stream to be analyzed based on the two adjacent target stop points.
Preferably, the playing analysis parallel processing module 903 includes a voice duration obtaining unit and a tone word playing control unit.
And the voice time obtaining unit is used for obtaining the voice time corresponding to the voice stream to be analyzed.
And the language word playing control unit is used for inquiring the system database based on the voice time length, determining a target language word record based on the original language word record matched with the voice time length, and controlling the voice playing module to play the target language word record.
Preferably, the playing analysis parallel processing module 903 includes a text to be analyzed acquisition unit, a target intention acquisition unit, a target response dialog acquisition unit, and a target response voice acquisition unit.
And the text to be analyzed acquiring unit is used for performing voice recognition on the voice stream to be analyzed to acquire a text to be analyzed corresponding to the voice stream to be analyzed.
And the target intention acquisition unit is used for performing semantic analysis on the text to be analyzed and acquiring the target intention corresponding to the text to be analyzed.
And the target response dialect obtaining unit is used for inquiring the system database based on the target intention and obtaining the target response dialect corresponding to the target intention.
And the target response voice acquisition unit is used for acquiring the target response voice based on the target response dialog.
Preferably, the target response utterance acquisition unit includes an intention type determination subunit, a general utterance determination subunit, and a special utterance determination subunit.
And the intention type determining subunit is used for determining the intention type based on the target intention.
And the universal dialect determining subunit is used for querying the universal dialect database based on the target intention and acquiring a target response dialect corresponding to the target intention if the intention type is the universal intention.
And the special-purpose dialect determining subunit is used for querying the special information database based on the target intention to obtain an intention query result if the intention type is the special intention, and obtaining a target response dialect corresponding to the target intention based on the dialect template corresponding to the special intention and the intention query result.
Preferably, the target response voice acquiring unit comprises a general voice determination subunit and a special voice determination subunit.
And the universal voice determining subunit is used for inquiring the system database based on the target response dialogues and determining the universal response voice record corresponding to the target response dialogues as the target response voice if the intention type is the universal intention.
And the special voice determining subunit is used for performing voice synthesis on the target response dialog to acquire the target response voice if the intention type is the special intention.
Preferably, the response voice real-time playing module 904 includes a response voice receiving judgment unit, a first response processing unit and a second response processing unit.
And the response voice receiving and judging unit is used for monitoring the playing state of the target language word record played by the voice playing module in real time, and judging whether the target response voice can be acquired within a preset time period if the playing state is the end of playing.
And the first response processing unit is used for playing the target response voice in real time if the target response voice can be acquired within a preset time period.
And the second response processing unit is used for executing an emergency processing mechanism if the target response voice cannot be acquired within a preset time period.
For the specific limitations of the artificial intelligence based speech response processing apparatus, reference may be made to the above limitations of the artificial intelligence based speech response processing method, which are not described herein again. The modules in the artificial intelligence based voice response processing device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the intelligent interaction device, and can also be stored in a memory in the intelligent interaction device in a software form, so that the processor can call and execute the corresponding operations of the modules.
In one embodiment, an intelligent interactive device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 10. The intelligent interaction device comprises a processor, a memory, a network interface and a database which are connected through a system bus. Wherein the processor of the intelligent interactive device is used to provide computing and control capabilities. The memory of the intelligent interaction device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the intelligent interaction device is used for storing data adopted or generated by executing the artificial intelligence-based voice response processing method process. The network interface of the intelligent interaction device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement an artificial intelligence based speech response processing method.
In an embodiment, an intelligent interaction device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the artificial intelligence based voice response processing method in the foregoing embodiments is implemented, for example, S201 to S204 shown in fig. 2, or shown in fig. 3 to fig. 8, which is not described herein again to avoid repetition. Alternatively, the processor implements the functions of each module/unit in the artificial intelligence based voice response processing apparatus in this embodiment when executing the computer program, for example, the functions of the to-be-processed voice stream acquiring module 901, the to-be-analyzed voice stream acquiring module 902, the playing and analyzing parallel processing module 903, and the response voice real-time playing module 904 shown in fig. 9, and are not described herein again to avoid repetition.
In an embodiment, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for processing a voice response based on artificial intelligence in the foregoing embodiments is implemented, for example, S201 to S204 shown in fig. 2, or shown in fig. 3 to fig. 8, which is not described herein again to avoid repetition. Alternatively, when being executed by a processor, the computer program implements the functions of the modules/units in the above-mentioned artificial intelligence-based voice response processing apparatus, such as the functions of the to-be-processed voice stream obtaining module 901, the to-be-analyzed voice stream obtaining module 902, the playing and analyzing parallel processing module 903, and the response voice real-time playing module 904 shown in fig. 9, which are not described herein again to avoid repetition.
It will be understood by those of ordinary skill in the art that all or a portion of the processes of the methods of the embodiments described above may be implemented by a computer program that may be stored on a non-volatile computer-readable storage medium, which when executed, may include the processes of the embodiments of the methods described above, wherein any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A speech response processing method based on artificial intelligence is characterized by comprising the following steps:
acquiring a voice stream to be processed that is collected by a voice recording module in real time;
performing sentence integrity analysis on the voice stream to be processed to obtain a voice stream to be analyzed;
executing a first processing process and a second processing process in parallel, invoking the first processing process to control a voice playing module to play a target script recording, and invoking the second processing process to recognize the voice stream to be analyzed and acquire a target response voice;
and monitoring, in real time, the playing state of the voice playing module playing the target script recording, and controlling the voice playing module to play the target response voice if the playing state indicates that playing has ended.
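The parallel flow of claim 1 can be sketched as follows. This is a hypothetical Python illustration, not part of the claims: `play_filler`, `recognize`, and `play` are placeholder callables standing in for the voice playing module, the second-process recognition step, and response playback.

```python
import threading
import queue

def handle_turn(voice_stream, play_filler, recognize, play):
    """Hypothetical sketch of the claimed flow: play a filler script recording
    while recognition runs in parallel, then play the response once the
    filler recording has finished playing."""
    result = queue.Queue()

    def first_process():
        play_filler()  # first process: play the target script recording

    def second_process():
        # second process: recognize the stream and produce the response voice
        result.put(recognize(voice_stream))

    t1 = threading.Thread(target=first_process)
    t2 = threading.Thread(target=second_process)
    t1.start()
    t2.start()
    t1.join()            # playing state: ended
    play(result.get())   # then play the target response voice
```

Running both threads concurrently is what hides the recognition latency behind the filler playback.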
2. The artificial intelligence based voice response processing method according to claim 1, wherein the performing sentence integrity analysis on the voice stream to be processed to obtain a voice stream to be analyzed comprises:
monitoring the voice stream to be processed by using a voice activity detection (VAD) algorithm to obtain voice pause points and their corresponding pause durations;
determining a voice pause point whose pause duration is greater than a preset duration threshold as a target pause point;
and obtaining the voice stream to be analyzed based on two adjacent target pause points.
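The pause-point logic of claim 2 can be illustrated with a toy energy-based sketch. This is a hypothetical simplification (real VAD algorithms operate on acoustic features, not raw per-frame energies); `energy_thresh` and `min_pause_frames` are illustrative stand-ins for the preset duration threshold.

```python
def find_target_pauses(frames, energy_thresh, min_pause_frames):
    """Treat runs of low-energy frames as pauses; keep only pauses whose
    duration exceeds the preset threshold (the 'target pause points')."""
    pauses, run_start = [], None
    for i, e in enumerate(frames):
        if e < energy_thresh:
            if run_start is None:
                run_start = i          # a pause begins
        else:
            if run_start is not None and i - run_start >= min_pause_frames:
                pauses.append((run_start, i))
            run_start = None
    if run_start is not None and len(frames) - run_start >= min_pause_frames:
        pauses.append((run_start, len(frames)))
    return pauses

def segment_between(frames, pauses):
    """Cut the stream between each pair of adjacent target pause points,
    yielding the voice streams to be analyzed."""
    return [frames[a_end:b_start]
            for (_, a_end), (b_start, _) in zip(pauses, pauses[1:])]
```

Filtering out short pauses prevents the system from splitting a sentence at a mid-utterance breath.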
3. The artificial intelligence based voice response processing method according to claim 1, wherein the invoking the first processing process to control a voice playing module to play a target script recording comprises:
acquiring a voice duration corresponding to the voice stream to be analyzed;
and querying a system database based on the voice duration, determining the target script recording based on an original script recording matching the voice duration, and controlling the voice playing module to play the target script recording.
4. The artificial intelligence based voice response processing method according to claim 1, wherein the invoking the second processing process to recognize the voice stream to be analyzed and acquire a target response voice comprises:
performing speech recognition on the voice stream to be analyzed to obtain a text to be analyzed corresponding to the voice stream to be analyzed;
performing semantic analysis on the text to be analyzed to obtain a target intention corresponding to the text to be analyzed;
querying a system database based on the target intention to acquire a target response script corresponding to the target intention;
and acquiring the target response voice based on the target response script.
5. The artificial intelligence based voice response processing method according to claim 4, wherein the querying a system database based on the target intention to acquire a target response script corresponding to the target intention comprises:
determining an intention type based on the target intention;
if the intention type is a general intention, querying a general script database based on the target intention to acquire the target response script corresponding to the target intention;
and if the intention type is a special intention, querying a special information database based on the target intention to obtain an intention query result, and acquiring the target response script corresponding to the target intention based on a script template corresponding to the special intention and the intention query result.
6. The artificial intelligence based voice response processing method according to claim 5, wherein the acquiring the target response voice based on the target response script comprises:
if the intention type is a general intention, querying a system database based on the target response script, and determining a general response voice recording corresponding to the target response script as the target response voice;
and if the intention type is a special intention, performing speech synthesis on the target response script to obtain the target response voice.
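The intent dispatch of claims 5 and 6 can be sketched as below. All dictionary names, keys, and the `synthesize` callable are hypothetical placeholders for the databases, templates, and TTS engine the claims refer to.

```python
def build_response(intent, intent_type, general_db, info_db, templates,
                   recordings, synthesize):
    """Hypothetical sketch: a general intent maps to a canned script whose
    pre-recorded audio can be reused directly; a special intent fills a
    script template with a per-user query result and is then synthesized."""
    if intent_type == "general":
        script = general_db[intent]          # target response script
        return recordings[script]            # reuse the stored recording
    else:
        result = info_db[intent]             # intention query result
        script = templates[intent].format(result=result)
        return synthesize(script)            # TTS only for dynamic content
```

Reserving speech synthesis for special intents keeps the common case (fixed scripts) fast, since pre-recorded audio needs no TTS latency.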
7. The artificial intelligence based voice response processing method according to claim 1, wherein the monitoring, in real time, the playing state of the voice playing module playing the target script recording, and controlling the voice playing module to play the target response voice if the playing state indicates that playing has ended comprises:
monitoring, in real time, the playing state of the voice playing module playing the target script recording, and if the playing state indicates that playing has ended, determining whether the target response voice can be acquired within a preset time period;
if the target response voice can be acquired within the preset time period, playing the target response voice in real time;
and if the target response voice cannot be acquired within the preset time period, executing an emergency handling mechanism.
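The timeout behavior of claim 7 can be sketched with a queue and an event. This is a hypothetical illustration: `playback_done` stands in for the playing-state monitor, `emergency` for the claimed emergency handling mechanism, and `timeout_s` for the preset time period.

```python
import queue

def play_when_ready(playback_done, response_queue, play, emergency,
                    timeout_s=2.0):
    """Hypothetical sketch: once the filler recording ends, wait a preset
    period for the response voice; fall back to an emergency handler if
    recognition has not produced a response in time."""
    playback_done.wait()                     # playing state: ended
    try:
        response = response_queue.get(timeout=timeout_s)
    except queue.Empty:
        emergency()                          # execute the emergency mechanism
        return
    play(response)                           # play the target response voice
```

The bounded wait ensures the caller never hears dead air longer than the preset period, even if recognition stalls.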
8. An artificial intelligence based voice response processing apparatus, characterized by comprising:
a to-be-processed voice stream obtaining module, configured to acquire a voice stream to be processed that is collected by a voice recording module in real time;
a to-be-analyzed voice stream obtaining module, configured to perform sentence integrity analysis on the voice stream to be processed to obtain a voice stream to be analyzed;
a playing-and-analysis parallel processing module, configured to execute a first processing process and a second processing process in parallel, invoke the first processing process to control a voice playing module to play a target script recording, and invoke the second processing process to recognize the voice stream to be analyzed and acquire a target response voice;
and a response voice real-time playing module, configured to monitor, in real time, the playing state of the voice playing module playing the target script recording, and control the voice playing module to play the target response voice if the playing state indicates that playing has ended.
9. An intelligent interaction device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the artificial intelligence based voice response processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the artificial intelligence based voice response processing method according to any one of claims 1 to 7.
CN202010122179.3A 2020-02-27 2020-02-27 Speech response processing method, device, equipment and medium based on artificial intelligence Pending CN111429899A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010122179.3A CN111429899A (en) 2020-02-27 2020-02-27 Speech response processing method, device, equipment and medium based on artificial intelligence
PCT/CN2021/070450 WO2021169615A1 (en) 2020-02-27 2021-01-06 Voice response processing method and apparatus based on artificial intelligence, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122179.3A CN111429899A (en) 2020-02-27 2020-02-27 Speech response processing method, device, equipment and medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN111429899A true CN111429899A (en) 2020-07-17

Family

ID=71547262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122179.3A Pending CN111429899A (en) 2020-02-27 2020-02-27 Speech response processing method, device, equipment and medium based on artificial intelligence

Country Status (2)

Country Link
CN (1) CN111429899A (en)
WO (1) WO2021169615A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283227B (en) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 Virtual character driving method and device, electronic equipment and readable storage medium
CN115827840A (en) * 2022-11-28 2023-03-21 同舟智慧(威海)科技发展有限公司 Artificial intelligence reply method based on particle swarm algorithm under semi-supervised clustering target
CN116208712A (en) * 2023-05-04 2023-06-02 北京智齿众服技术咨询有限公司 Intelligent outbound method, system, equipment and medium for improving user intention

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366729A (en) * 2012-03-26 2013-10-23 富士通株式会社 Speech dialogue system, terminal apparatus, and data center apparatus
JP2014174485A (en) * 2013-03-12 2014-09-22 Panasonic Corp Information communication terminal and dialogue method
CN107851436A (en) * 2015-07-09 2018-03-27 雅马哈株式会社 Voice interactive method and interactive voice equipment
CN109785830A (en) * 2017-11-15 2019-05-21 丰田自动车株式会社 Information processing unit
US20190354630A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Reducing negative effects of service waiting time in humanmachine interaction to improve the user experience
CN110808031A (en) * 2019-11-22 2020-02-18 大众问问(北京)信息科技有限公司 Voice recognition method and device and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429899A (en) * 2020-02-27 2020-07-17 深圳壹账通智能科技有限公司 Speech response processing method, device, equipment and medium based on artificial intelligence

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021169615A1 (en) * 2020-02-27 2021-09-02 深圳壹账通智能科技有限公司 Voice response processing method and apparatus based on artificial intelligence, device, and medium
CN111916082A (en) * 2020-08-14 2020-11-10 腾讯科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN112562663A (en) * 2020-11-26 2021-03-26 珠海格力电器股份有限公司 Voice response method and device, storage medium and electronic device
CN112463108A (en) * 2020-12-14 2021-03-09 美的集团股份有限公司 Voice interaction processing method and device, electronic equipment and storage medium
CN113160813A (en) * 2021-02-24 2021-07-23 北京三快在线科技有限公司 Method and device for outputting response information, electronic equipment and storage medium
CN113160813B (en) * 2021-02-24 2022-12-27 北京三快在线科技有限公司 Method and device for outputting response information, electronic equipment and storage medium
CN112786054A (en) * 2021-02-25 2021-05-11 深圳壹账通智能科技有限公司 Intelligent interview evaluation method, device and equipment based on voice and storage medium
CN112786054B (en) * 2021-02-25 2024-06-11 深圳壹账通智能科技有限公司 Intelligent interview evaluation method, device, equipment and storage medium based on voice
CN113393840A (en) * 2021-08-17 2021-09-14 硕广达微电子(深圳)有限公司 Mobile terminal control system and method based on voice recognition
CN113393840B (en) * 2021-08-17 2021-11-05 硕广达微电子(深圳)有限公司 Mobile terminal control system and method based on voice recognition
CN114385800A (en) * 2021-12-17 2022-04-22 阿里巴巴(中国)有限公司 Voice conversation method and device
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Also Published As

Publication number Publication date
WO2021169615A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
US5995928A (en) Method and apparatus for continuous spelling speech recognition with early identification
US6598022B2 (en) Determining promoting syntax and parameters for language-oriented user interfaces for voice activated services
US20020198714A1 (en) Statistical spoken dialog system
CN118538199A (en) Determining a language for speech recognition of a spoken utterance received via an automatic assistant interface
JP6025785B2 (en) Automatic speech recognition proxy system for natural language understanding
CN110689877A (en) Voice end point detection method and device
KR101131278B1 (en) Method and Apparatus to Improve Dialog System based on Study
KR20230002690A (en) Speech Recognition Error Correction of Speech
CN109360563A (en) Voice control method and device, storage medium and air conditioner
GB2451907A (en) Device for modifying and improving the behavior of speech recognition systems
JP2011504624A (en) Automatic simultaneous interpretation system
EP2763136B1 (en) Method and system for obtaining relevant information from a voice communication
CN114385800A (en) Voice conversation method and device
CN112185392A (en) Voice recognition processing system for power supply intelligent client
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
US20240055018A1 (en) Iterative speech recognition with semantic interpretation
CN115731915A (en) Active dialogue method and device for dialogue robot, electronic device and storage medium
CN114420103A (en) Voice processing method and device, electronic equipment and storage medium
RU2763691C1 (en) System and method for automating the processing of voice calls of customers to the support services of a company
JP7448240B2 (en) Efficient dialogue structure
KR100622019B1 (en) Voice interface system and method
CN113506565A (en) Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN110534084B (en) Intelligent voice control method and system based on FreeWITCH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200717