CN115132192A - Intelligent voice interaction method and device, electronic equipment and storage medium - Google Patents

Intelligent voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN115132192A
Authority
CN
China
Prior art keywords
voice
segment
fragment
sentence
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110309632.6A
Other languages
Chinese (zh)
Inventor
戴苏洋
刘小明
陈克寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd filed Critical Alibaba Singapore Holdings Pte Ltd
Priority to CN202110309632.6A priority Critical patent/CN115132192A/en
Publication of CN115132192A publication Critical patent/CN115132192A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/78: Detection of presence or absence of voice signals
    • G10L 25/87: Detection of discrete points within a voice signal
    • G10L 2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the present application provides an intelligent voice interaction method and apparatus, an electronic device and a storage medium. According to the provided scheme, a voice stream uttered by a user is acquired; the voice stream is divided into a plurality of first voice segments according to a first silence interval duration and into at least one second voice segment according to a second silence interval duration, the first silence interval duration being shorter than the second silence interval duration; a mood-bearing voice file (a short backchannel response) corresponding to the first voice segment is acquired; and the mood-bearing voice file corresponding to the first voice segment is played before the second voice segment is responded to. In this scheme, the user's voice stream is divided at a finer granularity, and a mood-bearing response is given in advance before the overall response is made.

Description

Intelligent voice interaction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an intelligent voice interaction method and device, electronic equipment and a storage medium.
Background
It is common practice in the industry to employ a conversational robot to interact with a user by voice. Such interaction typically takes place over a single long-lived connection, and users are sensitive to the robot's response delay. However, in the intelligent voice interaction schemes currently common in the industry, the processing link is long and the response time usually exceeds 2 s, so the user does not receive an instant response.
Based on this, a faster voice interaction scheme is needed to improve the user experience.
Disclosure of Invention
In view of the above, embodiments of the present application provide an intelligent voice interaction solution to at least partially solve the above problem.
According to a first aspect of embodiments of the present application, there is provided an intelligent voice interaction method, including:
acquiring a voice stream uttered by a user, and dividing the voice stream into a plurality of first voice segments according to a first silence interval duration;
dividing the voice stream into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration;
acquiring a mood-bearing voice file corresponding to the first voice segment; and
playing the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
According to a second aspect of the embodiments of the present application, there is provided an intelligent voice interaction apparatus, including:
a voice acquisition module, configured to acquire a voice stream uttered by a user;
a first segmentation module, configured to divide the voice stream into a plurality of first voice segments according to a first silence interval duration;
a second segmentation module, configured to divide the voice stream into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration;
a file acquisition module, configured to acquire a mood-bearing voice file corresponding to the first voice segment; and
an interaction module, configured to play the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
According to a third aspect of embodiments of the present application, there is provided an electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the intelligent voice interaction method according to the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer storage medium having a computer program stored thereon, which when executed by a processor, implements the intelligent voice interaction method according to the first aspect.
According to the scheme provided by the embodiments of the present application, a voice stream uttered by a user is acquired; the voice stream is divided into a plurality of first voice segments according to a first silence interval duration and into at least one second voice segment according to a second silence interval duration, the first silence interval duration being shorter than the second silence interval duration; a mood-bearing voice file corresponding to the first voice segment is acquired; and the mood-bearing voice file corresponding to the first voice segment is played before the second voice segment is responded to. The user's voice stream is divided at a finer granularity and a mood-bearing (backchannel) response is given in advance before the overall response is made, which shortens the time the user waits for a response and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some of the embodiments of the present application, and that other drawings can be derived from them by those skilled in the art.
FIG. 1 is a schematic flowchart of a current voice dialog process;
fig. 2 is a schematic flowchart of an intelligent voice interaction method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a dividing manner of a speech segment according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of the mood-bearing link according to an embodiment of the present application;
FIG. 5 is a logic diagram of a multi-modal detection algorithm provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an intelligent voice interaction apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to help those skilled in the art better understand the technical solutions in the embodiments of the present application, these solutions are described clearly and completely below with reference to the drawings of the embodiments. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
At present, voice dialog robots convert between the text and speech modalities by adding Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) on top of a text dialog robot, thereby achieving voice dialog capability, as shown in FIG. 1, which is a schematic flow diagram of the current voice dialog process. With this arrangement, the response time of the interaction link is often long, usually exceeding 2 s. On this basis, the present application provides a faster voice interaction scheme to improve the user experience.
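To illustrate why the conventional link is slow, the sketch below strings the three stages of the FIG. 1 pipeline together sequentially; the asr, dialog_bot and tts functions are hypothetical stand-ins, and the sleep durations only simulate typical stage latencies rather than measured values.

```python
import time

# Hypothetical stand-ins for the three stages of the conventional link in FIG. 1.
def asr(audio: bytes) -> str:
    time.sleep(0.7); return "how do I update the app"

def dialog_bot(text: str) -> str:
    time.sleep(0.9); return "You can update it from the app store."

def tts(text: str) -> bytes:
    time.sleep(0.6); return b"<synthesized reply audio>"

def handle_turn(user_audio: bytes) -> bytes:
    # The stages run strictly in sequence, so their delays add up before any
    # audio can be played back to the user (here roughly 2.2 s in total).
    text = asr(user_audio)
    reply_text = dialog_bot(text)
    return tts(reply_text)
```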
As shown in fig. 2, fig. 2 is a schematic flowchart of an intelligent voice interaction method provided in an embodiment of the present application, including:
s201, acquiring a voice stream sent by a user.
S203, the voice stream is divided into a plurality of first voice segments according to the first mute interval duration.
When a user interacts with an intelligent robot, the voice stream naturally contains silence intervals of different durations because the user pauses between utterances. Such pauses generally mark the boundaries between the phrases or sentences the user expresses. In the conventional scheme, the voice stream is divided at these pauses so that the question the user wants solved can be recognized and answered as a whole.
A silence interval means that, at every time point within the interval, the volume of the voice stream does not exceed a certain threshold; complete silence is not required.
In the present application, the voice stream may be divided based on silence interval duration as follows: if the silence interval between two time points lasts longer than a preset value, the two time points are assigned to different voice segments; otherwise, the two time points are assigned to the same voice segment.
For example, suppose the user's voice stream is "请问一下，淘宝app怎么更新" ("Excuse me, how do I update the Taobao app"); the corresponding voice stream is shown in fig. 3, which is a schematic diagram of the voice segment dividing method provided in an embodiment of the present application. The silence interval between "请问" ("excuse me") and "一下" lasts 300 ms, the interval between "一下" and "淘宝" ("Taobao") lasts 500 ms, and the silence interval between "app" and "怎么" ("how") lasts 250 ms.
On this basis, the first silence interval duration may be preset to 200 ms, so that the voice stream is divided according to the first silence interval duration into the four first voice segments "请问", "一下", "淘宝app" and "怎么更新" (roughly "excuse me" / "a question" / "the Taobao app" / "how to update").
S205, the voice stream is divided into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration.
Continuing the previous example, the voice stream is segmented, based on the second silence interval duration, into a single second voice segment: "Excuse me, how do I update the Taobao app". As shown in fig. 3, the dotted-line intervals are the plurality of first voice segments obtained by dividing the voice stream, with silence segments corresponding to the first silence interval duration between them; the solid-line interval is the second voice segment obtained by the other dividing manner (i.e., based on the second silence interval duration).
In practical applications, the first silence interval duration and the second silence interval duration may be preset based on experience and actual needs. For example, the second silence interval duration is typically 800 ms, while the first silence interval duration is 400 ms or shorter, for example 200 ms.
In the embodiment of the present application, the second silence interval duration is generally used to divide whole sentences, and each second voice segment obtained in this way is usually a complete sentence, so that the user can be understood completely and given a complete answer. That is, if the user speaks only one sentence, only one second voice segment is obtained; if the user speaks several sentences, several second voice segments are obtained correspondingly.
The first silence interval duration is used to divide the voice stream at a finer granularity, and each first voice segment obtained in this way is usually a semantic unit (typically a phrase or an incomplete sentence), so that responses can be made on the basis of these finer semantic units. Since the first silence interval duration is shorter than the second silence interval duration, a second voice segment is generally longer than a first voice segment.
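To make the two-granularity division concrete, the sketch below splits a word-level transcript by the silence between consecutive words using the two example thresholds from this embodiment (200 ms and 800 ms); the word timings are invented for illustration and are not taken from the figure.

```python
from typing import List, Tuple

# Each word is (text, start_ms, end_ms); timings are invented for illustration.
WORDS = [("excuse me", 0, 400), ("a question", 700, 1000),
         ("the Taobao app", 1500, 2300), ("how to update", 2550, 3400)]

def split_by_silence(words: List[Tuple[str, int, int]], gap_ms: int) -> List[str]:
    """Start a new segment whenever the silence before a word exceeds gap_ms."""
    segments, current = [], [words[0][0]]
    for prev, nxt in zip(words, words[1:]):
        silence = nxt[1] - prev[2]          # gap between consecutive words
        if silence > gap_ms:
            segments.append(" ".join(current))
            current = []
        current.append(nxt[0])
    segments.append(" ".join(current))
    return segments

first_segments = split_by_silence(WORDS, gap_ms=200)   # fine-grained semantic units
second_segments = split_by_silence(WORDS, gap_ms=800)  # whole sentences
print(first_segments)   # ['excuse me', 'a question', 'the Taobao app', 'how to update']
print(second_segments)  # ['excuse me a question the Taobao app how to update']
```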
S207, a mood-bearing voice file (a short backchannel response) corresponding to the first voice segment is acquired.
Specifically, the semantics of the first voice segment may be determined first, and the segment type of the first voice segment is then determined from the semantics, the segment type being either an in-sentence segment or a sentence-end segment; a mood-bearing voice file corresponding to that segment type can then be acquired.
Since the mood-bearing voice file is not itself the answer to the user's question (the voice file corresponding to the second voice segment is usually what answers the question), it does not need to be long and can be a short file.
That is, the mood-bearing voice file may be a voice file whose playing duration does not exceed a preset duration (for example, 2 s), or whose number of characters does not exceed a preset number (for example, 5 characters), and so on.
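A minimal sketch of this length constraint, assuming a candidate bearing phrase is available as text together with its playing duration; the 2 s and 5-character limits are the example values given above, not fixed requirements.

```python
MAX_DURATION_S = 2.0   # example playing-duration limit from this embodiment
MAX_CHARS = 5          # example character-count limit from this embodiment

def is_valid_bearing_phrase(text: str, duration_s: float) -> bool:
    """A mood-bearing response must stay short: it is a backchannel, not the answer."""
    return duration_s <= MAX_DURATION_S and len(text) <= MAX_CHARS

print(is_valid_bearing_phrase("mm-hm", 0.4))                    # True: short acknowledgment
print(is_valid_bearing_phrase("let me think about that", 3.0))  # False: too long for a backchannel
```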
When the first voice segment is an in-sentence segment, a short bearing word may be used to respond to the user, for example a brief acknowledgment such as "mm-hm" that signals the robot is listening and encourages the user to keep talking. The mood-bearing voice file corresponding to an in-sentence segment may also carry a particular tone and interact with the user in a declarative manner.
When the first voice segment is a sentence-end segment, a bearing word that expresses thinking or agreement can be used so that the user feels better understood and respected. To interact well at a sentence-end segment, the user intention corresponding to that segment must be understood correctly: the user intention is classified, and the mood-bearing voice file corresponding to the resulting user intention type is acquired.
Specifically, after the voice stream is divided into a plurality of first voice segments, several of which may be sentence-end segments, the voice stream between a sentence-end segment and the preceding sentence-end segment (i.e., the most recent earlier sentence-end segment in time) may be determined as the complete sentence corresponding to that sentence-end segment. If a sentence-end segment has no preceding sentence-end segment, the voice stream from the initial time up to that sentence-end segment is determined as its corresponding complete sentence.
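A minimal sketch of this rule, assuming each first segment is represented by its text and a flag indicating whether it was classified as a sentence-end segment (the classification itself is described later); the data layout is an assumption for illustration.

```python
from typing import List, Tuple

def complete_sentence(segments: List[Tuple[str, bool]], index: int) -> str:
    """Return the complete sentence for the sentence-end segment at `index`.

    Each entry is (text, is_sentence_end). The sentence spans everything after
    the previous sentence-end segment (or the start of the stream) up to and
    including the segment at `index`.
    """
    assert segments[index][1], "expected a sentence-end segment"
    start = 0
    for i in range(index - 1, -1, -1):
        if segments[i][1]:          # previous sentence-end segment found
            start = i + 1
            break
    return " ".join(text for text, _ in segments[start:index + 1])

segs = [("excuse me", False), ("a question", False),
        ("the Taobao app", False), ("how to update", True)]
print(complete_sentence(segs, 3))  # 'excuse me a question the Taobao app how to update'
```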
Furthermore, the semantics of the complete sentence can be determined, the user intention type characterized by the complete sentence is determined from those semantics, and the mood-bearing voice file corresponding to that user intention type is acquired. The correspondence between user intention types and mood-bearing voice files can be preset, and the user intention types may include greeting, confirming, denying, giving an instruction, and other types. Table 1 is a schematic table of user intention types and the bearing words contained in the corresponding voice files according to an embodiment of the present application.
(Table 1: correspondence between user intention types and mood-bearing words; reproduced as an image in the original publication.)
In addition, it should be noted that, when acquiring the mood-bearing voice file, bearing words that may carry risk can be avoided based on the user intention type. For example, if the user intention type is "instruction", words with an unambiguous commitment such as "good" or "OK" may be avoided, so that the backchannel does not imply the user's request has already been handled when the actual answer may not resolve it.
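Since Table 1 is only available as an image, the mapping below is a hypothetical illustration of the idea rather than the patent's actual table; the bearing words and the risk-word filter for the "instruction" intent are assumptions.

```python
import random

# Hypothetical mapping from user intention type to candidate bearing words;
# the real Table 1 is provided as an image in the original publication.
BEARING_WORDS = {
    "greeting":    ["Hello!", "Hi there"],
    "confirming":  ["Got it", "I see"],
    "denying":     ["Noted", "Alright"],
    "instruction": ["One moment", "Let me check"],
    "other":       ["Mm-hm", "I'm listening"],
}

# Words that promise a result; filtered out for the "instruction" intent so the
# backchannel does not imply the request has already been fulfilled.
RISK_WORDS = {"OK", "Sure", "Good"}

def pick_bearing_word(intent: str) -> str:
    candidates = BEARING_WORDS.get(intent, BEARING_WORDS["other"])
    if intent == "instruction":
        candidates = [w for w in candidates if w not in RISK_WORDS]
    return random.choice(candidates)

print(pick_bearing_word("instruction"))
```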
Further, in an embodiment, the mood-bearing voice file may be obtained directly as a pre-recorded audio file; alternatively, the text corresponding to the first voice segment is obtained, bearing audio corresponding to that text is synthesized, and the synthesized audio is taken as the corresponding mood-bearing voice file.
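A sketch of the two acquisition paths, assuming a TTS engine is available behind a generic synthesize(text) callable; both the cache layout and the synthesize function are placeholders, not an actual API.

```python
from pathlib import Path

CACHE_DIR = Path("bearing_audio")  # hypothetical directory of pre-recorded bearing files

def get_bearing_audio(bearing_text: str, synthesize) -> bytes:
    """Return audio for a bearing word: use a pre-recorded file if one exists,
    otherwise synthesize it on the fly with the provided TTS callable."""
    cached = CACHE_DIR / f"{bearing_text}.wav"
    if cached.exists():
        return cached.read_bytes()
    return synthesize(bearing_text)   # hypothetical TTS call supplied by the caller
```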
In an embodiment, the tone of voice in the mood-bearing voice file acquired for the first voice segment may also be matched to the user intention type, which further improves the user experience.
Specifically, the same bearing word may be recorded as mood-bearing voice files in several different tones; alternatively, when synthesizing the voice file from the bearing word, the tone corresponding to the user intention type may be adopted, so that the resulting mood-bearing voice file carries a tone matching that intention type.
For example, for user intention types such as greeting, thanking or confirming, a positive (e.g., cheerful) tone may be used in responding to the user. For the "denying", "instruction" and "other" intention types, a flat or declarative tone may be used.
S209, before responding to the second voice segment, the mood-bearing voice file corresponding to the first voice segment is played.
As mentioned above, the first voice segments are divided for the purpose of mood bearing and do not themselves address the user's question. The user's actual question is answered by the response obtained by processing the second voice segment in the conventional manner; that processing procedure is shown in fig. 1.
In other words, the response to the first voice segment (the mood-bearing link) and the response to the second voice segment (the conventional voice question-answer link) are two separate links that are independent of each other and do not interfere with each other. As shown in fig. 4, a schematic flowchart of the mood-bearing link according to an embodiment of the present application, the mood-bearing link is independent of the conventional voice robot's response link and can run in parallel with it; its instant response reaches the user before the response of the conventional link (i.e., the response to the second voice segment), without affecting the conventional response link.
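A minimal asyncio sketch of the two parallel links, assuming placeholder coroutines for the slow conventional answer and the fast backchannel; the coroutine names and delays are invented for illustration.

```python
import asyncio

async def conventional_link(sentence: str) -> str:
    await asyncio.sleep(2.0)                 # simulated ASR + dialog + TTS latency
    return f"Full answer to: {sentence!r}"

async def bearing_link(segment: str) -> None:
    await asyncio.sleep(0.2)                 # the short backchannel is ready almost at once
    print(f"[backchannel after {segment!r}] mm-hm")

async def handle_turn() -> None:
    # The backchannel plays while the full answer is still being prepared.
    answer_task = asyncio.create_task(conventional_link("how do I update the Taobao app"))
    await bearing_link("how to update")
    print(await answer_task)

asyncio.run(handle_turn())
```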
According to the scheme provided by the embodiment of the present application, a voice stream uttered by a user is acquired; the voice stream is divided into a plurality of first voice segments according to a first silence interval duration and into at least one second voice segment according to a second silence interval duration, the first silence interval duration being shorter than the second silence interval duration; a mood-bearing voice file corresponding to the first voice segment is acquired; and the mood-bearing voice file corresponding to the first voice segment is played before the second voice segment is responded to. Because the user's voice stream is divided at a finer granularity and a mood-bearing response is given before the overall response, the time the user waits for a response is shortened and the user experience is improved.
In an embodiment, since the division usually produces many in-sentence segments, replying to every one of them would easily tire the user. The reply frequency for first voice segments whose type is in-sentence segment therefore needs to be controlled accordingly to further improve the user experience.
Specifically, a decision algorithm can be used to determine whether an in-sentence first voice segment satisfies a preset condition, and the in-sentence mood-bearing reply is made only when the condition is satisfied, thereby controlling the frequency of in-sentence bearing.
For example, a historical response frequency or historical response proportion of replies already made to first voice segments in the voice stream is determined, and the segment is determined to satisfy the preset condition only when that historical response frequency or proportion does not exceed a preset value.
Assume a voice stream has been divided into 30 first voice segments and 3 of them have already been replied to; the historical response frequency is then 3 and the historical response proportion is 10%. If the preset value of the historical response frequency is set to 5, or the preset value of the historical response proportion is set to 0.2, the current in-sentence segment can be replied to.
As another example, a random number may be assigned to the first voice segment, and the first voice segment is determined to satisfy the preset condition only when the random number falls within a preset range. The range of random numbers thus controls the probability that a first voice segment is replied to. For example, if random numbers from 1 to 100 are drawn with equal probability, the preset range that triggers a reply can be set to [70, 100], which effectively limits the probability of replying to a first voice segment to about 30% and avoids overly frequent in-sentence replies.
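A sketch of this gating logic; the embodiment presents the count/proportion cap and the random-number check as alternatives, but the sketch below combines them for illustration, and the threshold values are the examples from the text rather than prescribed settings.

```python
import random

class InSentenceReplyGate:
    """Decide whether to play a backchannel for an in-sentence segment."""

    def __init__(self, max_replies: int = 5, max_ratio: float = 0.2,
                 lucky_range=(70, 100)):
        self.max_replies = max_replies    # example cap on historical reply count
        self.max_ratio = max_ratio        # example cap on historical reply proportion
        self.lucky_range = lucky_range    # random numbers in this range trigger a reply
        self.segments_seen = 0
        self.replies_made = 0

    def should_reply(self) -> bool:
        self.segments_seen += 1
        over_count = self.replies_made >= self.max_replies
        over_ratio = self.replies_made / self.segments_seen > self.max_ratio
        lucky = self.lucky_range[0] <= random.randint(1, 100) <= self.lucky_range[1]
        if not over_count and not over_ratio and lucky:
            self.replies_made += 1
            return True
        return False

gate = InSentenceReplyGate()
print([gate.should_reply() for _ in range(10)])
```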
As described above, in the embodiment of the present application, different reply strategies are used for in-sentence segments and sentence-end segments. Therefore, in an embodiment, a sentence-end detection algorithm may be employed to distinguish in-sentence segments from sentence-end segments. Specifically, semantic features of the text corresponding to the first voice segment are determined and audio features of the first voice segment are extracted; the first voice segment is then classified according to the semantic features and the audio features, and its segment type is determined from the classification result.
For example, the semantic features may be classified by a pre-trained text semantic classifier to obtain a semantic classification result, and the audio features may be classified by a pre-trained audio classifier to obtain an audio feature classification result. For instance, a pre-trained BERT text model may be used to classify the text of the voice segment, and a Long Short-Term Memory network (LSTM) may be used to obtain the audio representation of the micro-turn. Finally, an attention mechanism from deep learning fuses the semantic features and the audio features to perform binary classification of the first voice segment.
Fig. 5 is a logic diagram of the multi-modal detection algorithm provided by an embodiment of the present application for detecting the segment type of a first voice segment. The algorithm is trained in a multi-task fashion: classifiers for the different modalities (i.e., the classifier over semantic features and the classifier over audio features) are trained simultaneously, and their prediction results are then fused to obtain a multi-modal prediction.
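A compact PyTorch sketch of this kind of fusion model, assuming BERT sentence embeddings and frame-level audio features are computed upstream; the dimensions and layer choices are assumptions, and the attention here is a simple learned weighting over the two modality representations rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class SentenceEndDetector(nn.Module):
    """Binary classifier: is a first voice segment a sentence-end segment?"""

    def __init__(self, text_dim=768, audio_dim=40, hidden=128):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True,
                                  bidirectional=True)
        self.text_proj = nn.Linear(text_dim, 2 * hidden)
        self.attn = nn.Linear(2 * hidden, 1)        # scores each modality representation
        self.text_head = nn.Linear(2 * hidden, 2)   # per-modality heads for multi-task training
        self.audio_head = nn.Linear(2 * hidden, 2)
        self.fused_head = nn.Linear(2 * hidden, 2)

    def forward(self, text_emb, audio_frames):
        # text_emb: (B, text_dim), e.g. a BERT [CLS] embedding of the segment text
        # audio_frames: (B, T, audio_dim) frame-level features (e.g. MFCC plus deltas)
        _, (h, _) = self.audio_lstm(audio_frames)
        audio_repr = torch.cat([h[-2], h[-1]], dim=-1)        # (B, 2*hidden), both directions
        text_repr = self.text_proj(text_emb)                  # (B, 2*hidden)
        stacked = torch.stack([text_repr, audio_repr], dim=1)  # (B, 2, 2*hidden)
        weights = torch.softmax(self.attn(stacked), dim=1)     # attention over the two modalities
        fused = (weights * stacked).sum(dim=1)                 # (B, 2*hidden)
        # Multi-task outputs: one prediction per modality plus the fused prediction.
        return self.text_head(text_repr), self.audio_head(audio_repr), self.fused_head(fused)

model = SentenceEndDetector()
logits_text, logits_audio, logits_fused = model(torch.randn(4, 768), torch.randn(4, 50, 40))
```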
Further, to keep the input from being too long, either the whole first voice segment or only its last few seconds (for example, 5 s or 3 s) may be used, merged with the silence segment corresponding to the first voice segment (i.e., the silence segment, corresponding to the first silence interval duration, that immediately follows it in time), to obtain the voice sample to be classified as the model input.
The silence segment is included to capture audio cues that ASR ignores, such as breathing sounds, which improves the classification. Features are then extracted from the raw audio: the voice sample to be classified is split into equal-length sub-samples; for example, it is framed with a 40 ms window so that each frame is a 40 ms voice sub-sample, and audio features are then extracted from each frame.
Specifically, features widely used in speech recognition tasks, such as MFCC, log filterbank (logfbank), F0 and intensity, together with their delta and delta-delta values describing how the features change over time, can be selected; the feature extraction can be performed with audio processing libraries such as librosa and praat. The extracted audio features are fed into the bidirectional LSTM to obtain the audio representation of the first voice segment, which is then combined with the text prediction model through a fully connected layer and a Softmax to produce the prediction result.
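A sketch of this feature-extraction step with librosa, covering MFCCs and their delta and delta-delta values over 40 ms frames; F0 and intensity would be added analogously, and the file path and sample rate are placeholders.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, frame_ms: int = 40) -> np.ndarray:
    """Frame the sample into equal-length windows and return per-frame features
    (MFCC plus delta and delta-delta), shaped (num_frames, feature_dim)."""
    y, sr = librosa.load(wav_path, sr=16000)          # placeholder path and sample rate
    frame_len = int(sr * frame_ms / 1000)             # 40 ms -> 640 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=frame_len)
    delta = librosa.feature.delta(mfcc)               # first-order dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)     # second-order dynamics
    return np.vstack([mfcc, delta, delta2]).T         # (num_frames, 39)

features = extract_audio_features("segment_plus_silence.wav")
```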
By means of the pre-trained text model and the multi-modal feature fusion algorithm, sentence-end expressions in the voice stream are recognized and it is judged whether the user has finished speaking; the divided first voice segments can thus be accurately classified into sentence-end segments and in-sentence segments, enabling accurate follow-up responses and further improving the user experience.
The intelligent voice interaction method of this embodiment may be executed by any suitable electronic device with data processing capability, including but not limited to a server, a mobile terminal (such as a mobile phone or tablet), a PC, and so on.
An embodiment of the present application further provides an intelligent voice interaction apparatus. As shown in fig. 6, a schematic structural diagram of the intelligent voice interaction apparatus provided in an embodiment of the present application, the apparatus includes:
a voice acquisition module 609, configured to acquire a voice stream uttered by a user;
a first segmentation module 601, configured to divide the voice stream into a plurality of first voice segments according to a first silence interval duration;
a second segmentation module 603, configured to divide the voice stream into at least one second voice segment according to a second silence interval duration, where the first silence interval duration is shorter than the second silence interval duration;
a file acquisition module 605, configured to acquire a mood-bearing voice file corresponding to the first voice segment; and
an interaction module 607, configured to play the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
Optionally, the file acquisition module 605 determines the semantics of the first voice segment, determines the segment type of the first voice segment according to the semantics, where the segment type is an in-sentence segment or a sentence-end segment, and acquires a mood-bearing voice file corresponding to the segment type.
Optionally, the file acquisition module 605 acquires a mood-bearing voice file that corresponds to the segment type and whose duration does not exceed a preset duration, or acquires a mood-bearing voice file that corresponds to the segment type and whose number of characters does not exceed a preset number.
Optionally, when the segment type is a sentence-end segment, the file acquisition module 605 determines, from the voice stream, the complete sentence corresponding to the sentence-end segment, determines the user intention type characterized by the complete sentence, and acquires a mood-bearing voice file corresponding to the user intention type.
Optionally, the file acquisition module 605 determines the preceding sentence-end segment in the voice stream, and determines the voice stream between the preceding sentence-end segment and the current sentence-end segment as the complete sentence corresponding to the current sentence-end segment.
Optionally, the file acquisition module 605 determines the semantics of the complete sentence and determines the user intention type characterized by the complete sentence according to those semantics.
Optionally, in the apparatus, the user intention types include at least one of greeting, confirming, denying, or giving an instruction.
Optionally, the file acquisition module 605 determines semantic features of the text corresponding to the first voice segment, extracts audio features of the first voice segment, classifies the first voice segment according to the semantic features and the audio features, and determines the segment type of the first voice segment according to the classification result.
Optionally, the file acquisition module 605 classifies the semantic features with a pre-trained text semantic classifier to obtain a semantic classification result, classifies the audio features with a pre-trained audio classifier to obtain an audio feature classification result, and fuses the two results to determine the segment type of the first voice segment.
Optionally, the file acquisition module 605 merges the first voice segment and its corresponding silence segment to obtain a voice sample to be classified, divides the voice sample into a plurality of equal-length voice sub-samples, and extracts the audio features of the first voice segment from the sub-samples obtained by the division.
Optionally, when the segment type is an in-sentence segment, the file acquisition module 605 determines whether the first voice segment satisfies a preset condition and, when it does, acquires a mood-bearing voice file corresponding to the segment type.
Optionally, the file acquisition module 605 determines a historical response frequency or historical response proportion of replies to first voice segments in the voice stream and determines that the segment satisfies the preset condition when the historical response frequency or proportion does not exceed a preset value; or it determines a random number corresponding to the first voice segment and determines that the segment satisfies the preset condition when the random number falls within a preset range.
Optionally, the file acquisition module 605 acquires the text corresponding to the first voice segment, synthesizes bearing audio corresponding to the text, and determines the synthesized audio as the corresponding mood-bearing voice file.
The intelligent voice interaction apparatus of this embodiment is used to implement the corresponding intelligent voice interaction methods of the foregoing method embodiments and has the beneficial effects of those embodiments, which are not repeated here. Likewise, the functional implementation of each module in the apparatus can refer to the description of the corresponding part of the foregoing method embodiments and is not repeated here.
Referring to fig. 7, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, and a specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 7, the electronic device may include: a processor (processor)502, a Communications Interface 504, a memory 506, and a communication bus 508.
Wherein:
the processor 502, communication interface 504, and memory 506 communicate with one another via a communication bus 508.
A communication interface 504 for communicating with other electronic devices or servers.
The processor 502 is configured to execute the program 510, and may specifically execute the relevant steps in the foregoing embodiment of the intelligent voice interaction method.
In particular, program 510 may include program code that includes computer operating instructions.
The processor 502 may be a central processing unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The intelligent device may include one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 506 is used to store a program 510. The memory 506 may comprise a high-speed RAM and may also include a non-volatile memory, such as at least one disk memory.
The program 510 may specifically be used to cause the processor 502 to perform the following operations:
acquiring a voice stream uttered by a user;
dividing the voice stream into a plurality of first voice segments according to a first silence interval duration;
dividing the voice stream into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration;
acquiring a mood-bearing voice file corresponding to the first voice segment; and
playing the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
For the specific implementation of each step in the program 510, reference may be made to the corresponding steps and descriptions in the foregoing embodiments of the intelligent voice interaction method, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments and are likewise not repeated here.
With the electronic device of this embodiment, a voice stream uttered by a user is acquired; the voice stream is divided into a plurality of first voice segments according to a first silence interval duration and into at least one second voice segment according to a second silence interval duration, the first silence interval duration being shorter than the second silence interval duration; a mood-bearing voice file corresponding to the first voice segment is acquired; and the mood-bearing voice file corresponding to the first voice segment is played before the second voice segment is responded to. The user's voice stream is divided at a finer granularity and a mood-bearing response is given in advance before the overall response, so the time the user waits for a response is reduced and the user experience is improved.
It should be noted that, according to implementation needs, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The methods according to the embodiments of the present application described above may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be understood that a computer, processor, microprocessor controller or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the intelligent voice interaction method described herein. Furthermore, when a general-purpose computer accesses code for implementing the intelligent voice interaction method shown herein, execution of that code turns the general-purpose computer into a special-purpose computer for performing the method.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of the patent protection of the embodiments of the present application should be defined by the claims.

Claims (16)

1. An intelligent voice interaction method, comprising:
acquiring a voice stream uttered by a user;
dividing the voice stream into a plurality of first voice segments according to a first silence interval duration;
dividing the voice stream into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration;
acquiring a mood-bearing voice file corresponding to the first voice segment; and
playing the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
2. The method of claim 1, wherein acquiring the mood-bearing voice file corresponding to the first voice segment comprises:
determining the semantics of the first voice segment, and determining a segment type of the first voice segment according to the semantics, wherein the segment type comprises an in-sentence segment or a sentence-end segment; and
acquiring a mood-bearing voice file corresponding to the segment type.
3. The method of claim 2, wherein acquiring the mood-bearing voice file corresponding to the segment type comprises:
acquiring a mood-bearing voice file which corresponds to the segment type and whose duration does not exceed a preset duration, or acquiring a mood-bearing voice file which corresponds to the segment type and whose number of characters does not exceed a preset number.
4. The method according to claim 2, wherein, when the segment type is a sentence-end segment, acquiring the mood-bearing voice file corresponding to the segment type comprises:
determining, from the voice stream, a complete sentence corresponding to the sentence-end segment;
determining a user intention type characterized by the complete sentence; and
acquiring a mood-bearing voice file corresponding to the user intention type.
5. The method according to claim 4, wherein determining, from the voice stream, the complete sentence corresponding to the sentence-end segment comprises:
determining a preceding sentence-end segment in the voice stream, and determining the voice stream between the preceding sentence-end segment and the sentence-end segment as the complete sentence corresponding to the sentence-end segment.
6. The method of claim 4, wherein determining the user intention type characterized by the complete sentence comprises:
determining the semantics of the complete sentence, and determining the user intention type characterized by the complete sentence according to the semantics of the complete sentence.
7. The method of claim 6, wherein the user intention type comprises at least one of greeting, confirming, denying, or giving an instruction.
8. The method of claim 2, wherein determining the segment type of the first voice segment according to the semantics comprises:
determining semantic features of a text corresponding to the first voice segment, and extracting audio features of the first voice segment; and
classifying the first voice segment according to the semantic features and the audio features, and determining the segment type of the first voice segment according to the classification result.
9. The method of claim 8, wherein classifying the first voice segment according to the semantic features and the audio features and determining the segment type of the first voice segment according to the classification result comprises:
classifying the semantic features with a text semantic classifier obtained by pre-training to obtain a semantic classification result;
classifying the audio features with an audio classifier obtained by pre-training to obtain an audio feature classification result; and
fusing the semantic classification result and the audio feature classification result to determine the segment type of the first voice segment.
10. The method of claim 8, wherein extracting the audio features of the first voice segment comprises:
merging the first voice segment and the corresponding silence segment to obtain a voice sample to be classified; and
dividing the voice sample into a plurality of equal-length voice sub-samples, and extracting the audio features of the first voice segment from the voice sub-samples obtained by the division.
11. The method according to claim 2, wherein, when the segment type is an in-sentence segment, acquiring the mood-bearing voice file corresponding to the segment type comprises:
determining whether the first voice segment satisfies a preset condition, and acquiring the mood-bearing voice file corresponding to the segment type when the first voice segment satisfies the preset condition.
12. The method of claim 11, wherein determining whether the first voice segment satisfies the preset condition comprises:
determining a historical response frequency or a historical response proportion of replies to first voice segments in the voice stream, and determining that the first voice segment satisfies the preset condition when the historical response frequency or the historical response proportion does not exceed a preset value;
or determining a random number corresponding to the first voice segment, and determining that the first voice segment satisfies the preset condition when the random number falls within a preset range.
13. The method according to claim 1, wherein acquiring the mood-bearing voice file corresponding to the first voice segment comprises:
acquiring a text corresponding to the first voice segment, synthesizing bearing audio corresponding to the text, and determining the bearing audio as the corresponding mood-bearing voice file.
14. An intelligent voice interaction device, comprising:
a voice acquisition module, configured to acquire a voice stream uttered by a user;
a first segmentation module, configured to divide the voice stream into a plurality of first voice segments according to a first silence interval duration;
a second segmentation module, configured to divide the voice stream into at least one second voice segment according to a second silence interval duration, wherein the first silence interval duration is shorter than the second silence interval duration;
a file acquisition module, configured to acquire a mood-bearing voice file corresponding to the first voice segment; and
an interaction module, configured to play the mood-bearing voice file corresponding to the first voice segment before responding to the second voice segment.
15. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus; and
the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform the operations corresponding to the intelligent voice interaction method according to any one of claims 1-13.
16. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the intelligent voice interaction method of any one of claims 1-13.
CN202110309632.6A 2021-03-23 2021-03-23 Intelligent voice interaction method and device, electronic equipment and storage medium Pending CN115132192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110309632.6A CN115132192A (en) 2021-03-23 2021-03-23 Intelligent voice interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110309632.6A CN115132192A (en) 2021-03-23 2021-03-23 Intelligent voice interaction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115132192A (en) 2022-09-30

Family

ID=83374458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110309632.6A Pending CN115132192A (en) 2021-03-23 2021-03-23 Intelligent voice interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115132192A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116798427A (en) * 2023-06-21 2023-09-22 支付宝(杭州)信息技术有限公司 Man-machine interaction method based on multiple modes and digital man system

Similar Documents

Publication Publication Date Title
US10937448B2 (en) Voice activity detection method and apparatus
US20220013111A1 (en) Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
CN108182937B (en) Keyword recognition method, device, equipment and storage medium
US9805715B2 (en) Method and system for recognizing speech commands using background and foreground acoustic models
US20170084274A1 (en) Dialog management apparatus and method
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
CN113327609B (en) Method and apparatus for speech recognition
CN111312231B (en) Audio detection method and device, electronic equipment and readable storage medium
CN111797632B (en) Information processing method and device and electronic equipment
CN109545197B (en) Voice instruction identification method and device and intelligent terminal
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN113643693B (en) Acoustic model conditioned on sound characteristics
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
CN113362828A (en) Method and apparatus for recognizing speech
CN112017648A (en) Weighted finite state converter construction method, speech recognition method and device
EP4285358A1 (en) Instantaneous learning in text-to-speech during dialog
CN112927679A (en) Method for adding punctuation marks in voice recognition and voice recognition device
CN115132192A (en) Intelligent voice interaction method and device, electronic equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
CN116564286A (en) Voice input method and device, storage medium and electronic equipment
CN113327596B (en) Training method of voice recognition model, voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240320

Address after: 51 Bras Basah Road, #03-06 Lazada One, Singapore 189554

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: #45-01, AXA Tower, 8 Shenton Way, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore

TA01 Transfer of patent application right