CN112053687A - Voice processing method and device, computer readable storage medium and equipment

Voice processing method and device, computer readable storage medium and equipment

Info

Publication number
CN112053687A
CN112053687A (application CN202010758331.7A)
Authority
CN
China
Prior art keywords
voice
mute
processed
determining
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010758331.7A
Other languages
Chinese (zh)
Inventor
李倩
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd
Priority to CN202010758331.7A
Publication of CN112053687A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice processing method, apparatus, and device. The method comprises: receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed. By scoring the completeness of the received voice with a semantic integrity model, the method effectively recognizes semantically incomplete utterances and dynamically adjusts the mute waiting duration accordingly: the user is not interrupted before finishing speaking, and the mute duration is shortened when the semantics are judged complete, which improves interaction efficiency and greatly improves user experience.

Description

Voice processing method and device, computer readable storage medium and equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech processing method, apparatus, computer-readable storage medium, and device.
Background
With the development of speech recognition technology, intelligent voice service systems that enable voice interaction between humans and machines are being applied in more and more scenarios, for example intelligent customer service and intelligent robots. In a voice interaction scenario, the system must automatically judge whether the user has stopped speaking; once it determines that the user has finished expressing an idea, it automatically executes the next round of information interaction, for example answering the user's question. At present, this judgment mainly relies on a mute duration of fixed length. For example, in the voice interaction system of an intelligent customer service, the mute duration is set to a fixed value (for example, around 200 milliseconds). If the set mute duration is 200 milliseconds, then after the user finishes an utterance, if no other valid speech is received within 200 milliseconds, the user is considered to have stopped speaking, the speech recognition task ends, an NLP (Natural Language Processing) task is invoked, and the user's question is answered.
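For reference, this conventional fixed-silence-timeout judgment can be sketched as follows. This is an illustrative Python sketch only; the `vad_stream.poll()` interface is a hypothetical placeholder for a per-frame voice-activity check, not part of any system cited here.

```python
import time

SILENCE_TIMEOUT_MS = 200  # fixed, manually tuned value in conventional systems

def user_stopped_speaking(vad_stream) -> bool:
    """Return True once no valid speech has been heard for SILENCE_TIMEOUT_MS."""
    last_speech = time.monotonic()
    while True:
        if vad_stream.poll():  # hypothetical check: does the current frame contain speech?
            last_speech = time.monotonic()
        elif (time.monotonic() - last_speech) * 1000 >= SILENCE_TIMEOUT_MS:
            return True  # end the ASR task and hand the utterance to the NLP task
```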
At present, the mute duration is tuned by manual experience. If the set value is large, it takes a long time to decide that the user has finished speaking, the user's actual waiting time is lengthened, and the user experience suffers. If the set value is small, the user's actual waiting time becomes short, but the user is then very likely to be judged to have stopped speaking when he has merely paused briefly to breathe before finishing what he intends to say. For example, when the user pauses for breath, the system judges that the user has stopped speaking and forces the conversation into the next round of interaction, interrupting the user. This easily causes the problem of missing semantics, so that the next round of information interaction cannot proceed normally.
Disclosure of Invention
Embodiments of the present invention provide a speech processing method, apparatus, computer-readable storage medium, and device to solve the above problems in speech processing.
According to a first aspect of the present invention, there is provided a speech processing method, the method comprising: receiving voice data of a voice to be processed, where the voice to be processed is a voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object; determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed includes: determining the confidence interval to which the confidence belongs; and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed includes: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the semantic integrity model is a BERT model optimized by at least one of the following operations: defining the corpus length in the model input as the actual length of the corpus to be trained, where the corpus to be trained has been semantically annotated; and reducing the number of model layers in the model.
According to an embodiment of the present invention, the corpus to be trained includes positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora include corpora obtained by at least one of the following operations: obtaining corpora by means of a loss function; and obtaining corpora by means of a hard-sample mining technique.
According to a second aspect of the present invention, there is also provided a speech processing apparatus, the apparatus comprising: a receiving module, configured to receive voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; an integrity determination module, configured to determine, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and a mute duration determination module, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the mute duration determination module includes: a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs; and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the mute duration determination module includes: a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to a third aspect of the present invention, there is also provided a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the speech processing methods described above.
According to a fourth aspect of the present invention, there is also provided a device comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to execute the above speech processing method.
The voice processing method, apparatus, and device are mainly applied to voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object. According to the received voice data, a semantic integrity model determines the confidence that the voice to be processed is complete speech, and the mute waiting duration for the second object to respond is dynamically adjusted according to that confidence. Semantically incomplete voice information is effectively recognized and the mute waiting duration adjusted accordingly, so the user is not interrupted before finishing speaking; when the semantics are judged complete, the mute duration is shortened. Interaction efficiency is thereby improved and user experience greatly enhanced.
It is to be understood that implementations of the present invention need not achieve all of the above benefits; rather, specific embodiments may achieve specific technical results, and other embodiments of the present invention may achieve benefits not mentioned above.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart of a voice processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the composition of a voice processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the composition of a device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given only to enable those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
To better describe the specific scheme of the embodiments of the present invention, a typical application scenario is first described by way of example. It should be noted that the present invention is not limited to the following application scenario and may also be applied to other suitable scenarios.
Intelligent systems with natural-language recognition functions, such as intelligent customer service, intelligent robots, and communication secretaries, engage in real voice interaction with users and must converse according to the received voice. The intelligent system therefore needs to judge whether the user's current utterance has ended and decide how long to wait before responding. Because the system is waiting for further voice content from the user, this waiting is usually silent, and its length is called the mute waiting duration. If it is set long, interaction efficiency is low; if it is set too short, the user's speech may be interrupted. The embodiments of the present invention judge the semantic integrity of the voice according to the received voice content and dynamically adjust the mute waiting duration accordingly, effectively improving both interaction efficiency and user experience.
The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech processing method according to an embodiment of the present invention.
Referring to fig. 1, the speech processing method according to the embodiment of the present invention includes at least the following operations: operation 101, receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; operation 102, determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and operation 103, determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
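As an illustrative, non-normative sketch, the three operations can be strung together as follows; the helper names (`asr_decode`, `semantic_integrity_model`, `choose_mute_wait`) are hypothetical placeholders for an ASR decoder, the trained model, and the confidence-to-duration mapping elaborated under operation 103 below.

```python
# Illustrative sketch only: the injected helpers are hypothetical, not part of
# the patent; they stand for an ASR decoder, the semantic integrity model, and
# the confidence-to-duration mapping.
def handle_utterance(audio, asr_decode, semantic_integrity_model, choose_mute_wait):
    text = asr_decode(audio)                      # operation 101: receive and decode the voice data
    confidence = semantic_integrity_model(text)  # operation 102: confidence that the speech is complete
    return choose_mute_wait(confidence)          # operation 103: dynamic mute waiting duration (ms)
```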
In operation 101, voice data of a to-be-processed voice, which is a voice uttered by a first object during a number of rounds of voice interaction of the first object and a second object, is received.
In an embodiment of the present invention, the multiple rounds of voice interaction between the first object and the second object may be a conversation between a user and an intelligent customer service when the user calls customer service from a fixed-line telephone or a mobile phone. It may also be a session between two users of a mobile service provider (over mobile phone, internet phone, fixed-line phone, and so on) in which one user has activated an intelligent service function. For example: user A calls user B; user B has activated an intelligent service function provided by the mobile operator or by an intelligent terminal (a mobile phone, a tablet computer, a fixed-line handset, and so on). When it is inconvenient for user B to answer, the intelligent terminal automatically answers user A's call and interacts with user A according to the content of user A's speech. In addition to calls dialed from a telephone or mobile phone, the interaction may also take place through an application, for example an instant messaging application.
In operation 102, a confidence level that the speech to be processed is complete speech is determined using the semantic integrity model based on the speech data.
In an embodiment of the present invention, a BERT (Bidirectional Encoder Representations from Transformers) model is optimized by at least one of the following operations to obtain the semantic integrity model: defining the corpus length in the model input as the actual length of the corpus to be trained, where the corpus to be trained has been semantically annotated; and reducing the number of model layers in the model.
BERT is a pre-trained language representation model open-sourced by Google. It differs from other language models in two significant ways. First, when training the bidirectional language model, it masks a small proportion of the input tokens, replacing each with a mask token or, with smaller probability, a random word, which forces the BERT model to rely more heavily on context. Second, it adds a loss term for predicting the next sentence. The BERT model is deep (the base model has 12 layers) rather than wide. BERT belongs to the family of masked language models (MLM) and can simultaneously use the words to the left and right of the position being processed. However, current BERT models take fixed-length corpora as input and do not train on the actual length of each corpus.
In an embodiment of the present invention, the corpus length in the model input is defined as the actual length of the corpus to be trained, the corpus to be trained being a corpus on which semantic annotation has been completed.
For example, a batch of training voices is obtained and labeled as corpora, for example conversation voices between users and the intelligent customer service. If the semantics of the user's speech are complete, the utterance is labeled complete. If the user's utterance is audibly unfinished, that is, the user is clearly still speaking when the audio ends, it is labeled incomplete. The corpus length in the model input is then defined as the actual length of the corpus to be trained. Building on the large amount of prior sentence knowledge already encoded in the BERT model, corpora from the specific scenario are added, a new classification task based on semantic integrity is defined, and a semantic integrity model with higher recognition accuracy can be obtained.
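A minimal sketch of this data preparation, assuming the HuggingFace `transformers` toolkit (the patent does not name a toolkit); the texts and labels are invented illustrations of complete versus incomplete utterances, not patent data:

```python
# Assumption: HuggingFace transformers; the texts/labels are illustrative.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

texts = ["I want to book the day after tomorrow",  # semantics complete
         "I want to book"]                         # semantics missing
labels = [1, 0]                                    # 1 = complete, 0 = incomplete

# padding="longest" pads each batch only to the actual length of its longest
# corpus, rather than to a fixed maximum length, matching the variable-length
# input described above.
batch = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")
```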
In an embodiment of the invention, the number of model layers in the BERT model is reduced, and the BERT model is optimized to obtain a semantic integrity model.
For a deep BERT model, the depth brings a large amount of computation to language processing. If an off-the-shelf BERT model is used to judge the integrity of the voice to be processed during voice interaction, the judgment may take so long that it harms the user experience. For example, suppose judging whether the semantics of a received voice are complete takes 300 ms, and the mute waiting duration for semantically complete voice is set to 200 ms. By the time the semantics are judged complete and the 200 ms wait is chosen, 300 ms have already elapsed, and the judgment has lost its purpose. Therefore, in an embodiment of the present invention, the existing BERT model is optimized by reducing the number of model layers to obtain the semantic integrity model, for example reducing the original 12-layer BERT model to a semantic integrity model with only 3 layers.
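One way to realize this layer reduction, sketched under the assumption of the HuggingFace `transformers` API (the patent prescribes only the idea of reducing 12 encoder layers to 3, not a specific toolkit):

```python
# Assumption: HuggingFace transformers API.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

# Keep only the first 3 of the 12 encoder layers and record the new depth.
model.bert.encoder.layer = torch.nn.ModuleList(model.bert.encoder.layer[:3])
model.config.num_hidden_layers = 3
```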
In one embodiment of the present invention, the model processing speed is further increased by applying fixed-point (quantization) processing to the model. For example, parameters in the model may be converted from floating-point numbers (real numbers) to integers, reducing the amount of model computation.
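As one possible realization of this fixed-point step (the patent only states that floating-point parameters become integers), PyTorch dynamic quantization can be sketched as:

```python
# Assumption: PyTorch dynamic quantization as one concrete way to turn the
# floating-point weights into int8 integers; `model` is the classifier above.
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,              # the (layer-reduced) BERT classifier
    {torch.nn.Linear},  # quantize the weights of all Linear layers
    dtype=torch.qint8,
)
```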
In an embodiment of the present invention, the corpus to be trained includes positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora include corpora obtained by at least one of the following operations: obtaining corpora by means of a loss function; and obtaining corpora by means of a hard-sample mining technique.
For example, a batch of training voices is obtained and labeled as corpora, for example conversation voices between users and the intelligent customer service. If the semantics of the user's speech are complete, the utterance is labeled as positive corpus. If the user's utterance is incomplete and semantics are missing, that is, the user can clearly be heard still speaking when the audio ends, it is labeled as negative corpus. In practice, negative corpora with missing semantics are relatively scarce; for example, they may account for only 10% of all acquired corpora. Some operations are therefore needed to enlarge the negative corpus and alleviate the imbalance between positive and negative sample sizes. For example, training voices can be randomly truncated, taking the first part of the truncated voice as negative corpus; truncation points can be selected with the help of a loss function; and more negative corpora can be obtained using hard-sample mining techniques.
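A minimal sketch of the random-truncation augmentation; the cut-point policy here (keeping a prefix of at most half the transcript) and the sample data are illustrative assumptions, not specified by the patent:

```python
# Illustrative sketch: `complete_transcripts` stands for the transcripts
# labeled as semantically complete; the cut-point policy is an assumption.
import random

complete_transcripts = ["I want to book the day after tomorrow"]  # illustrative

def make_negative_example(transcript: str) -> str:
    """Truncate a complete transcript so its semantics become incomplete."""
    cut = random.randint(1, max(1, len(transcript) // 2))  # keep an early prefix
    return transcript[:cut]

negatives = [make_negative_example(t) for t in complete_transcripts]
```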
After the corpora are prepared, training is performed. In the embodiment of the present invention, a binary classification model based on the BERT structure is trained: a classifier is trained on the semantically annotated corpora, and the probability output of the classifier is taken as the confidence of semantic completeness. For example, the model output might be "probability of complete semantics: 80%; probability of missing semantics: 20%".
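An inference-time sketch of reading the classifier's probability output as the confidence, reusing the `tokenizer` and `model` objects from the sketches above (the index convention follows the illustrative labeling, 1 = complete):

```python
# Sketch: softmax over the two-class logits is read as
# [P(semantics missing), P(semantics complete)] under the labeling used above.
import torch

def semantic_confidence(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[1].item()  # confidence that the semantics are complete
```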
In addition, since the original BERT model has large latency in actual use, fixed-point processing may be applied: real numbers among the model parameters are rounded and represented as integers, greatly reducing the amount of model computation. Variable-length input may also be adopted, defining the corpus length in the model input as the actual length of the semantically annotated corpus to be trained. Further, the number of model layers may be optimized, for example reducing the original 12-layer BERT model to a 3-layer semantic integrity model. These optimizations speed up model processing so that the trained semantic integrity model can be applied more effectively.
In operation 103, a mute wait time for the second object to respond to the speech to be processed is determined according to the confidence level.
In an embodiment of the present invention, determining the mute waiting duration for the second object to respond to the voice to be processed according to the confidence is implemented by the following operations: determining the confidence interval to which the confidence belongs; and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
For example, semantic-level analysis is performed on the currently decoded voice to be processed, and the analysis outputs a binary judgment together with a confidence, the binary judgment being whether the received voice to be processed is complete speech. The output information is [probability of complete semantics: X%; probability of missing semantics: (1-X)%]. For instance, an utterance such as "I want to book the day after tomorrow" is semantically complete, whereas "I want to book" is semantically incomplete, so the probability that the user will continue speaking is high. The confidence is the degree of certainty given for semantic completeness.
For example, the following presets may be configured:
T = 400 ms, S ∈ [0, 60%)
T = 300 ms, S ∈ [60%, 80%)
T = 200 ms, S ∈ [80%, 100%]
where T denotes the mute waiting duration and S denotes the confidence of complete semantics.
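Encoded directly, this first relation is a simple interval lookup; in the sketch below, the thresholds and durations are taken from the example preset above and are tunable values, not normative constants:

```python
def mute_wait_from_interval(confidence: float) -> int:
    if confidence < 0.60:
        return 400  # ms: semantics likely incomplete, wait longer
    elif confidence < 0.80:
        return 300  # ms
    else:
        return 200  # ms: semantics judged complete, respond quickly
```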
In an embodiment of the present invention, determining the mute waiting duration for the second object to respond to the voice to be processed according to the confidence is implemented by the following operation: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
For example, the second relation between the confidence and the mute waiting duration may be preset as T = f(S), where T denotes the mute waiting duration and S denotes the confidence of complete semantics. f may be a simple linear function, or a nonlinear function obtained through experimentation; the present invention is not limited in this respect.
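As an illustration, one simple linear choice of f, interpolating from 400 ms at S = 0 down to 200 ms at S = 1 to stay consistent with the interval example above (the patent leaves the exact form open):

```python
# One possible linear form of T = f(S); the endpoints are an illustrative
# assumption chosen to match the interval example, not a prescribed choice.
def mute_wait_linear(confidence: float) -> float:
    t_max, t_min = 400.0, 200.0  # ms
    return t_max - (t_max - t_min) * confidence
```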
The voice processing method, apparatus, and device are mainly applied to voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object. According to the received voice data, a semantic integrity model determines the confidence that the voice to be processed is complete speech, and the mute waiting duration for the second object to respond is dynamically adjusted according to that confidence. Semantically incomplete voice information is effectively recognized and the mute waiting duration adjusted accordingly, so the user is not interrupted before finishing speaking; when the semantics are judged complete, the mute duration is shortened. Interaction efficiency is thereby improved and user experience greatly enhanced.
Similarly, based on the foregoing speech processing method, an embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to perform at least the following operations: operation 101, receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; operation 102, determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and operation 103, determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
Further, based on the above speech processing method, an embodiment of the present invention also provides a speech processing apparatus according to the second aspect of the present invention. Referring to fig. 2, the apparatus 20 includes: a receiving module 201, configured to receive voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; an integrity determination module 202, configured to determine, from the voice data and using the semantic integrity model, a confidence that the voice to be processed is complete speech; and a mute duration determination module 203, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
In an embodiment of the present invention, the mute duration determination module 203 includes: a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs; and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
In an embodiment of the present invention, the mute duration determination module 203 includes: a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
Further, based on the above speech processing method, an embodiment of the present invention also provides a device. As shown in fig. 3, the device 30 includes: at least one processor 301, at least one memory 302 connected to the processor 301, and a bus 303; the processor 301 and the memory 302 communicate with each other through the bus 303; and the processor 301 is configured to call program instructions in the memory 302 to perform the above speech processing method.
It should be noted that the above descriptions of the speech processing apparatus and device embodiments are similar to the description of the method embodiment shown in fig. 1 and have similar beneficial effects; they are therefore omitted here for brevity. For technical details not disclosed in the apparatus embodiment of the present invention, please refer to the description of the method embodiment shown in fig. 1.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communication connections between the components shown or discussed may be through some interfaces, and the indirect couplings or communication connections between devices or units may be electrical, mechanical, or of other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A speech processing method, the method comprising:
receiving voice data of a voice to be processed, wherein the voice to be processed is a voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object;
determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech;
and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
2. The method of claim 1, wherein determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed comprises:
determining the confidence interval to which the confidence belongs;
and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
3. The method of claim 1, wherein determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed comprises: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
4. The method according to any one of claims 1-3, wherein the semantic integrity model is a BERT model optimized by at least one of the following operations:
defining the corpus length in the model input as the actual length of a corpus to be trained, wherein the corpus to be trained has been semantically annotated;
reducing the number of model layers in the model.
5. The method according to claim 4, wherein the corpus to be trained comprises positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora comprise corpora obtained by at least one of the following operations:
obtaining corpora by means of a loss function;
obtaining corpora by means of a hard-sample mining technique.
6. A speech processing apparatus, the apparatus comprising:
a receiving module, configured to receive voice data of a voice to be processed, wherein the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object;
an integrity determination module, configured to determine, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech;
and a mute duration determination module, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
7. The apparatus of claim 6, wherein the mute duration determination module comprises:
a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs;
and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
8. The apparatus of claim 6, wherein the mute duration determination module comprises:
a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
9. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the speech processing method of any one of claims 1-5.
10. A device comprising at least one processor, at least one memory, and a bus connected with the processor; the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to perform the speech processing method of any one of claims 1-5.
CN202010758331.7A 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment Pending CN112053687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010758331.7A CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010758331.7A CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112053687A 2020-12-08

Family

ID=73602228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010758331.7A Pending CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112053687A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665706A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Rapid Speech exchange method and system
CN107919130A (en) * 2017-11-06 2018-04-17 百度在线网络技术(北京)有限公司 Method of speech processing and device based on high in the clouds
US20190139566A1 (en) * 2017-11-06 2019-05-09 Baidu Online Network Technology (Beijing) Co., Ltd. Cloud-based speech processing method and apparatus
TW201937480A (en) * 2018-03-01 2019-09-16 聯捷創新股份有限公司 Adaptive waiting time system for voice input system and method thereof
CN109473104A (en) * 2018-11-07 2019-03-15 苏州思必驰信息科技有限公司 Speech recognition network delay optimization method and device
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium
CN110795566A (en) * 2019-09-18 2020-02-14 平安科技(深圳)有限公司 Case recommendation method, device and equipment and computer-readable storage medium
CN110825879A (en) * 2019-09-18 2020-02-21 平安科技(深圳)有限公司 Case decision result determination method, device and equipment and computer readable storage medium
CN110968671A (en) * 2019-12-03 2020-04-07 北京声智科技有限公司 Intent determination method and device based on Bert
CN111292729A (en) * 2020-02-06 2020-06-16 北京声智科技有限公司 Method and device for processing audio data stream
CN111309869A (en) * 2020-02-28 2020-06-19 中国工商银行股份有限公司 Real-time text stream information retrieval method and system
CN111402866A (en) * 2020-03-23 2020-07-10 北京声智科技有限公司 Semantic recognition method and device and electronic equipment

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700769A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and computer readable storage medium
CN112995419A (en) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
CN112995419B (en) * 2021-02-05 2022-05-24 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
CN113113013A (en) * 2021-04-15 2021-07-13 北京帝派智能科技有限公司 Intelligent voice interaction interruption processing method, device and system
CN114078478A (en) * 2021-11-12 2022-02-22 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114078478B (en) * 2021-11-12 2022-09-23 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114582333A (en) * 2022-02-21 2022-06-03 中国第一汽车股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN115512687A (en) * 2022-11-08 2022-12-23 之江实验室 Voice sentence-breaking method and device, storage medium and electronic equipment
CN115620720A (en) * 2022-11-30 2023-01-17 零犀(北京)科技有限公司 Method and device for muting session, electronic equipment and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN112053687A (en) Voice processing method and device, computer readable storage medium and equipment
CN111477216B (en) Training method and system for voice and meaning understanding model of conversation robot
CN110689877A (en) Voice end point detection method and device
CN110147435B (en) Dialogue generation method, device, equipment and storage medium
CN110853638A (en) Method and equipment for interrupting voice robot in real time in voice interaction process
CN112313930B (en) Method and apparatus for managing maintenance
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
WO2023207212A1 (en) Voice dialogue detection method and apparatus
WO2023082752A1 (en) Voice dialog processing method and apparatus based on multi-modal feature, and electronic device
CN113488026B (en) Speech understanding model generation method based on pragmatic information and intelligent speech interaction method
CN117494715A (en) Dialogue processing method and device, electronic equipment and storage medium
CN115512691A (en) Method for judging echo based on semantic level in man-machine continuous conversation
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN114860910A (en) Intelligent dialogue method and system
CN110125946B (en) Automatic call method, automatic call device, electronic equipment and computer readable medium
CN113851105A (en) Information reminding method, device, equipment and storage medium
CN112738344A (en) Method and device for identifying user identity, storage medium and electronic equipment
CN111667829A (en) Information processing method and device, and storage medium
CN111274828A (en) Language translation method, system, computer program and handheld terminal based on message leaving
CN111935348A (en) Method and device for providing call processing service
CN113782022B (en) Communication method, device, equipment and storage medium based on intention recognition model
CN116401342A (en) Training method of intention recognition model, intention recognition method, device and medium
CN117351985A (en) Audio processing method, device, electronic equipment and readable storage medium
CN114268694A (en) Service request response method, device, equipment, system and medium
CN117711389A (en) Voice interaction method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination