CN112053687A - Voice processing method and device, computer readable storage medium and equipment

Voice processing method and device, computer readable storage medium and equipment

Info

Publication number
CN112053687A
CN112053687A (application CN202010758331.7A)
Authority
CN
China
Prior art keywords
voice
mute
processed
determining
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010758331.7A
Other languages
Chinese (zh)
Inventor
李倩
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd
Priority to CN202010758331.7A
Publication of CN112053687A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice processing method, apparatus, and device. The method comprises: receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed. By scoring the completeness of the received voice with a semantic integrity model, the method effectively recognizes semantically incomplete utterances and dynamically adjusts the mute waiting duration accordingly: the user is not interrupted before finishing speaking, and the mute duration is shortened when the semantics are judged complete, which improves interaction efficiency and greatly improves user experience.

Description

Voice processing method and device, computer readable storage medium and equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech processing method, apparatus, computer-readable storage medium, and device.
Background
With the development of speech recognition technology, intelligent voice service systems that enable voice interaction between humans and machines are being applied in more and more scenarios, for example intelligent customer service and intelligent robots. In a voice interaction scenario, the system must automatically judge whether the user has stopped speaking; once it determines that the user has finished expressing an idea, it automatically executes the next round of information interaction, for example answering the user's question. At present, this judgment mainly relies on a mute duration of fixed length. For example, in the voice interaction system of an intelligent customer service, the mute duration is set to a fixed value (for example, around 200 milliseconds). If the set mute duration is 200 milliseconds, then after the user finishes an utterance, if no other valid speech is received within 200 milliseconds, the user is considered to have stopped speaking, the speech recognition task ends, an NLP (Natural Language Processing) task is invoked, and the user's question is answered.
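For reference, this conventional fixed-silence-timeout judgment can be sketched as follows. This is an illustrative Python sketch only; the `vad_stream.poll()` interface is a hypothetical placeholder for a per-frame voice-activity check, not part of any system cited here.

```python
import time

SILENCE_TIMEOUT_MS = 200  # fixed, manually tuned value in conventional systems

def user_stopped_speaking(vad_stream) -> bool:
    """Return True once no valid speech has been heard for SILENCE_TIMEOUT_MS."""
    last_speech = time.monotonic()
    while True:
        if vad_stream.poll():  # hypothetical check: does the current frame contain speech?
            last_speech = time.monotonic()
        elif (time.monotonic() - last_speech) * 1000 >= SILENCE_TIMEOUT_MS:
            return True  # end the ASR task and hand the utterance to the NLP task
```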
At present, the mute duration is tuned by manual experience. If the set value is large, it takes a long time to decide that the user has finished speaking, the user's actual waiting time is lengthened, and the user experience suffers. If the set value is small, the user's actual waiting time becomes short, but the user is then very likely to be judged to have stopped speaking when he has merely paused briefly to breathe before finishing what he intends to say. For example, when the user pauses for breath, the system judges that the user has stopped speaking and forces the conversation into the next round of interaction, interrupting the user. This easily causes the problem of missing semantics, so that the next round of information interaction cannot proceed normally.
Disclosure of Invention
Embodiments of the present invention provide a speech processing method, apparatus, computer-readable storage medium, and device to solve the above problems in speech processing.
According to a first aspect of the present invention, there is provided a speech processing method, the method comprising: receiving voice data of a voice to be processed, where the voice to be processed is a voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object; determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed includes: determining the confidence interval to which the confidence belongs; and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed includes: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the semantic integrity model is a BERT model optimized by at least one of the following operations: defining the corpus length in the model input as the actual length of the corpus to be trained, where the corpus to be trained has been semantically annotated; and reducing the number of model layers in the model.
According to an embodiment of the present invention, the corpus to be trained includes positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora include corpora obtained by at least one of the following operations: obtaining corpora by means of a loss function; and obtaining corpora by means of a hard-sample mining technique.
According to a second aspect of the present invention, there is also provided a speech processing apparatus, the apparatus comprising: a receiving module, configured to receive voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; an integrity determination module, configured to determine, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and a mute duration determination module, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the mute duration determination module includes: a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs; and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to an embodiment of the present invention, the mute duration determination module includes: a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
According to a third aspect of the present invention, there is also provided a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the speech processing methods described above.
According to a fourth aspect of the present invention, there is also provided a device comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to execute the above speech processing method.
The voice processing method, apparatus, and device are mainly applied to voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object. According to the received voice data, a semantic integrity model determines the confidence that the voice to be processed is complete speech, and the mute waiting duration for the second object to respond is dynamically adjusted according to that confidence. Semantically incomplete voice information is effectively recognized and the mute waiting duration adjusted accordingly, so the user is not interrupted before finishing speaking; when the semantics are judged complete, the mute duration is shortened. Interaction efficiency is thereby improved and user experience greatly enhanced.
It is to be understood that implementations of the present invention need not achieve all of the above benefits; rather, specific embodiments may achieve specific technical results, and other embodiments of the present invention may achieve benefits not mentioned above.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart of a voice processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the composition of a voice processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the composition of a device according to an embodiment of the present invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given only to enable those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
To better describe the specific scheme of the embodiments of the present invention, a typical application scenario is first described by way of example. It should be noted that the present invention is not limited to the following application scenario and may also be applied to other suitable scenarios.
Intelligent systems with natural-language recognition functions, such as intelligent customer service, intelligent robots, and communication secretaries, engage in real voice interaction with users and must converse according to the received voice. The intelligent system therefore needs to judge whether the user's current utterance has ended and decide how long to wait before responding. Because the system is waiting for further voice content from the user, this waiting is usually silent, and its length is called the mute waiting duration. If it is set long, interaction efficiency is low; if it is set too short, the user's speech may be interrupted. The embodiments of the present invention judge the semantic integrity of the voice according to the received voice content and dynamically adjust the mute waiting duration accordingly, effectively improving both interaction efficiency and user experience.
The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech processing method according to an embodiment of the present invention.
Referring to fig. 1, the speech processing method according to the embodiment of the present invention includes at least the following operations: operation 101, receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; operation 102, determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and operation 103, determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
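As an illustrative, non-normative sketch, the three operations can be strung together as follows; the helper names (`asr_decode`, `semantic_integrity_model`, `choose_mute_wait`) are hypothetical placeholders for an ASR decoder, the trained model, and the confidence-to-duration mapping elaborated under operation 103 below.

```python
# Illustrative sketch only: the injected helpers are hypothetical, not part of
# the patent; they stand for an ASR decoder, the semantic integrity model, and
# the confidence-to-duration mapping.
def handle_utterance(audio, asr_decode, semantic_integrity_model, choose_mute_wait):
    text = asr_decode(audio)                      # operation 101: receive and decode the voice data
    confidence = semantic_integrity_model(text)  # operation 102: confidence that the speech is complete
    return choose_mute_wait(confidence)          # operation 103: dynamic mute waiting duration (ms)
```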
In operation 101, voice data of a to-be-processed voice, which is a voice uttered by a first object during a number of rounds of voice interaction of the first object and a second object, is received.
In an embodiment of the present invention, the multiple rounds of voice interaction between the first object and the second object may be a conversation between a user and an intelligent customer service when the user calls customer service from a fixed-line telephone or a mobile phone. It may also be a session between two users of a mobile service provider (over mobile phone, internet phone, fixed-line phone, and so on) in which one user has activated an intelligent service function. For example: user A calls user B; user B has activated an intelligent service function provided by the mobile operator or by an intelligent terminal (a mobile phone, a tablet computer, a fixed-line handset, and so on). When it is inconvenient for user B to answer, the intelligent terminal automatically answers user A's call and interacts with user A according to the content of user A's speech. In addition to calls dialed from a telephone or mobile phone, the interaction may also take place through an application, for example an instant messaging application.
In operation 102, a confidence level that the speech to be processed is complete speech is determined using the semantic integrity model based on the speech data.
In an embodiment of the present invention, a BERT (Bidirectional Encoder Representations from Transformers) model is optimized by at least one of the following operations to obtain the semantic integrity model: defining the corpus length in the model input as the actual length of the corpus to be trained, where the corpus to be trained has been semantically annotated; and reducing the number of model layers in the model.
BERT is a pre-trained language representation model open-sourced by Google. It differs from other language models in two significant ways. First, when training the bidirectional language model, it masks a small proportion of the input tokens, replacing each with a mask token or, with smaller probability, a random word, which forces the BERT model to rely more heavily on context. Second, it adds a loss term for predicting the next sentence. The BERT model is deep (the base model has 12 layers) rather than wide. BERT belongs to the family of masked language models (MLM) and can simultaneously use the words to the left and right of the position being processed. However, current BERT models take fixed-length corpora as input and do not train on the actual length of each corpus.
In an embodiment of the present invention, the corpus length in the model input is defined as the actual length of the corpus to be trained, the corpus to be trained being a corpus on which semantic annotation has been completed.
For example, a batch of training voices is obtained and labeled as corpora, for example conversation voices between users and the intelligent customer service. If the semantics of the user's speech are complete, the utterance is labeled complete. If the user's utterance is audibly unfinished, that is, the user is clearly still speaking when the audio ends, it is labeled incomplete. The corpus length in the model input is then defined as the actual length of the corpus to be trained. Building on the large amount of prior sentence knowledge already encoded in the BERT model, corpora from the specific scenario are added, a new classification task based on semantic integrity is defined, and a semantic integrity model with higher recognition accuracy can be obtained.
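A minimal sketch of this data preparation, assuming the HuggingFace `transformers` toolkit (the patent does not name a toolkit); the texts and labels are invented illustrations of complete versus incomplete utterances, not patent data:

```python
# Assumption: HuggingFace transformers; the texts/labels are illustrative.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

texts = ["I want to book the day after tomorrow",  # semantics complete
         "I want to book"]                         # semantics missing
labels = [1, 0]                                    # 1 = complete, 0 = incomplete

# padding="longest" pads each batch only to the actual length of its longest
# corpus, rather than to a fixed maximum length, matching the variable-length
# input described above.
batch = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")
```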
In an embodiment of the invention, the number of model layers in the BERT model is reduced, and the BERT model is optimized to obtain a semantic integrity model.
For a deep BERT model, the depth brings a large amount of computation to language processing. If an off-the-shelf BERT model is used to judge the integrity of the voice to be processed during voice interaction, the judgment may take so long that it harms the user experience. For example, suppose judging whether the semantics of a received voice are complete takes 300 ms, and the mute waiting duration for semantically complete voice is set to 200 ms. By the time the semantics are judged complete and the 200 ms wait is chosen, 300 ms have already elapsed, and the judgment has lost its purpose. Therefore, in an embodiment of the present invention, the existing BERT model is optimized by reducing the number of model layers to obtain the semantic integrity model, for example reducing the original 12-layer BERT model to a semantic integrity model with only 3 layers.
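One way to realize this layer reduction, sketched under the assumption of the HuggingFace `transformers` API (the patent prescribes only the idea of reducing 12 encoder layers to 3, not a specific toolkit):

```python
# Assumption: HuggingFace transformers API.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

# Keep only the first 3 of the 12 encoder layers and record the new depth.
model.bert.encoder.layer = torch.nn.ModuleList(model.bert.encoder.layer[:3])
model.config.num_hidden_layers = 3
```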
In one embodiment of the present invention, the model processing speed is further increased by applying fixed-point (quantization) processing to the model. For example, parameters in the model may be converted from floating-point numbers (real numbers) to integers, reducing the amount of model computation.
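As one possible realization of this fixed-point step (the patent only states that floating-point parameters become integers), PyTorch dynamic quantization can be sketched as:

```python
# Assumption: PyTorch dynamic quantization as one concrete way to turn the
# floating-point weights into int8 integers; `model` is the classifier above.
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,              # the (layer-reduced) BERT classifier
    {torch.nn.Linear},  # quantize the weights of all Linear layers
    dtype=torch.qint8,
)
```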
In an embodiment of the present invention, the corpus to be trained includes positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora include corpora obtained by at least one of the following operations: obtaining corpora by means of a loss function; and obtaining corpora by means of a hard-sample mining technique.
For example, a batch of training voices is obtained and labeled as corpora, for example conversation voices between users and the intelligent customer service. If the semantics of the user's speech are complete, the utterance is labeled as positive corpus. If the user's utterance is incomplete and semantics are missing, that is, the user can clearly be heard still speaking when the audio ends, it is labeled as negative corpus. In practice, negative corpora with missing semantics are relatively scarce; for example, they may account for only 10% of all acquired corpora. Some operations are therefore needed to enlarge the negative corpus and alleviate the imbalance between positive and negative sample sizes. For example, training voices can be randomly truncated, taking the first part of the truncated voice as negative corpus; truncation points can be selected with the help of a loss function; and more negative corpora can be obtained using hard-sample mining techniques.
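A minimal sketch of the random-truncation augmentation; the cut-point policy here (keeping a prefix of at most half the transcript) and the sample data are illustrative assumptions, not specified by the patent:

```python
# Illustrative sketch: `complete_transcripts` stands for the transcripts
# labeled as semantically complete; the cut-point policy is an assumption.
import random

complete_transcripts = ["I want to book the day after tomorrow"]  # illustrative

def make_negative_example(transcript: str) -> str:
    """Truncate a complete transcript so its semantics become incomplete."""
    cut = random.randint(1, max(1, len(transcript) // 2))  # keep an early prefix
    return transcript[:cut]

negatives = [make_negative_example(t) for t in complete_transcripts]
```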
After the corpora are prepared, training is performed. In the embodiment of the present invention, a binary classification model based on the BERT structure is trained: a classifier is trained on the semantically annotated corpora, and the probability output of the classifier is taken as the confidence of semantic completeness. For example, the model output might be "probability of complete semantics: 80%; probability of missing semantics: 20%".
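An inference-time sketch of reading the classifier's probability output as the confidence, reusing the `tokenizer` and `model` objects from the sketches above (the index convention follows the illustrative labeling, 1 = complete):

```python
# Sketch: softmax over the two-class logits is read as
# [P(semantics missing), P(semantics complete)] under the labeling used above.
import torch

def semantic_confidence(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[1].item()  # confidence that the semantics are complete
```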
In addition, since the original BERT model has large latency in actual use, fixed-point processing may be applied: real numbers among the model parameters are rounded and represented as integers, greatly reducing the amount of model computation. Variable-length input may also be adopted, defining the corpus length in the model input as the actual length of the semantically annotated corpus to be trained. Further, the number of model layers may be optimized, for example reducing the original 12-layer BERT model to a 3-layer semantic integrity model. These optimizations speed up model processing so that the trained semantic integrity model can be applied more effectively.
In operation 103, a mute wait time for the second object to respond to the speech to be processed is determined according to the confidence level.
In an embodiment of the present invention, determining the mute waiting duration for the second object to respond to the voice to be processed according to the confidence is implemented by the following operations: determining the confidence interval to which the confidence belongs; and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
For example, semantic-level analysis is performed on the currently decoded voice to be processed, and the analysis outputs a binary judgment together with a confidence, the binary judgment being whether the received voice to be processed is complete speech. The output information is [probability of complete semantics: X%; probability of missing semantics: (1-X)%]. For instance, an utterance such as "I want to book the day after tomorrow" is semantically complete, whereas "I want to book" is semantically incomplete, so the probability that the user will continue speaking is high. The confidence is the degree of certainty given for semantic completeness.
For example, the following presets may be configured:
T = 400 ms, S ∈ [0, 60%)
T = 300 ms, S ∈ [60%, 80%)
T = 200 ms, S ∈ [80%, 100%]
where T denotes the mute waiting duration and S denotes the confidence of complete semantics.
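Encoded directly, this first relation is a simple interval lookup; in the sketch below, the thresholds and durations are taken from the example preset above and are tunable values, not normative constants:

```python
def mute_wait_from_interval(confidence: float) -> int:
    if confidence < 0.60:
        return 400  # ms: semantics likely incomplete, wait longer
    elif confidence < 0.80:
        return 300  # ms
    else:
        return 200  # ms: semantics judged complete, respond quickly
```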
In an embodiment of the present invention, determining the mute waiting duration for the second object to respond to the voice to be processed according to the confidence is implemented by the following operation: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
For example, the second relation between the confidence and the mute waiting duration may be preset as T = f(S), where T denotes the mute waiting duration and S denotes the confidence of complete semantics. f may be a simple linear function, or a nonlinear function obtained through experimentation; the present invention is not limited in this respect.
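As an illustration, one simple linear choice of f, interpolating from 400 ms at S = 0 down to 200 ms at S = 1 to stay consistent with the interval example above (the patent leaves the exact form open):

```python
# One possible linear form of T = f(S); the endpoints are an illustrative
# assumption chosen to match the interval example, not a prescribed choice.
def mute_wait_linear(confidence: float) -> float:
    t_max, t_min = 400.0, 200.0  # ms
    return t_max - (t_max - t_min) * confidence
```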
The voice processing method, apparatus, and device are mainly applied to voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object. According to the received voice data, a semantic integrity model determines the confidence that the voice to be processed is complete speech, and the mute waiting duration for the second object to respond is dynamically adjusted according to that confidence. Semantically incomplete voice information is effectively recognized and the mute waiting duration adjusted accordingly, so the user is not interrupted before finishing speaking; when the semantics are judged complete, the mute duration is shortened. Interaction efficiency is thereby improved and user experience greatly enhanced.
Similarly, based on the foregoing speech processing method, an embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to perform at least the following operations: operation 101, receiving voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; operation 102, determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech; and operation 103, determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
Further, based on the above speech processing method, an embodiment of the present invention also provides a speech processing apparatus according to the second aspect of the present invention. Referring to fig. 2, the apparatus 20 includes: a receiving module 201, configured to receive voice data of a voice to be processed, where the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object; an integrity determination module 202, configured to determine, from the voice data and using the semantic integrity model, a confidence that the voice to be processed is complete speech; and a mute duration determination module 203, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
In an embodiment of the present invention, the mute duration determination module 203 includes: a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs; and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
In an embodiment of the present invention, the mute duration determination module 203 includes: a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
Further, based on the above speech processing method, an embodiment of the present invention also provides a device. As shown in fig. 3, the device 30 includes: at least one processor 301, at least one memory 302 connected to the processor 301, and a bus 303; the processor 301 and the memory 302 communicate with each other through the bus 303; and the processor 301 is configured to call program instructions in the memory 302 to perform the above speech processing method.
It should be noted that the above descriptions of the speech processing apparatus and device embodiments are similar to the description of the method embodiment shown in fig. 1 and have similar beneficial effects; they are therefore omitted here for brevity. For technical details not disclosed in the apparatus embodiment of the present invention, please refer to the description of the method embodiment shown in fig. 1.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communication connections between the components shown or discussed may be through some interfaces, and the indirect couplings or communication connections between devices or units may be electrical, mechanical, or of other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A speech processing method, the method comprising:
receiving voice data of a voice to be processed, wherein the voice to be processed is a voice uttered by a first object during multiple rounds of voice interaction between the first object and a second object;
determining, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech;
and determining, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
2. The method of claim 1, wherein determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed comprises:
determining the confidence interval to which the confidence belongs;
and determining, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
3. The method of claim 1, wherein determining, according to the confidence, the mute waiting duration for the second object to respond to the voice to be processed comprises: determining, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
4. The method according to any one of claims 1-3, wherein the semantic integrity model is a BERT model optimized by at least one of the following operations:
defining the corpus length in the model input as the actual length of a corpus to be trained, wherein the corpus to be trained has been semantically annotated;
reducing the number of model layers in the model.
5. The method according to claim 4, wherein the corpus to be trained comprises positive corpora with complete semantics and negative corpora with missing semantics, and the negative corpora comprise corpora obtained by at least one of the following operations:
obtaining corpora by means of a loss function;
obtaining corpora by means of a hard-sample mining technique.
6. A speech processing apparatus, the apparatus comprising:
a receiving module, configured to receive voice data of a voice to be processed, wherein the voice to be processed is uttered by a first object during multiple rounds of voice interaction between the first object and a second object;
an integrity determination module, configured to determine, from the voice data and using a semantic integrity model, a confidence that the voice to be processed is complete speech;
and a mute duration determination module, configured to determine, according to the confidence, a mute waiting duration for the second object to respond to the voice to be processed.
7. The apparatus of claim 6, wherein the mute duration determination module comprises:
a confidence interval judgment submodule, configured to determine, according to the confidence, the confidence interval to which the confidence belongs;
and a first duration determination submodule, configured to determine, according to that confidence interval and a predetermined first relation between confidence intervals and mute waiting durations, a first mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
8. The apparatus of claim 6, wherein the mute duration determination module comprises:
a second duration determination submodule, configured to determine, according to the confidence and a predetermined second relation between confidence and mute waiting duration, a second mute waiting duration corresponding to the confidence, as the mute waiting duration for the second object to respond to the voice to be processed.
9. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the speech processing method of any one of claims 1-5.
10. A device comprising at least one processor, at least one memory, and a bus connected with the processor; the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to perform the speech processing method of any one of claims 1-5.
CN202010758331.7A 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment Pending CN112053687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010758331.7A CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010758331.7A CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112053687A 2020-12-08

Family

ID=73602228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010758331.7A Pending CN112053687A (en) 2020-07-31 2020-07-31 Voice processing method and device, computer readable storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112053687A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665706A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Rapid Speech exchange method and system
CN107919130A (en) * 2017-11-06 2018-04-17 百度在线网络技术(北京)有限公司 Method of speech processing and device based on high in the clouds
US20190139566A1 (en) * 2017-11-06 2019-05-09 Baidu Online Network Technology (Beijing) Co., Ltd. Cloud-based speech processing method and apparatus
TW201937480A (en) * 2018-03-01 2019-09-16 聯捷創新股份有限公司 Adaptive waiting time system for voice input system and method thereof
CN109473104A (en) * 2018-11-07 2019-03-15 苏州思必驰信息科技有限公司 Speech recognition network delay optimization method and device
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium
CN110795566A (en) * 2019-09-18 2020-02-14 平安科技(深圳)有限公司 Case recommendation method, device and equipment and computer-readable storage medium
CN110825879A (en) * 2019-09-18 2020-02-21 平安科技(深圳)有限公司 Case decision result determination method, device and equipment and computer readable storage medium
CN110968671A (en) * 2019-12-03 2020-04-07 北京声智科技有限公司 Intent determination method and device based on Bert
CN111292729A (en) * 2020-02-06 2020-06-16 北京声智科技有限公司 Method and device for processing audio data stream
CN111309869A (en) * 2020-02-28 2020-06-19 中国工商银行股份有限公司 Real-time text stream information retrieval method and system
CN111402866A (en) * 2020-03-23 2020-07-10 北京声智科技有限公司 Semantic recognition method and device and electronic equipment

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700769A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and computer readable storage medium
CN112995419A (en) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
CN112995419B (en) * 2021-02-05 2022-05-24 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
CN113113013A (en) * 2021-04-15 2021-07-13 北京帝派智能科技有限公司 Intelligent voice interaction interruption processing method, device and system
CN114078478A (en) * 2021-11-12 2022-02-22 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114078478B (en) * 2021-11-12 2022-09-23 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114582333A (en) * 2022-02-21 2022-06-03 中国第一汽车股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN115512687A (en) * 2022-11-08 2022-12-23 之江实验室 Voice sentence-breaking method and device, storage medium and electronic equipment
CN115620720A (en) * 2022-11-30 2023-01-17 零犀(北京)科技有限公司 Method and device for muting session, electronic equipment and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN112053687A (en) Voice processing method and device, computer readable storage medium and equipment
CN111477216B (en) Training method and system for voice and meaning understanding model of conversation robot
CN110689877A (en) Voice end point detection method and device
CN110147435B (en) Dialogue generation method, device, equipment and storage medium
CN110853638A (en) Method and equipment for interrupting voice robot in real time in voice interaction process
CN112313930B (en) Method and apparatus for managing maintenance
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
WO2023207212A1 (en) Voice dialogue detection method and apparatus
WO2023082752A1 (en) Voice dialog processing method and apparatus based on multi-modal feature, and electronic device
CN113488026B (en) Speech understanding model generation method based on pragmatic information and intelligent speech interaction method
CN117494715A (en) Dialogue processing method and device, electronic equipment and storage medium
CN115512691A (en) Method for judging echo based on semantic level in man-machine continuous conversation
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN114860910A (en) Intelligent dialogue method and system
CN110125946B (en) Automatic call method, automatic call device, electronic equipment and computer readable medium
CN113851105A (en) Information reminding method, device, equipment and storage medium
CN112738344A (en) Method and device for identifying user identity, storage medium and electronic equipment
CN111667829A (en) Information processing method and device, and storage medium
CN111274828A (en) Language translation method, system, computer program and handheld terminal based on message leaving
CN111935348A (en) Method and device for providing call processing service
CN113782022B (en) Communication method, device, equipment and storage medium based on intention recognition model
CN116401342A (en) Training method of intention recognition model, intention recognition method, device and medium
CN117351985A (en) Audio processing method, device, electronic equipment and readable storage medium
CN114268694A (en) Service request response method, device, equipment, system and medium
CN117711389A (en) Voice interaction method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination