CN112509570A - Voice signal processing method and device, electronic equipment and storage medium - Google Patents

Voice signal processing method and device, electronic equipment and storage medium

Info

Publication number
CN112509570A
CN112509570A
Authority
CN
China
Prior art keywords
word segmentation
corpus
sequence
probability
word
Prior art date
Legal status
Granted
Application number
CN201910810339.0A
Other languages
Chinese (zh)
Other versions
CN112509570B (en)
Inventor
王阳阳
李曙光
韩伟
Current Assignee
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201910810339.0A
Publication of CN112509570A
Application granted
Publication of CN112509570B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice signal processing method and device, an electronic device and a storage medium. The method comprises: performing word segmentation on a temporary recognition result, obtained from audio stream data acquired by an intelligent device in real time, to obtain a plurality of word segmentation segments; inputting a first word segmentation sequence composed of the word segmentation segments into a trained sentence segmentation model, and determining, according to the model's output, a first prediction probability that a sentence break can follow the first word segmentation sequence; acquiring a second prediction probability, determined according to word frequency data, that the next word segmentation segment after a second word segmentation sequence is an end character; and, if a third prediction probability determined from the first and second prediction probabilities is larger than a probability threshold, performing semantic analysis on the temporary recognition result. The technical scheme of the embodiments can truncate audio stream data timely and accurately, shortening the response time of the intelligent device.

Description

Voice signal processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing a voice signal, an electronic device, and a storage medium.
Background
When current intelligent devices perform speech recognition and speech processing, they generally acquire a section of speech data with complete semantics, obtain a speech recognition result through ASR (Automatic Speech Recognition), and then perform semantic understanding based on the recognition result to obtain response data corresponding to the speech data, which is fed back to the user.
One existing way to obtain voice data with complete semantics is for the user to press a designated key after inputting a section of voice data, informing the intelligent device that voice input is finished; the device then acquires and processes that section. In another mode, when the intelligent device receives voice continuously, the continuously input voice data is truncated using Voice Activity Detection (VAD) to obtain a complete section of voice data. However, when environmental noise is loud, this method may fail to truncate the voice data or truncate it in the wrong place, so that voice recognition cannot be performed in time; the response time of the intelligent device is prolonged, the user does not get a timely reply, and the user experience degrades.
Disclosure of Invention
The embodiments of the invention provide a voice signal processing method and device, an electronic device and a storage medium, aiming to solve the prior-art problem that voice data cannot be truncated effectively, so that the intelligent device cannot respond timely and accurately.
In a first aspect, an embodiment of the present invention provides a speech signal processing method, including:
carrying out voice recognition on audio stream data acquired by intelligent equipment in real time to obtain a temporary recognition result;
performing word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments;
inputting a first word segmentation sequence consisting of the word segmentation segments into a trained sentence segmentation model, and determining, according to the output of the sentence segmentation model, a first prediction probability that a sentence break can follow the first word segmentation sequence; and acquiring a second prediction probability that the next word segmentation segment after a second word segmentation sequence is an end character, wherein the second prediction probability is determined according to word frequency data, the word frequency data comprises the number of times each word segmentation sequence occurs in the corpora of a corpus, the second word segmentation sequence is the sequence formed by the last N word segmentation segments in the temporary recognition result, and N is a positive integer;
determining a third prediction probability according to the first prediction probability and the second prediction probability;
and if the third prediction probability is larger than a probability threshold, performing semantic analysis on the temporary recognition result.
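The steps of the first aspect can be sketched in a few lines. Everything named here (the end token `</s>`, the weighted fusion, the function and argument names) is illustrative; the patent does not prescribe a concrete rule for deriving the third prediction probability from the first two.

```python
# Hypothetical sketch of the claimed decision flow. `sentence_break_model`
# returns the probability that an end character follows the given segments;
# `ngram_counts` maps word-segment tuples to occurrence counts.

END = "</s>"  # assumed end-character token

def should_truncate(segments, sentence_break_model, ngram_counts,
                    n=2, threshold=0.8, weight=0.5):
    """Decide whether the temporary recognition result is semantically complete."""
    # First prediction probability: from the trained sentence segmentation model.
    p1 = sentence_break_model(segments)
    # Second prediction probability: from the word frequency data.
    tail = tuple(segments[-n:])             # the second word segmentation sequence
    m = ngram_counts.get(tail, 0)           # count M of the N-gram
    k = ngram_counts.get(tail + (END,), 0)  # count K with the end character appended
    p2 = k / m if m else 0.0
    # Third prediction probability: a fusion of the first two (weighted sum is
    # one plausible choice, used here only for illustration).
    p3 = weight * p1 + (1 - weight) * p2
    return p3 > threshold
```

When the function returns `False`, recognition simply continues on the growing audio stream; when it returns `True`, the temporary recognition result is handed to semantic analysis.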
Optionally, obtaining the second prediction probability that the next word segmentation segment after the second word segmentation sequence is an end character specifically includes:
acquiring the count M corresponding to the second word segmentation sequence from the word frequency data; acquiring the count K corresponding to a third word segmentation sequence from the word frequency data, wherein the third word segmentation sequence is obtained by appending the end character to the second word segmentation sequence; and determining the second prediction probability according to K and M;
or determining, from pre-configured probability data giving, for each N-element word segmentation sequence, the probability that the next word segmentation segment is an end character, the probability corresponding to the second word segmentation sequence as the second prediction probability, wherein the N-element word segmentation sequences are obtained by word segmentation of the corpora in the corpus, and the probability data is determined from the word frequency data of each N-element word segmentation sequence and of the (N+1)-element word segmentation sequence obtained by appending the end character to it.
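The pre-configured variant can be sketched as a one-off table build; the end token and the function name are assumptions, not prescribed by the patent.

```python
# Hypothetical sketch of the pre-configured probability table: for every N-gram
# in the word frequency data, the probability that the next segment is the end
# character is computed once from the N-gram count M and the (N+1)-gram count K.

END = "</s>"  # assumed end-character token

def precompute_end_probabilities(ngram_counts, n=2):
    """Map each N-gram to P(next word segmentation segment is the end character)."""
    probs = {}
    for gram, m in ngram_counts.items():
        if len(gram) != n or m == 0:
            continue
        k = ngram_counts.get(gram + (END,), 0)  # count of the (N+1)-gram
        probs[gram] = k / m                     # K / M
    return probs
```

At run time, the second prediction probability is then a single dictionary lookup instead of two count lookups and a division.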
Optionally, the word frequency data is obtained by:
performing word segmentation processing on each corpus in the corpus to obtain word segmentation segments corresponding to each corpus;
determining a sequence consisting of N continuous word segmentation segments in each corpus as an N-element word segmentation sequence;
determining a sequence consisting of N +1 word segmentation segments in each corpus as an N + 1-element word segmentation sequence;
and counting the occurrence frequency of each N-element word segmentation sequence and each N + 1-element word segmentation sequence in each corpus of the corpus to obtain the word frequency data.
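The counting steps above can be sketched as follows, assuming each corpus is already segmented into a list of word segmentation segments terminated by an assumed end token `</s>`.

```python
# Minimal sketch of the word-frequency construction: every N-gram and
# (N+1)-gram occurring in any corpus is counted.

from collections import Counter

def build_ngram_counts(corpora, n=2):
    """Count every N-gram and (N+1)-gram across all corpora in the corpus."""
    counts = Counter()
    for segments in corpora:
        for size in (n, n + 1):
            for i in range(len(segments) - size + 1):
                counts[tuple(segments[i:i + size])] += 1
    return counts
```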
Optionally, the N is equal to 2.
Optionally, the corpora in the corpus are updated by:
if the third prediction probability is smaller than or equal to the probability threshold value and the voice starting point and the voice ending point in the audio stream data are detected, adding the ending character after the final recognition result corresponding to the audio stream data between the voice starting point and the voice ending point, and adding the final recognition result added with the ending character into the corpus as a newly added corpus;
or acquiring a text with complete semantics after the artificial intervention, adding the ending character after the text, and adding the text added with the ending character into the corpus as a newly added corpus.
Optionally, the method further comprises:
if the corpus has a new corpus, performing word segmentation processing on the new corpus to obtain an N-element word segmentation sequence and an N + 1-element word segmentation sequence corresponding to the new corpus;
and updating the word frequency data corresponding to the N-element word segmentation sequence and the N + 1-element word segmentation sequence corresponding to the newly added corpus.
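The incremental update can be sketched as follows (illustrative names): only the n-grams of the newly added corpus are counted and merged into the existing word frequency data, so the whole corpus need not be rescanned.

```python
# Sketch of the incremental word-frequency update for one newly added corpus.

def update_ngram_counts(counts, new_segments, n=2):
    """Merge the new corpus's N- and (N+1)-gram counts into `counts` in place."""
    for size in (n, n + 1):
        for i in range(len(new_segments) - size + 1):
            gram = tuple(new_segments[i:i + size])
            counts[gram] = counts.get(gram, 0) + 1
    return counts
```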
Optionally, the method further comprises:
and if the word frequency data is updated, updating probability data corresponding to each word segmentation sequence according to the updated word frequency data.
Optionally, the method further comprises:
if the third prediction probability is smaller than or equal to the probability threshold, determining word segmentation segments with the maximum probability after the second word segmentation sequence according to the word frequency data;
controlling the intelligent device to output the determined word segmentation segments.
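The lookup of the maximum-probability continuation can be sketched as follows; the function name and data layout are hypothetical.

```python
# Sketch of the word-completion step: when the third prediction probability does
# not exceed the threshold, the highest-count continuation of the last N
# segments is looked up in the word frequency data.

def most_likely_continuation(ngram_counts, tail):
    """Return the segment that most often follows `tail` (an N-tuple), or None."""
    n = len(tail)
    best, best_count = None, 0
    for gram, count in ngram_counts.items():
        if len(gram) == n + 1 and gram[:n] == tuple(tail) and count > best_count:
            best, best_count = gram[-1], count
    return best
```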
Optionally, the sentence segmentation model is a binary classification model used to predict the probability that the next word segmentation segment after an input word segmentation sequence is an end character, and determining, according to the output of the sentence segmentation model, the first prediction probability that a sentence break can follow the first word segmentation sequence specifically includes: obtaining the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character, and determining that probability value as the first prediction probability;
or, the sentence segmentation model is a punctuation mark model used to predict the probability value of each punctuation mark appearing after an input word segmentation sequence, and determining the first prediction probability according to the output of the sentence segmentation model specifically includes: obtaining the probability value, output by the punctuation mark model, of a designated punctuation mark appearing after the first word segmentation sequence, and determining that probability value as the first prediction probability.
Optionally, obtaining the probability value of a designated punctuation mark appearing after the first word segmentation sequence output by the punctuation mark model and determining it as the first prediction probability specifically includes:
if probability values of a plurality of designated punctuation marks are obtained, determining the sum of those probability values as the first prediction probability, or determining the maximum of those probability values as the first prediction probability.
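Both fusion rules fit in one small helper; the function name and input shape are assumptions.

```python
# Sketch of the two fusion rules for several designated punctuation marks: the
# first prediction probability is either the sum or the maximum of the
# per-mark probability values.

def fuse_punctuation_probs(punct_probs, mode="sum"):
    """punct_probs: probability per designated end punctuation, e.g. 。？！"""
    values = list(punct_probs.values())
    return sum(values) if mode == "sum" else max(values)
```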
Optionally, the binary classification model is trained by:
obtaining a plurality of corpus samples and a classification label for each corpus sample, wherein the classification label marks whether the corpus sample is followed by an end character;
and training the binary classification model according to the corpus samples and their classification labels.
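One plausible way to derive labelled samples from end-character-terminated corpora is sketched below; this labelling scheme is an assumption and is not prescribed by the patent.

```python
# Sketch of training-data preparation for the binary classification model: each
# prefix of a complete corpus becomes a sample, labelled 1 only when the next
# segment is the end character.

def make_training_samples(corpus_segments, end="</s>"):
    """Yield (prefix, label) pairs from one end-character-terminated corpus."""
    samples = []
    for i in range(1, len(corpus_segments)):
        label = 1 if corpus_segments[i] == end else 0
        samples.append((corpus_segments[:i], label))
    return samples
```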
In a second aspect, an embodiment of the present invention provides a speech signal processing apparatus, including:
the voice recognition module is used for carrying out voice recognition on audio stream data acquired by the intelligent equipment in real time to obtain a temporary recognition result;
the word segmentation processing module is used for carrying out word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments;
the first prediction module is used for inputting a first word segmentation sequence consisting of the word segmentation segments into a trained sentence segmentation model, and determining a first prediction probability of a sentence segmentation after the first word segmentation sequence according to the output of the sentence segmentation model;
the second prediction module is configured to obtain a second prediction probability that a next participle segment after a second participle sequence is an end character, where the second prediction probability is determined according to word frequency data, the word frequency data includes the number of times that each participle sequence appears in each corpus determined based on the corpus in the corpus, the second participle sequence is a sequence composed of the last N participle segments in the temporary recognition result, and N is a positive integer;
a determining module, configured to determine a third prediction probability according to the first prediction probability and the second prediction probability;
and the analysis module is used for performing semantic analysis on the temporary recognition result if the third prediction probability is greater than a probability threshold.
Optionally, the second prediction module is specifically configured to:
acquiring the count M corresponding to the second word segmentation sequence from the word frequency data; acquiring the count K corresponding to a third word segmentation sequence from the word frequency data, wherein the third word segmentation sequence is obtained by appending the end character to the second word segmentation sequence; and determining the second prediction probability according to K and M;
or determining, from pre-configured probability data giving, for each N-element word segmentation sequence, the probability that the next word segmentation segment is an end character, the probability corresponding to the second word segmentation sequence as the second prediction probability, wherein the N-element word segmentation sequences are obtained by word segmentation of the corpora in the corpus, and the probability data is determined from the word frequency data of each N-element word segmentation sequence and of the (N+1)-element word segmentation sequence obtained by appending the end character to it.
Optionally, the apparatus further includes a word frequency data obtaining module, configured to:
performing word segmentation processing on each corpus in the corpus to obtain word segmentation segments corresponding to each corpus;
determining a sequence consisting of N continuous word segmentation segments in each corpus as an N-element word segmentation sequence;
determining a sequence consisting of N +1 word segmentation segments in each corpus as an N + 1-element word segmentation sequence;
and counting the occurrence frequency of each N-element word segmentation sequence and each N + 1-element word segmentation sequence in each corpus of the corpus to obtain the word frequency data.
Optionally, the N is equal to 2.
Optionally, the corpora in the corpus are updated by:
if the third prediction probability is smaller than or equal to the probability threshold value and the voice starting point and the voice ending point in the audio stream data are detected, adding the ending character after the final recognition result corresponding to the audio stream data between the voice starting point and the voice ending point, and adding the final recognition result added with the ending character into the corpus as a newly added corpus;
or acquiring a text with complete semantics after the artificial intervention, adding the ending character after the text, and adding the text added with the ending character into the corpus as a newly added corpus.
Optionally, the word frequency data is updated by:
if the corpus has a new corpus, performing word segmentation processing on the new corpus to obtain an N-element word segmentation sequence and an N + 1-element word segmentation sequence corresponding to the new corpus;
and updating the word frequency data corresponding to the N-element word segmentation sequence and the N + 1-element word segmentation sequence corresponding to the newly added corpus.
Optionally, the probability data corresponding to each participle sequence is updated as follows:
and if the word frequency data is updated, updating probability data corresponding to each word segmentation sequence according to the updated word frequency data.
Optionally, the apparatus further comprises a word segmentation prediction module configured to:
if the third prediction probability is smaller than or equal to the probability threshold, determining word segmentation segments with the maximum probability after the second word segmentation sequence according to the word frequency data;
controlling the intelligent device to output the determined word segmentation segments.
Optionally, the sentence segmentation model is a binary classification model used to predict the probability that the next word segmentation segment after the input word segmentation sequence is an end character, and the first prediction module is specifically configured to: obtain the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character, and determine that probability value as the first prediction probability;
optionally, the sentence segmentation model is a punctuation mark model, the punctuation mark model is configured to predict a probability value of each punctuation mark appearing after the input word segmentation sequence, and the first prediction module is specifically configured to: and obtaining a probability value of a designated punctuation mark appearing after the first word segmentation sequence output by the punctuation mark model, and determining the probability value as the first prediction probability.
Optionally, the first prediction module is specifically configured to: if the probability values of a plurality of designated punctuations are obtained, determining the sum of the probability values of the designated punctuations as the first prediction probability, or determining the maximum probability value in the probability values of the designated punctuations as the first prediction probability.
Optionally, the binary classification model is trained by:
obtaining a plurality of corpus samples and a classification label for each corpus sample, wherein the classification label marks whether the corpus sample is followed by an end character;
and training the binary classification model according to the corpus samples and their classification labels.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In a fourth aspect, an embodiment of the invention provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of any of the methods described above.
In a fifth aspect, an embodiment of the invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed by a processor, implement the steps of any of the methods described above.
With the technical scheme provided by the embodiments of the invention, a first prediction probability that an end character follows the temporary recognition result is predicted by the sentence-breaking model, a second prediction probability is predicted from the word frequency data, and the two are fused into a third prediction probability. When the third prediction probability is larger than the probability threshold, the probability that an end character follows the temporary recognition result is high, i.e. the temporary recognition result is a text with complete semantics; semantic analysis and other processing can then be performed on it to obtain the corresponding response data, and the intelligent device is controlled to execute the response data. Continuously input audio stream data can thus be truncated timely and accurately, so that the consecutive sentences contained in the audio stream are distinguished effectively and each sentence input by the user receives a timely response; the response time of the intelligent device is shortened and the user experience improved. In addition, the sentence-breaking model is an offline model trained on a large amount of corpus and achieves high prediction accuracy, while the word frequency data can be updated online in real time, so the prediction probability obtained from it adapts to changes in the usage environment and supports personalized customization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a speech signal processing method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a speech signal processing method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of acquiring word frequency data according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating a process for calculating a prediction probability according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a training sentence-breaking model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
real-time speech transcription (Real-time ASR) is based on a deep full-sequence convolutional neural network framework, long connection between an application and a language transcription core engine is established through a WebSocket protocol, audio stream data can be converted into character stream data in Real time, a user can generate a text while speaking, and a recognized temporary recognition result is output generally according to morphemes as a minimum unit. For example, the captured audio stream is: the steps of ' today ' day ' gas ' how ' to ' how ' and ' like ' are sequentially identified according to the sequence of the audio stream, the temporary identification result ' today ' is output, then the temporary identification result ' today ' is output, and so on until the whole audio stream is identified, and the final identification result ' how the weather is today ' is obtained. The real-time voice transcription technology can also carry out intelligent error correction on the previously output temporary recognition result based on subsequent audio stream and semantic understanding of context, so as to ensure the accuracy of the final recognition result, that is, the temporary recognition result based on the audio stream real-time output continuously changes along with time, for example, the temporary recognition result output for the first time is gold, the temporary recognition result output for the second time is corrected to be today, the temporary recognition result output for the third time can be today Tian, the temporary recognition result output for the fourth time is corrected to be today weather, and so on, and the accurate final recognition result is obtained through continuous recognition and correction.
Voice Activity Detection (VAD), also called voice endpoint detection, refers to detecting the presence of voice in a noisy environment. It is generally used in voice processing systems such as voice coding and voice enhancement, where it reduces the voice coding rate, saves communication bandwidth, reduces the energy consumption of mobile devices, and improves the recognition rate. A representative prior-art VAD method is ITU-T G.729 Annex B. Voice activity detection is now widely applied in speech recognition: it detects the part of a segment of audio that actually contains user voice, so that the silent part is eliminated and only the voice-bearing part is recognized.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In practice, one existing way to obtain voice data with complete semantics is for the user to press a designated key after inputting a section of voice data, informing the intelligent device that voice input is finished; the device then acquires and processes that section. In another mode, when the intelligent device receives voice continuously, the continuously input voice data is truncated using Voice Activity Detection (VAD) to obtain a complete section of voice data. However, when environmental noise is loud, this method may fail to truncate the voice data or truncate it in the wrong place, so that voice recognition cannot be performed in time; the response time of the intelligent device is prolonged, the user does not get a timely reply, and the user experience degrades.
To this end, the inventor of the present invention first collected natural language texts with complete semantics, appended an end character to each collected text (the end character marks the text as already semantically complete), and added the resulting texts to the corpus as corpora. Each corpus in the corpus is then segmented, and the number of times each word segmentation sequence appears across the corpora is counted as word frequency data. The word frequency data can be stored in a key-value format: a word segmentation sequence w1, …, wn serves as the key and its count as the value, so that the count for any sequence can be looked up conveniently at run time. On the basis of the word frequency data, a prediction probability can be determined for each word segmentation sequence, representing the probability that the next word segmentation segment after that sequence is the end character; this probability data is likewise stored with the sequence and its prediction probability associated, for example in key-value form with the word segmentation sequence as the key and the corresponding prediction probability as the value, so that the prediction probability for a sequence can be looked up conveniently.
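The key-value layout described above can be sketched as follows; the separator character is an arbitrary assumed choice.

```python
# Sketch of the key-value storage format: the word segmentation sequence
# w1,…,wn is joined into a single key string, and the count (or prediction
# probability) is stored as the value.

def to_key_value(ngram_data, sep="\u0001"):
    """Flatten tuple keys into strings suitable for a key-value store."""
    return {sep.join(gram): value for gram, value in ngram_data.items()}
```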
In addition, a sentence-break model may be trained in advance based on the corpora in the corpus; the model determines the probability that an end character occurs after an input word segmentation sequence.
On this basis, the specific speech signal processing process includes: performing voice recognition on audio stream data acquired by the smart device in real time to obtain a temporary recognition result; performing word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments; inputting a first word segmentation sequence composed of the plurality of word segmentation segments into the trained sentence-break model, and determining a first prediction probability that an end character occurs after the first word segmentation sequence according to the output of the sentence-break model; acquiring a second prediction probability that the next word segmentation segment after a second word segmentation sequence is an end character, wherein the second prediction probability is determined according to word frequency data, the word frequency data includes the number of times each word segmentation sequence occurs in the corpora of the corpus, the second word segmentation sequence is a sequence formed by the last N word segmentation segments of the temporary recognition result, and N is a positive integer; determining a third prediction probability according to the first prediction probability and the second prediction probability; and if the third prediction probability is greater than a probability threshold, performing semantic parsing on the temporary recognition result.
In this method, the sentence-break model predicts a first prediction probability that an end character occurs after the temporary recognition result, the word frequency data yields a second prediction probability that an end character occurs after the temporary recognition result, and the two are fused into a third prediction probability. When the third prediction probability is greater than the probability threshold, the probability that an end character follows the temporary recognition result is high, i.e., the temporary recognition result is a text with complete semantics; the temporary recognition result can then undergo semantic parsing and other processing to obtain corresponding response data, and the smart device is controlled to execute the response data. In this way, continuously input audio stream data can be truncated timely and accurately, so that multiple consecutive sentences contained in the audio stream data are effectively distinguished and each sentence input by the user receives a timely response, which shortens the response time of the smart device and improves the user experience. In addition, the sentence-break model is an offline model trained on a large amount of corpus data and has high prediction accuracy, while the word frequency data can be updated online in real time, so the prediction probability obtained from the word frequency data adapts to changes in the usage environment and supports personalized customization.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Fig. 1 is a schematic view of an application scenario of a speech signal processing method according to an embodiment of the present invention. During the interaction between the user 10 and the smart device 11, the smart device 11 continuously collects ambient sounds and continuously sends them to the server 12 in the form of audio stream data; besides the speech of the user 10, the audio stream data may include ambient sounds around the smart device 11 or the speech of other users. The server 12 sequentially performs voice recognition processing and semantic parsing processing on the audio stream data continuously sent by the smart device 11, determines corresponding response data according to the semantic parsing result, and controls the smart device 11 to execute the response data so as to give feedback to the user. The response data in the embodiment of the present invention includes, but is not limited to, text data, audio data, image data, video data, voice broadcasts, or control instructions, where the control instructions include but are not limited to: instructions for controlling the smart device to display expressions, instructions for controlling the motion of the smart device's action components (such as leading the way, navigation, photographing, dancing, and the like), and so on.
In this application scenario, the smart device 11 and the server 12 are communicatively connected through a network, which may be a local area network, a wide area network, or the like. The smart device 11 may be a smart speaker, a robot, or the like, a portable device (e.g., a mobile phone, a tablet, a notebook, or the like), or a Personal Computer (PC). The server 12 may be any server, a server cluster composed of several servers, or a cloud computing center capable of providing voice recognition and semantic parsing services.
Of course, the speech recognition processing and semantic parsing processing of the audio stream data, and subsequent processing such as determining the response data, may also be executed on the smart device side; the embodiment of the present invention does not limit the execution subject. For convenience of description, the embodiments provided by the present invention take speech processing performed at the server side as an example; the process of performing speech processing at the smart device side is similar and is not repeated here.
The speech signal processing method provided by the embodiment of the invention can process speech in any language, such as Chinese, English, Japanese, German, and the like. The embodiment of the present invention mainly takes the processing of Chinese as an example; the processing of other languages is similar and is not repeated here.
The following describes a technical solution provided by an embodiment of the present invention with reference to an application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present invention provides a speech signal processing method, including the following steps:
s201, voice recognition is carried out on audio stream data collected by the intelligent device in real time, and a temporary recognition result is obtained.
In the embodiment of the invention, after a user starts talking with the smart device, the smart device continuously collects the sound in its surrounding environment, converts the sound into audio stream data, and sends the audio stream data to the server. The server can perform voice recognition on the continuous audio stream data using technologies such as real-time speech transcription, and update the temporary recognition result in real time, where each update is performed on the basis of the previously updated temporary recognition result. It should be noted that the temporary recognition result may be updated in real time as the smart device uploads new audio stream data. For example, the temporary recognition result obtained at first may be "gold"; on that basis it is updated according to subsequent audio stream data and corrected to "today"; the next update may yield "today field"; and as updating continues based on the audio stream data, the result may be corrected to "today weather".
S202, performing word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments.
In specific implementation, the temporary recognition result may be segmented using an existing word segmentation tool (e.g., jieba, SnowNLP, THULAC, NLPIR, etc.) to divide it into a plurality of word segmentation segments. For example, if the temporary recognition result is "introduce blue and white porcelain", word segmentation yields the three segments "introduce", "next", and "blue and white porcelain".
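As an illustration of dictionary-based word segmentation (the tools named above use more sophisticated statistical methods, but the underlying idea of splitting text into dictionary words is the same), a minimal forward-maximum-matching sketch; the vocabulary and input sentence are hypothetical:

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    vocabulary word that matches; fall back to a single character."""
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                segments.append(piece)
                i += size
                break
    return segments

vocab = {"介绍", "一下", "青花瓷"}
segments = forward_max_match("介绍一下青花瓷", vocab)
# segments == ["介绍", "一下", "青花瓷"]
```

This matches the example in the text: "介绍一下青花瓷" ("introduce blue and white porcelain") is divided into three word segmentation segments.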
S203, inputting a first word segmentation sequence consisting of the plurality of word segmentation segments into the trained sentence-break model, and determining a first prediction probability that an end character occurs after the first word segmentation sequence according to the output of the sentence-break model.
In specific implementation, the sentence-break model can be a binary classification model or a punctuation mark model. The punctuation mark model predicts the probability that each punctuation mark appears after the input word segmentation sequence.
In the embodiment of the invention, the end character is treated as a word segmentation segment. Specifically, the end character is a pre-specified special character that can be distinguished from the characters in the temporary recognition result; for example, when processing Chinese speech, the end character may be "EOS", "#", or the like.
In specific implementation, the word segmentation tool can indicate which segments are rare words. If the word segmentation segments obtained in step S202 include rare words, each rare word can be segmented again using a sub-word segmentation algorithm, for example the Byte Pair Encoding (BPE) algorithm; this re-segmentation may also be called BPE processing. Rare words are word segmentation segments with a low frequency of occurrence in the corpus, for example, segments occurring fewer than a set number of times.
S204, acquiring a second prediction probability that the next word segmentation segment after the second word segmentation sequence is the end character, wherein the second prediction probability is determined according to the word frequency data, the word frequency data includes the number of times each word segmentation sequence occurs in the corpora of the corpus, the second word segmentation sequence is a sequence formed by the last N word segmentation segments of the temporary recognition result, and N is a positive integer.
Each corpus entry in the corpus of the embodiment of the present invention is a text with complete semantics, for example, "introduce blue and white porcelain" or "how is the weather today", and the last word segmentation segment of each entry is an end character, that is, each entry ends with the end character. Taking "EOS" as the end character, the entries are "introduce blue and white porcelain EOS" and "how is the weather today EOS".
Specifically, the second prediction probability of the occurrence of the end character after one word segmentation sequence can be predicted with reference to the following formula:
P(EOS | w_{m-N+1}, …, w_m) = C(w_{m-N+1}, …, w_m, EOS) / C(w_{m-N+1}, …, w_m),

where P(EOS | w_{m-N+1}, …, w_m) is the second prediction probability that the end character EOS occurs after the word segmentation sequence {w_{m-N+1}, …, w_m}, C(w_{m-N+1}, …, w_m, EOS) is the number of times the sequence {w_{m-N+1}, …, w_m, EOS} occurs in the corpus entries, C(w_{m-N+1}, …, w_m) is the number of times the sequence {w_{m-N+1}, …, w_m} occurs in the corpus entries, and N = 1, …, m. Therefore, the number of occurrences of each sequence {w_{m-N+1}, …, w_m, EOS} and {w_{m-N+1}, …, w_m} in the corpus entries must be counted in advance based on the corpus.
In the embodiment of the invention, N is a positive integer, and its value can be determined according to actual requirements. For example, when N = 1, the second prediction probability is P(EOS | w_m) = C(w_m, EOS) / C(w_m), i.e., the probability that the next segment is the end character is predicted from the last word segmentation segment of the temporary recognition result; when N = 2, it is P(EOS | w_{m-1}, w_m) = C(w_{m-1}, w_m, EOS) / C(w_{m-1}, w_m), predicted from the last two segments; when N = 3, it is P(EOS | w_{m-2}, w_{m-1}, w_m) = C(w_{m-2}, w_{m-1}, w_m, EOS) / C(w_{m-2}, w_{m-1}, w_m), predicted from the last three segments.
For example, if the temporary recognition result is "introduce blue and white porcelain", the word segmentation result is the three segments "introduce", "next", and "blue and white porcelain". If N = 2, the second word segmentation sequence is {next, blue and white porcelain}; the counts C_1 and C_2 corresponding to {next, blue and white porcelain} and {next, blue and white porcelain, EOS} are obtained from the word frequency data, and the second prediction probability that the end character occurs after {next, blue and white porcelain} is C_2/C_1. If N = 3, the second word segmentation sequence is {introduce, next, blue and white porcelain}; the counts C_3 and C_4 corresponding to {introduce, next, blue and white porcelain} and {introduce, next, blue and white porcelain, EOS} are obtained from the word frequency data, and the second prediction probability is C_4/C_3.
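The formula above amounts to two lookups in the word frequency data followed by a division; a minimal sketch, assuming the counts are stored as a tuple-keyed dictionary and using hypothetical segment names and counts:

```python
def second_prediction_probability(segments, counts, n=2, eos="EOS"):
    """P(EOS | last n segments) = C(last n segments, EOS) / C(last n segments)."""
    seq = tuple(segments[-n:])          # second word segmentation sequence
    m = counts.get(seq, 0)              # C(w_{m-n+1}, ..., w_m)
    k = counts.get(seq + (eos,), 0)     # C(w_{m-n+1}, ..., w_m, EOS)
    return k / m if m else 0.0

# hypothetical word frequency data: {next, porcelain} seen 20 times,
# 12 of which were followed by the end character
counts = {("next", "porcelain"): 20, ("next", "porcelain", "EOS"): 12}
p = second_prediction_probability(["introduce", "next", "porcelain"], counts)
# p == 12/20 == 0.6
```

Note that the function guards against an unseen second word segmentation sequence (count 0) by returning probability 0 rather than dividing by zero.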
In specific implementation, the corpus and the corresponding word frequency data can be updated online in real time, so that the second prediction probability obtained based on the word frequency data can adapt to the change of the use environment, and the prediction accuracy of the end character is improved.
It should be noted that there is no fixed order between steps S203 and S204: S203 may be executed first and then S204, S204 first and then S203, or both simultaneously. The embodiment of the present application is described taking executing S203 before S204 as an example, but is not limited thereto.
And S205, determining a third prediction probability according to the first prediction probability and the second prediction probability.
In particular, the product of the first prediction probability and the second prediction probability may be calculated as the third prediction probability. A weighted average of the first prediction probability and the second prediction probability may also be calculated as the third prediction probability.
In a specific implementation, the third prediction probability may be determined as follows: if the first prediction probability is greater than the second prediction probability, the third prediction probability is equal to the first prediction probability, otherwise, the third prediction probability is equal to the second prediction probability; or, if the first prediction probability is smaller than the second prediction probability, the third prediction probability is equal to the first prediction probability, otherwise, the third prediction probability is equal to the second prediction probability.
In the practical application process, the first prediction probability and the second prediction probability may also be processed by adopting other data processing manners to obtain a third prediction probability, which is not limited in the embodiment of the present application.
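The fusion strategies for step S205 described above — product, weighted average, and max/min selection — can be sketched as follows; the weight values are illustrative:

```python
def fuse_probabilities(p1, p2, mode="product", w1=0.5, w2=0.5):
    """Combine the first prediction probability (sentence-break model) and the
    second prediction probability (word frequency data) into the third."""
    if mode == "product":
        return p1 * p2
    if mode == "weighted_average":
        return (w1 * p1 + w2 * p2) / (w1 + w2)
    if mode == "max":
        return max(p1, p2)
    if mode == "min":
        return min(p1, p2)
    raise ValueError(f"unknown fusion mode: {mode}")

p3 = fuse_probabilities(0.8, 0.6, mode="weighted_average")
# p3 == 0.7 with equal weights
```

The product is the most conservative choice (both signals must be high), while max/min trade off recall against precision of the truncation decision.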
And S206, if the third prediction probability is larger than the probability threshold, performing semantic analysis on the temporary recognition result.
In specific implementation, the probability threshold may be determined according to actual requirements, which the embodiment of the present invention does not limit. If the third prediction probability is not greater than the probability threshold, indicating that the temporary recognition result does not yet have complete semantics, the process returns to step S202 to perform word segmentation processing on the next temporary recognition result; if the third prediction probability is greater than the probability threshold, the temporary recognition result is a text with complete semantics, and semantic parsing and other processing can be performed on it.
The method of the embodiment of the invention predicts, via the sentence-break model, a first prediction probability that the end character occurs after the temporary recognition result, predicts, via the word frequency data, a second prediction probability that the end character occurs after the temporary recognition result, and fuses the two into a third prediction probability. When the third prediction probability is greater than the probability threshold, the probability of an end character following the temporary recognition result is high, i.e., the temporary recognition result is a text with complete semantics; the temporary recognition result can then undergo semantic parsing and other processing to obtain corresponding response data, and the smart device is controlled to execute the response data. Continuously input audio stream data can thus be truncated timely and accurately, the multiple consecutive sentences it contains are effectively distinguished, and each sentence input by the user receives a timely response, shortening the response time of the smart device and improving the user experience. In addition, because the method does not truncate based on VAD detection results, it is better suited to noisy public and service scenarios. The sentence-break model of the embodiment of the present application is an offline model trained on a large amount of corpus data with high prediction accuracy, while the word frequency data can be updated online in real time, so the prediction probability obtained from it adapts to changes in the usage environment and supports personalized customization.
In specific implementation based on any of the above embodiments, referring to fig. 3, the word frequency data may be obtained as follows:
s301, performing word segmentation processing on each corpus in the corpus to obtain word segmentation segments corresponding to each corpus.
In specific implementation, the existing word segmentation tools (such as jieba word segmentation tools) can be used for performing word segmentation processing on each corpus in the corpus so as to divide each corpus into a plurality of word segmentation segments. For example, if the corpus is "introduce blue and white porcelain EOS", the word segmentation results in four word segmentation segments of "introduce", "next", "blue and white porcelain" and "EOS". And in the word segmentation process, the ending character is used as a word segmentation segment.
S302, determining a sequence formed by continuous N word segmentation segments in each corpus as an N-element word segmentation sequence.
In specific implementation, the value of N may be determined according to actual application requirements, for example, N may take values of 2, 3, 4, and the like, and the embodiment of the present invention is not limited. For a corpus, the times of occurrence of the word segmentation sequences with different lengths in each corpus of the corpus can be counted, that is, N can take a plurality of different values, so that the word frequency data corresponding to the corpus will include the times of occurrence of the word segmentation sequences with various lengths in each corpus of the corpus.
S303, determining a sequence consisting of continuous N +1 participle fragments in each corpus as an N + 1-element participle sequence.
S304, counting the occurrence frequency of each N-element word segmentation sequence and each N + 1-element word segmentation sequence in each corpus of the corpus to obtain word frequency data.
For example, a corpus entry w_1w_2w_3w_4w_5 corresponds to the word segmentation segments w_1, w_2, w_3, w_4, w_5, whose sequence is {w_1, w_2, w_3, w_4, w_5}. When N = 2, any 2 consecutive segments of the sequence form a binary word segmentation sequence, giving {w_1, w_2}, {w_2, w_3}, {w_3, w_4}, {w_4, w_5}; any 3 consecutive segments form a ternary word segmentation sequence, giving {w_1, w_2, w_3}, {w_2, w_3, w_4}, {w_3, w_4, w_5}. Thus, for the corpus entry w_1w_2w_3w_4w_5, the above 7 word segmentation sequences are obtained in total. The binary and ternary word segmentation sequences corresponding to all entries in the corpus are obtained in this way, and the number of times each word segmentation sequence occurs in the corpus entries is then counted.
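Steps S302–S303 reduce to extracting every run of N and N+1 consecutive word segmentation segments; a minimal sketch using the w_1…w_5 example from the text:

```python
def ngram_sequences(segments, n):
    """All sequences of n consecutive word segmentation segments in a corpus entry."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]

segments = ["w1", "w2", "w3", "w4", "w5"]
bigrams = ngram_sequences(segments, 2)   # 4 binary word segmentation sequences
trigrams = ngram_sequences(segments, 3)  # 3 ternary word segmentation sequences
# 4 + 3 == 7 word segmentation sequences in total for this entry
```

Feeding each entry's output into a shared counter then yields the word frequency data of step S304.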
The larger N is, the more word segmentation segments each sequence contains and the higher the prediction accuracy, but the more complicated the word frequency statistics become. Tests show that when N = 2, only the counts of binary and ternary word segmentation sequences in the corpus entries need to be collected, so the statistics process is relatively simple while accuracy in the prediction process is still guaranteed.
In specific implementation, different corpus entries may contain the same word segmentation sequence, and in the statistics each distinct sequence has only one count. For example, suppose there are 3 entries in the corpus, including "introduce blue and white porcelain EOS" and "introduce beijing gou EOS", and all 3 entries yield the word segmentation sequence {introduce, next}; then the count of the binary sequence {introduce, next} across the entries is 3, the counts of {introduce, next, beijing} and {next, beijing} are 2, and the counts of the other word segmentation sequences are 1 each.
In specific implementation, referring to fig. 4, based on the word frequency data obtained through statistics, in the speech signal processing process, the prediction probability that the next word segmentation segment after the second word segmentation sequence is an end character can be obtained on line in the following manner:
s401, obtaining the corresponding times M of the second word segmentation sequence from the word frequency data.
For example, if the temporary recognition result is w_1w_2…w_{m-1}w_m, the second word segmentation sequence is {w_{m-N+1}, …, w_m}. The number of times the sequence {w_{m-N+1}, …, w_m} occurs in the corpus entries is then obtained from the word frequency data.
N in this step is determined by the length of the word segmentation sequences in the word frequency data. For example, if N in step S302 is 2, the second word segmentation sequence contains 2 segments; in this case, if the temporary recognition result is w_1w_2…w_{m-1}w_m, the second word segmentation sequence is {w_{m-1}, w_m}.
S402, obtaining, from the word frequency data, the count K corresponding to a third word segmentation sequence, where the third word segmentation sequence is the sequence obtained by appending the end character to the second word segmentation sequence.
For example, if the second word segmentation sequence is {w_{m-N+1}, …, w_m}, the third word segmentation sequence is {w_{m-N+1}, …, w_m, EOS}. The count corresponding to {w_{m-N+1}, …, w_m, EOS} is then obtained from the word frequency data.
And S403, determining a second prediction probability according to K and M.
As a possible implementation, the second prediction probability may be determined by the formula P = K/M. Specifically, if M equals 0, step S403 is not performed and the second prediction probability is directly determined to be 0.
For example, let N = 2 and the temporary recognition result be "introduce next". The word segmentation result is "introduce" and "next", the second word segmentation sequence is {introduce, next}, and the third word segmentation sequence is {introduce, next, EOS}. If the count of {introduce, next} obtained from the word frequency data is 1000 and the count of {introduce, next, EOS} is 2, the second prediction probability that an end character occurs after {introduce, next} is 2/1000 = 0.002. When the next temporary recognition result "introduce blue and white porcelain" is processed, the word segmentation result is "introduce", "next", "blue and white porcelain", the second word segmentation sequence is {next, blue and white porcelain}, and the third word segmentation sequence is {next, blue and white porcelain, EOS}. If the count of {next, blue and white porcelain} obtained from the word frequency data is 20 and the count of {next, blue and white porcelain, EOS} is 12, the second prediction probability is 12/20 = 0.6.
As another possible implementation, the second prediction probability may be determined by the formula P = K/(M + β), where β is a number much smaller than M (for example 1 or 0.1) used to prevent division by zero when M is 0.
As still another possible implementation, the prediction probability may be determined by the formula P = aK/(bM), where a and b are weighting coefficients with 0 < a ≤ 1 and 0 < b ≤ 1, whose specific values can be configured according to the actual application scenario.
Three possible implementation manners for determining the prediction probability according to K and M are given above, but the specific implementation manner for determining the prediction probability is not limited in the embodiment of the present invention, and any manner that obtains the prediction probability based on K and M is applicable in the embodiment of the present invention.
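The three implementations above differ only in how K and M are combined; a minimal sketch, with illustrative parameter values:

```python
def prob_ratio(k, m):
    """P = K/M, defined as 0 when M is 0 (step S403 skipped)."""
    return k / m if m else 0.0

def prob_smoothed(k, m, beta=1.0):
    """P = K/(M + beta); beta << M guards against M == 0."""
    return k / (m + beta)

def prob_weighted(k, m, a=1.0, b=1.0):
    """P = aK/(bM), with weighting coefficients 0 < a, b <= 1."""
    return (a * k) / (b * m) if m else 0.0

# counts from the "introduce next" example: K = 2, M = 1000
p = prob_ratio(2, 1000)     # 0.002
q = prob_smoothed(2, 1000)  # 2/1001, slightly below 0.002
```

With a = b the weighted form reduces to the plain ratio, so in practice the coefficients matter only when the two counts come from differently trusted sources.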
In specific implementation, the prediction probability corresponding to each N-gram word segmentation sequence determined from the corpus entries can be calculated in advance from the word frequency data to obtain probability data. Specifically, the prediction probability that the end character occurs after each N-gram sequence can be calculated as follows: obtain from the word frequency data the count U corresponding to the N-gram sequence {w_1, …, w_n}; append the end character to {w_1, …, w_n} to obtain the sequence {w_1, …, w_n, EOS}, and obtain from the word frequency data the count V corresponding to {w_1, …, w_n, EOS}; determine the prediction probability corresponding to {w_1, …, w_n} from U and V. The resulting probability data includes the prediction probability that the end character occurs after each N-gram sequence determined from the corpus. Each N-gram sequence {w_1, …, w_n} is stored in association with its prediction probability, so that the probability can be looked up quickly during speech signal processing. For the specific way of determining the prediction probability from U and V, refer to the above implementations of determining the prediction probability from K and M, which are not repeated here.
In specific implementation, the corpus in the corpus can be updated in the following manner:
in the first mode, a text with complete semantics after manual intervention is obtained, an ending character is added behind the text, and the text with the ending character added is used as a newly added corpus and added into a corpus.
In specific implementation, texts with complete semantics can be obtained through manual processing, and end characters are added after the texts with complete semantics to obtain a newly added corpus and the newly added corpus is added into the corpus. For example, an operator may perform manual intervention on text data corresponding to voice data acquired by the intelligent device to obtain a text with complete semantics, and add an end character to obtain a new corpus to be added to the corpus; for another example, an operator may perform manual intervention on text data corresponding to the recorded voice data to obtain a text with complete semantics, and add an end character to obtain a new corpus to be added to the corpus; for another example, an operator may obtain a text with complete semantics from a third party (e.g., a network), and add an end character to obtain a new corpus to be added to the corpus.
In the second mode, during speech signal processing, if the third prediction probability determined in step S205 remains less than or equal to the probability threshold, and a speech start point and a speech end point in the audio stream data are obtained by voice endpoint detection, an end character is appended to the final recognition result corresponding to the audio stream data between the speech start point and the speech end point, and the final recognition result with the end character appended is added to the corpus as a new corpus entry.
In practical application, a speech start point and a speech end point contained in an audio data stream can be located based on voice endpoint detection (VAD), so the temporary recognition result of the audio stream data between the VAD-detected start and end points is taken as a final recognition result with complete semantics, and subsequent processing such as semantic parsing is performed on it. After the final recognition result is obtained, the cached temporary recognition result can be cleared, while subsequently collected audio stream data continues to undergo speech recognition and other processing in real time. In specific implementation, the voice endpoint identifier marks the end time of speech in the audio stream data; receiving it indicates that the user has input a complete piece of speech, so the temporary recognition result obtained from the audio stream data before the identifier is considered a semantically complete sentence, i.e., it is determined to be the final recognition result. Therefore, if the third prediction probability determined from the temporary recognition result remains less than or equal to the probability threshold, indicating that truncation cannot be achieved based on the current word frequency data, an end character may be appended to the final recognition result and the result added to the corpus as a new corpus entry.
Based on any of the above embodiments, in a specific implementation, if there is a new corpus in the corpus, performing word segmentation processing on the new corpus to obtain an N-gram segmentation sequence and an N + 1-gram segmentation sequence corresponding to the new corpus; and updating the word frequency data corresponding to the N-element word segmentation sequence and the N + 1-element word segmentation sequence corresponding to the newly added corpus. The specific process of updating the word frequency data may refer to the steps shown in fig. 3, and is not described again.
Therefore, the method provided by the embodiment of the invention can update the corpus and the corresponding word frequency data on line in real time according to the data acquired by the intelligent equipment, so that the processing result is continuously optimized, and the prediction result is more accurate.
In specific implementation, if the word frequency data is updated, the prediction probability corresponding to each word segmentation sequence is updated according to the updated word frequency data. The specific process of updating the prediction probability may refer to the step of calculating probability data, and is not described in detail. For example, if the count corresponding to the N-element word segmentation sequence {w1, …, wn} is updated, or the count corresponding to the N + 1-element word segmentation sequence {w1, …, wn, EOS}, obtained by adding an end character after the N-element word segmentation sequence {w1, …, wn}, is updated, the prediction probability corresponding to the N-element word segmentation sequence {w1, …, wn} is re-determined based on the updated counts.
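The online update described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the counts, sequences, and the `"EOS"` marker are hypothetical, and the probability is derived as count(sequence + EOS) / count(sequence) as described in the surrounding text.

```python
from collections import Counter

# Hypothetical word-frequency store: counts of N-element and N+1-element
# word segmentation sequences, keyed by tuples of segments. "EOS" stands
# for the end character described in the text.
counts = Counter({
    ("introduce", "next"): 10,
    ("introduce", "next", "EOS"): 4,
})

def prediction_probability(segments):
    """P(next segment is EOS | last N segments) = count(segments + EOS) / count(segments)."""
    m = counts[tuple(segments)]
    if m == 0:
        return 0.0
    return counts[tuple(segments) + ("EOS",)] / m

def update(segments, followed_by_eos):
    """Online update: bump the N-element count, and the N+1-element count
    when the sequence was followed by the end character; the probability is
    then re-derived from the updated counts."""
    counts[tuple(segments)] += 1
    if followed_by_eos:
        counts[tuple(segments) + ("EOS",)] += 1

p_before = prediction_probability(["introduce", "next"])  # 4/10
update(["introduce", "next"], followed_by_eos=True)
p_after = prediction_probability(["introduce", "next"])   # 5/11
```

Because the probability is re-derived from the counts on demand, updating the counts is all that is needed to keep the prediction current.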
Therefore, the method provided by the embodiment of the invention can update the probability data on line in real time according to the data acquired by the intelligent equipment, so that the processing result is continuously optimized, and the prediction result is more accurate.
On the basis of any of the above embodiments, step S204 can be implemented as follows: acquiring exclusive word frequency data corresponding to the intelligent device, and determining, according to the acquired exclusive word frequency data, a second prediction probability that the next word segmentation segment after the second word segmentation sequence is an end character; and/or acquiring general word frequency data, and determining, according to the acquired general word frequency data, a second prediction probability that the next word segmentation segment after the second word segmentation sequence is the end character.
In specific implementation, different proprietary corpora can be configured for different application scenarios such as each smart device, each user, each enterprise, each service line, and the like, wherein an effective range can be configured for the proprietary corpora, and the effective range includes but is not limited to: the device level effective range, the user level effective range, the enterprise level effective range, the service line level effective range and the like. And carrying out word frequency statistics on word segmentation sequences formed by the corpora in the configured exclusive corpus to obtain exclusive word frequency data corresponding to different exclusive corpora, and determining the effective range of each exclusive word frequency data. And determining the effective range of the exclusive word frequency data according to the effective range of the exclusive corpus corresponding to the exclusive word frequency data. For example, the effective range of the dedicated word frequency data obtained by performing word frequency statistics based on a dedicated corpus is configured to be the same as the effective range of the dedicated corpus. For another example, the effective range of the dedicated word frequency data obtained by performing word frequency statistics based on at least two dedicated corpora is configured to be the same as the effective range of the dedicated corpus with the largest range.
In practical applications, priorities may be set for the effective ranges, for example, the priority of the device-level effective range is higher than the priority of the user-level effective range, the priority of the user-level effective range is higher than the priority of the enterprise-level effective range, and the priority of the enterprise-level effective range is higher than the priority of the service-line-level effective range. For the same intelligent device, if there are multiple pieces of exclusive word frequency data effective for the intelligent device, the exclusive word frequency data whose effective range has the highest priority is selected according to the priority of the effective range of each piece of exclusive word frequency data, and the temporary recognition result of the audio data sent by the intelligent device is predicted based on that highest-priority exclusive word frequency data. For example, if the exclusive word frequency data effective for intelligent device A includes exclusive word frequency data Q_A with a user-level effective range and exclusive word frequency data Q_B with a device-level effective range, the word frequency data Q_B with the higher priority is selected for prediction.
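The priority-based selection can be sketched as below. The scope names, numeric priorities, and the Q_A/Q_B entries are illustrative assumptions; only the ordering (device > user > enterprise > service line) follows the text.

```python
# Hypothetical registry of exclusive word-frequency data, each entry tagged
# with its effective range. Priorities follow the order described above:
# device-level > user-level > enterprise-level > service-line-level.
PRIORITY = {"device": 3, "user": 2, "enterprise": 1, "service_line": 0}

exclusive_data = [
    {"name": "Q_A", "scope": "user"},
    {"name": "Q_B", "scope": "device"},
]

def select_word_freq_data(candidates):
    """Pick the exclusive word-frequency data whose effective range has the
    highest priority; return None when no exclusive data is in effect, in
    which case the general word frequency data would be used instead."""
    if not candidates:
        return None
    return max(candidates, key=lambda d: PRIORITY[d["scope"]])

chosen = select_word_freq_data(exclusive_data)  # Q_B: device level wins
```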
Specifically, the intelligent device uploads audio stream data and simultaneously reports identification information of the intelligent device, and special word frequency data effective to the intelligent device can be obtained through the identification information of the intelligent device.
The general word frequency data in the embodiment of the invention is the word frequency data obtained based on a general corpus. In specific implementation, the effective range of the general word frequency data can be set to be globally effective, that is, all the intelligent devices can use the general word frequency data.
In specific implementation, when the user does not have the exclusive word frequency data, the general word frequency data can be used for determining the second prediction probability that the next word segmentation segment after the second word segmentation sequence is the ending character. Or when the prediction probability cannot be determined through the exclusive word frequency data, the prediction probability that the next word segmentation segment after the second word segmentation sequence is the end character can be determined through the general word frequency data. Or determining the prediction probability that the next word segmentation after the second word segmentation sequence is the end character by using the general word frequency data, determining the prediction probability that the next word segmentation after the second word segmentation sequence is the end character by using the special word frequency data, and determining the second prediction probability corresponding to the second word segmentation sequence according to the prediction probability determined based on the general word frequency data and the prediction probability determined based on the special word frequency data.
The corresponding exclusive probability data can be determined based on the exclusive word frequency data, an effective range can be configured for each exclusive probability data, and the specific configuration mode can refer to the exclusive word frequency data and is not repeated.
For this reason, on the basis of any of the above embodiments, step S204 may be implemented as follows: acquiring exclusive probability data corresponding to the intelligent device, and determining, according to the acquired exclusive probability data, a second prediction probability that the next word segmentation segment after the second word segmentation sequence is an end character; and/or acquiring general probability data, and determining, according to the acquired general probability data, a second prediction probability that the next word segmentation segment after the second word segmentation sequence is the end character.
The general probability data in the embodiment of the invention is determined based on the general word frequency data. In specific implementation, the effective range of the general probability data can be set to be globally effective, that is, all the intelligent devices can use the general probability data.
During specific implementation, a corpus corresponding to the valid exclusive word frequency data of the intelligent device can be obtained according to the identification information of the intelligent device reported while the intelligent device uploads the audio stream data, a new corpus obtained based on the final recognition result corresponding to the audio stream data is added into the corpus, and then the word frequency data corresponding to the corpus is updated. Or, a corpus corresponding to exclusive probability data that takes effect on the intelligent device can be acquired according to identification information of the intelligent device that is reported while the intelligent device uploads audio stream data, a new corpus obtained based on a final recognition result corresponding to the audio stream data is added to the corpus, and then the word frequency data and the probability data corresponding to the corpus are updated.
Therefore, the method of the embodiment of the invention can obtain different corpora and corresponding exclusive word frequency data or exclusive probability data aiming at application scenes of each intelligent device, each user, each enterprise, each service line and the like so as to adapt to different users or scenes, and can finely adjust the corpora of different users or scenes in the using process in a mode of updating the corpora on line, so that the processing result is more accurate.
On the basis of any of the above embodiments, the method of the embodiment of the present invention further predicts a word segmentation segment that may appear after the temporary recognition result, and specifically includes the following steps: if the third prediction probability is smaller than or equal to the probability threshold, determining word segmentation segments with the maximum probability after the second word segmentation sequence according to the word frequency data; and controlling the intelligent device to output the determined word segmentation segments.
In specific implementation, the first N word segmentation segments can be determined from the word frequency data as the N + 1-element word segmentation sequence of the second word segmentation sequence, the N + 1-element word segmentation sequence with the largest frequency of occurrence in each corpus is selected from the determined N + 1-element word segmentation sequences, and the last word segmentation segment of the selected N + 1-element word segmentation sequence is used as the word segmentation segment with the largest occurrence probability behind the second word segmentation sequence.
For example, suppose the temporary recognition result is "introduce next" and N is 2, so that the second word segmentation sequence is {introduce, next}. The ternary word segmentation sequences whose first two word segmentation segments are "introduce" and "next", such as {introduce, next, blue and white porcelain} and {introduce, next, Beijing}, are determined from the word frequency data. Assuming the count corresponding to the ternary word segmentation sequence {introduce, next, Beijing} is the largest, the last word segmentation segment "Beijing" in {introduce, next, Beijing} is taken as the word segmentation segment with the greatest probability of appearing after the second word segmentation sequence {introduce, next}, and the intelligent device is controlled to output the word segmentation segment "Beijing", so that the intelligent device can realize intention prediction and display the predicted intention.
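The lookup described above can be sketched as follows, assuming N = 2 and hypothetical trigram counts mirroring the example; the segment strings are placeholders, not data from the patent.

```python
from collections import Counter

# Hypothetical N+1-element (here ternary) word segmentation sequence counts.
trigram_counts = Counter({
    ("introduce", "next", "blue and white porcelain"): 3,
    ("introduce", "next", "Beijing"): 7,
})

def most_probable_next_segment(second_sequence):
    """Among the N+1-element sequences whose first N segments equal the
    second word segmentation sequence, return the last segment of the most
    frequent one, i.e. the segment most likely to appear next."""
    prefix = tuple(second_sequence)
    matches = {seq: n for seq, n in trigram_counts.items() if seq[:-1] == prefix}
    if not matches:
        return None
    best = max(matches, key=matches.get)
    return best[-1]

predicted = most_probable_next_segment(["introduce", "next"])  # "Beijing"
```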
Based on any of the above embodiments, when the sentence segmentation model is a binary classification model, step S203 specifically includes: obtaining the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character, and determining the probability value as the first prediction probability.
For example, if the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character is 0.4, the first prediction probability corresponding to the first word segmentation sequence is 0.4.
In one possible implementation, the binary classification model may be trained by: obtaining a plurality of corpus samples and a classification label of each corpus sample, wherein the classification label is used for marking whether an end character follows the corpus sample; and training the binary classification model according to the corpus samples and their classification labels.
Specifically, the binary model may be trained according to the process shown in fig. 5, which includes:
S501, a corpus sample set is obtained, where the corpus sample set includes a plurality of corpus samples and a classification label of each corpus sample, and the classification label is used for marking whether an end character follows the corpus sample.
In specific implementation, the classification label of a corpus sample that is followed by an end character may be "1", and the classification label of a corpus sample that is not followed by an end character may be "0".
S502, performing word segmentation processing on each corpus sample to obtain a plurality of word segmentation segments corresponding to each corpus sample.
In specific implementation, if it is determined that the word segmentation segments corresponding to the corpus samples contain rare words, performing word segmentation on each rare word again by using a sub-word segmentation algorithm.
S503, inputting the word segmentation sequence formed by the plurality of word segmentation segments corresponding to each corpus sample into the deep learning model to obtain a prediction probability value that an end character appears after the corpus sample; if the prediction probability value is greater than a preset threshold value, the prediction result is that an end character appears after the corresponding corpus sample; otherwise, the prediction result is that no end character appears after the corresponding corpus sample.
In this step, each corpus sample is subjected to word segmentation to obtain a plurality of word segmentation segments, and the word segmentation sequence of the corpus sample can be formed according to the position of each word segmentation segment in the corpus sample. For example, if the corpus sample is "introduce blue and white porcelain" and the word segmentation segments obtained after word segmentation are "introduce", "next" and "blue and white porcelain", then the word segmentation sequence finally formed according to the occurrence position of each word segmentation segment in the corpus sample is {introduce, next, blue and white porcelain}.
In specific implementation, after the word segmentation sequence corresponding to each corpus sample is input into the deep learning model, the deep learning model can analyze the context information among the word segmentation segments in the word segmentation sequence, and then determine, according to that context information, the prediction probability value of an end character appearing after the word segmentation sequence. If the prediction probability value exceeds the preset threshold value, it is determined that an end character appears after the corpus sample and the output prediction result is "1"; if the prediction probability value does not exceed the preset threshold value, it is determined that no end character appears after the corpus sample and the output prediction result is "0".
S504, adjusting parameters of the deep learning model according to the classification labels and the corresponding prediction results in each corpus sample.
In specific implementation, a loss function for determining the deviation between the original classification label of the corpus sample and the prediction result corresponding to the corpus sample output by the deep learning model can be calculated, and then the parameters of the deep learning model are adjusted by using a gradient descent algorithm so as to reduce the loss function, and the adjustment is stopped until the prediction result corresponding to the corpus sample output by the adjusted deep learning model is consistent with the original classification label of the corpus sample.
And S505, testing the adjusted deep learning model by using the test sample, and determining the prediction accuracy of the deep learning model according to the test result.
Wherein, the test sample is a corpus marked with classification labels.
S506, judging whether the prediction accuracy is greater than a preset accuracy, and if not, entering S507; if yes, the process proceeds to S508.
And S507, training the adjusted deep learning model according to the new corpus sample set, taking the trained deep learning model as the new adjusted deep learning model, and returning to the S505.
The new corpus sample set is different from the corpus samples used in training the two-class model.
And S508, taking the adjusted deep learning model as the established two classification models.
Based on any of the above embodiments, when the sentence segmentation model is the punctuation marking model, step S203 specifically includes: obtaining the probability value, output by the punctuation marking model, of a designated punctuation mark appearing after the first word segmentation sequence, and determining the probability value as the first prediction probability.
In specific implementation, the punctuation mark model outputs probability values corresponding to all punctuation marks which may appear after the first word segmentation sequence, and obtains the probability values corresponding to the designated punctuation marks from the data output by the punctuation mark model. Where a given punctuation may be a punctuation representing the end of a piece of text, such as a period, question mark, exclamation mark, and the like. In practical application, which punctuations belong to the designated punctuation can be predetermined according to practical application requirements, and the embodiment of the application is not limited.
In specific implementation, if the probability value of only one designated punctuation mark is obtained, the probability value of the designated punctuation mark is determined as a first prediction probability. For example, the first word segmentation sequence is input into the punctuation mark model, the punctuation mark model outputs a probability value corresponding to "period" of 0.6, a probability value corresponding to "comma" of 0.4, and only if "period" is a designated punctuation mark, the first prediction probability corresponding to the first word segmentation sequence is determined to be 0.6.
In specific implementation, if the probability values of a plurality of designated punctuation marks are obtained, the sum of the probability values of the designated punctuation marks may be determined as the first prediction probability. For example, the first word segmentation sequence is input into the punctuation mark model, and the punctuation mark model outputs a probability value of 0.4 for "period", 0.3 for "exclamation mark", 0.2 for "comma", and 0.1 for "pause mark"; if "period" and "exclamation mark" are designated punctuation marks, the first prediction probability corresponding to the first word segmentation sequence is the sum of the probability values corresponding to "period" and "exclamation mark", that is, the first prediction probability is 0.7.
In specific implementation, if the probability values of a plurality of designated punctuation marks are obtained, the maximum of the probability values of the designated punctuation marks may instead be determined as the first prediction probability. For example, the first word segmentation sequence is input into the punctuation mark model, and the punctuation mark model outputs a probability value of 0.4 for "period", 0.3 for "exclamation mark", 0.2 for "comma", and 0.1 for "pause mark"; if "period" and "exclamation mark" are designated punctuation marks, and the probability value of "period" is higher than that of "exclamation mark", the first prediction probability corresponding to the first word segmentation sequence is the probability value of "period", that is, the first prediction probability is 0.4.
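The two aggregation strategies above can be sketched together. The model output dictionary and the set of designated marks are hypothetical stand-ins for the punctuation marking model's real output.

```python
# Hypothetical output of the punctuation marking model: probability of each
# punctuation mark appearing after the first word segmentation sequence.
model_output = {"period": 0.4, "exclamation": 0.3, "comma": 0.2, "pause": 0.1}
DESIGNATED = {"period", "exclamation"}  # marks that end a piece of text

def first_prediction_probability(probs, designated, strategy="sum"):
    """Aggregate designated-mark probabilities into the first prediction
    probability: either sum them or take the maximum, as described above."""
    values = [p for mark, p in probs.items() if mark in designated]
    if not values:
        return 0.0
    return sum(values) if strategy == "sum" else max(values)

p_sum = first_prediction_probability(model_output, DESIGNATED, "sum")  # 0.7
p_max = first_prediction_probability(model_output, DESIGNATED, "max")  # 0.4
```

When only one designated mark is configured, both strategies reduce to that mark's probability value, matching the single-mark case described earlier.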
In practical application, the first prediction probability may also be determined according to the probability value of the punctuation mark output by the punctuation mark model in other manners, which is not limited to the above embodiment.
In one possible implementation, the punctuation marking model may be trained by: obtaining a plurality of corpus sentences and punctuation marking information of each corpus sentence, wherein the punctuation marking information comprises punctuation marks appearing in the corpus sentences and the positions of each punctuation mark in the corpus sentences; and training the punctuation marking model according to the corpus sentences and the punctuation marking information of the corpus sentences. Specifically, the process of training the punctuation mark model comprises the following steps:
step one, obtaining a preset number of sample sentences, wherein the end of each sample sentence is provided with a punctuation mark.
And step two, splicing part or all of the sample sentences, segmenting each spliced sample sentence, and determining the segmented sample sentences as the corpus sentences.
For the situation that speech data needs to be recognized in real time, a sentence may be formed only by part of the currently recognized character sequence together with the character sequence recognized last time; in this case, if punctuation prediction is performed on these character sequences, punctuation is likely to appear in the middle of a character sequence.
Therefore, after a preset number of sample sentences with punctuation marks at the end of the sentences are obtained, part or all of the sample sentences can be spliced, each spliced sample sentence is segmented, for example, the segmented sample sentences are segmented according to a set step length or randomly, and then the segmented sample sentences are used as linguistic sentences for establishing the punctuation marking model, so that the probability of the punctuation marks at the end of the sentences can be reduced, the probability of the punctuation marks in the sentences is improved, the scenes are more fitted, and subsequently, when the established deep learning model is applied to the scenes, the sentence breakage accuracy of the deep learning model is higher.
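The splice-then-segment preparation of corpus sentences can be sketched as below. The sample sentences and the fixed step are illustrative; the patent also allows random segmentation, which is omitted here for brevity.

```python
def build_corpus_sentences(sample_sentences, step):
    """Splice sample sentences (each ending in a punctuation mark) into one
    character stream, then cut the stream every `step` characters so that
    punctuation tends to land in the middle of a segment rather than at its
    end, matching the real-time recognition scenario."""
    stream = "".join(sample_sentences)
    return [stream[i:i + step] for i in range(0, len(stream), step)]

samples = ["I want to go.", "Is it far?"]
corpus_sentences = build_corpus_sentences(samples, step=5)
# The period from the first sample now falls mid-segment: "go.Is"
```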
And step three, performing word segmentation processing on each corpus sentence, determining words contained in the corpus sentence, and performing word segmentation processing on each rare word again by using a sub-word segmentation algorithm if the rare word is determined to exist in the words contained in the corpus sentence.
And step four, inputting a word sequence formed by words obtained after word segmentation and segmentation of each corpus sentence into the deep learning model for punctuation prediction.
In this step, punctuation prediction is performed on the word sequence corresponding to the corpus sentence, that is, punctuation symbols possibly appearing in the corpus sentence and the position of each punctuation symbol in the corpus sentence are predicted.
In this step, each corpus sentence is subjected to word segmentation and segmentation to obtain a plurality of words, and a word sequence of the corpus sentence can be formed according to the position of each word in the corpus sentence. For example, if the corpus sentence is "i want to go to school", and the words obtained after the corpus sentence is subjected to word segmentation and segmentation are "i", "go to school" and "want", the word sequence finally formed according to the appearance position of each word in the corpus sentence is { i, want, go to school }.
And step five, adjusting parameters of the deep learning model according to the punctuation mark information of each corpus sentence and the punctuation prediction result corresponding to the corpus sentence output by the deep learning model.
In specific implementation, a loss function for determining deviation between the punctuation mark information of the corpus sentence and the punctuation prediction result corresponding to the corpus sentence output by the deep learning model can be calculated, and then parameters of the deep learning model are adjusted by using a gradient descent algorithm so as to reduce the loss function, and the adjustment is stopped until the punctuation prediction result corresponding to the corpus sentence output by the adjusted deep learning model is the same as the punctuation mark information of the corpus sentence.
And step six, testing the adjusted deep learning model by using the test sentences, and determining the marking accuracy of the deep learning model according to the test result. Wherein the test sentence is a sentence with known punctuation marking information.
Step seven, judging whether the marking accuracy is greater than the preset accuracy; if not, entering step eight; if yes, entering step nine.
And step eight, training the adjusted deep learning model according to at least one new corpus sentence, taking the trained deep learning model as the newly adjusted deep learning model, and returning to the step six. The new corpus sentences are newly added corpus sentences, and are different from the corpus sentences used in training the punctuation mark model.
And step nine, taking the adjusted deep learning model as the established punctuation marking model.
In the embodiment of the present application, the binary classification model and the punctuation marking model may be obtained based on any existing deep learning model, for example, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Deep Neural Networks (DNNs), Deep Belief Networks (DBNs), and other neural networks. In addition, the embodiment of the application also provides a network structure for the deep learning model: embedding -> bilstm -> softmax, where the arrows represent the order of the layers. The embedding layer is used for encoding the semantics of each word segmentation segment in the word segmentation sequence formed from a corpus sample; the bilstm layer is used for analyzing the context semantics of each word segmentation segment according to the semantic codes of the word segmentation segments before and after it in the word segmentation sequence; and the softmax layer is used for determining the probability of an end character appearing after the word segmentation sequence according to the context semantics of each word segmentation segment.
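A minimal PyTorch sketch of the embedding -> bilstm -> softmax structure is shown below. The vocabulary size, embedding dimension, hidden size, and the choice of the last time step as the summary position are all illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class SentenceBreakModel(nn.Module):
    """Sketch of the embedding -> bilstm -> softmax structure: encode each
    word segmentation segment, read context with a bidirectional LSTM, and
    emit the probability that an end character follows the sequence."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # {no EOS, EOS}

    def forward(self, token_ids):
        x = self.embedding(token_ids)          # (batch, seq, embed_dim)
        out, _ = self.bilstm(x)                # (batch, seq, 2 * hidden_dim)
        logits = self.classifier(out[:, -1])   # last position summarizes context
        return torch.softmax(logits, dim=-1)   # [P(no EOS), P(EOS)]

model = SentenceBreakModel()
probs = model(torch.tensor([[3, 17, 42]]))     # one sequence of 3 segment ids
```

Training this sketch with cross-entropy loss and gradient descent would follow the S501-S508 procedure described above.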
As shown in fig. 6, based on the same inventive concept as the speech signal processing method, the embodiment of the present invention further provides a speech signal processing apparatus 60, which includes a speech recognition module 601, a word segmentation processing module 602, a first prediction module 603, a second prediction module 604, a determination module 605, and a parsing module 606.
The voice recognition module 601 is configured to perform voice recognition on audio stream data acquired by the intelligent device in real time to obtain a temporary recognition result;
a word segmentation processing module 602, configured to perform word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments;
a first prediction module 603, configured to input a first word segmentation sequence composed of the word segmentation segments into a trained sentence segmentation model, and determine, according to an output of the sentence segmentation model, a first prediction probability that a sentence can be segmented after the first word segmentation sequence;
a second prediction module 604, configured to obtain a second prediction probability that a next participle segment after a second participle sequence is an end character, where the second prediction probability is determined according to word frequency data, the word frequency data includes the number of times that each participle sequence appears in each corpus determined based on the corpus in the corpus, the second participle sequence is a sequence composed of the last N participle segments in the temporary recognition result, and N is a positive integer;
a determining module 605, configured to determine a third prediction probability according to the first prediction probability and the second prediction probability;
and an analysis module 606, configured to perform semantic analysis on the temporary recognition result if the third prediction probability is greater than a probability threshold.
Optionally, the second prediction module 604 is specifically configured to:
acquiring the times M corresponding to the second word segmentation sequence from the word frequency data; acquiring the frequency K corresponding to a third word segmentation sequence from the word frequency data, wherein the third word segmentation sequence is a sequence obtained by adding the end character after the second word segmentation sequence; determining the second prediction probability according to the K and the M;
or determining probability data corresponding to a second word segmentation sequence as a second prediction probability from pre-configured probability data of a next word segmentation segment after each N-element word segmentation sequence as an end character, wherein the N-element word segmentation sequence is obtained by performing word segmentation processing on the basis of corpus in the corpus, and the probability data is determined according to word frequency data corresponding to the N-element word segmentation sequence and word frequency data corresponding to an N + 1-element word segmentation sequence obtained by adding the end character after the N-element word segmentation sequence.
Optionally, the speech signal processing apparatus 60 according to the embodiment of the present invention further includes a word frequency data obtaining module, configured to:
performing word segmentation processing on each corpus in the corpus to obtain word segmentation segments corresponding to each corpus;
determining a sequence consisting of N continuous word segmentation segments in each corpus as an N-element word segmentation sequence;
determining a sequence consisting of N +1 word segmentation segments in each corpus as an N + 1-element word segmentation sequence;
and counting the occurrence frequency of each N-element word segmentation sequence and each N + 1-element word segmentation sequence in each corpus of the corpus to obtain the word frequency data.
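The counting performed by the word frequency data obtaining module can be sketched as follows, assuming each corpus entry is already a list of word segmentation segments; the corpus content is a placeholder.

```python
from collections import Counter

def word_freq_data(corpus, n):
    """Build word frequency data: count every sequence of n consecutive
    word segmentation segments and every sequence of n + 1 consecutive
    segments across all corpus entries."""
    counts = Counter()
    for segments in corpus:
        for k in (n, n + 1):
            for i in range(len(segments) - k + 1):
                counts[tuple(segments[i:i + k])] += 1
    return counts

# Hypothetical corpus entry ending in the end character "EOS".
corpus = [["introduce", "next", "Beijing", "EOS"]]
freq = word_freq_data(corpus, n=2)
```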
Alternatively, N is equal to 2.
Optionally, the corpora in the corpus are updated by:
if the third prediction probability is less than or equal to the probability threshold and a voice starting point and a voice ending point are detected in the audio stream data, adding the end character after the final recognition result corresponding to the audio stream data between the voice starting point and the voice ending point, and adding the final recognition result with the end character appended into the corpus as a newly added corpus;
or acquiring a text with complete semantics after manual intervention, adding the end character after the text, and adding the text with the end character appended into the corpus as a newly added corpus.
Optionally, the word frequency data is updated by:
if there is a newly added corpus in the corpus, performing word segmentation processing on the newly added corpus to obtain the N-element word segmentation sequences and N + 1-element word segmentation sequences corresponding to the newly added corpus;
and updating the word frequency data corresponding to the N-element word segmentation sequences and the N + 1-element word segmentation sequences corresponding to the newly added corpus.
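A minimal sketch of this incremental update, under the same illustrative naming as the counting sketch: only the newly added corpus is re-tokenized, and its window counts are folded into the existing counters, so the whole corpus never needs recounting.

```python
from collections import Counter

END = "</s>"  # hypothetical end character

def update_counts(new_text, n, tokenize, n_counts, n1_counts):
    """Fold the N-element and (N+1)-element sequences of one new corpus
    entry into the existing word frequency counters in place."""
    segments = tokenize(new_text) + [END]
    for i in range(len(segments) - n + 1):
        n_counts[tuple(segments[i:i + n])] += 1
    for i in range(len(segments) - n):
        n1_counts[tuple(segments[i:i + n + 1])] += 1

# Existing counts, then one new corpus entry arrives
n_counts = Counter({("the", "light"): 4})
n1_counts = Counter({("the", "light", END): 3})
update_counts("dim the light", 2, str.split, n_counts, n1_counts)
print(n_counts[("the", "light")])        # 5
print(n1_counts[("the", "light", END)])  # 4
```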
Optionally, the probability data corresponding to each participle sequence is updated as follows: and if the word frequency data is updated, updating probability data corresponding to each word segmentation sequence according to the updated word frequency data.
Optionally, the speech signal processing apparatus 60 according to the embodiment of the present invention further includes a word segmentation prediction module, configured to:
if the third prediction probability is less than or equal to the probability threshold, determining, according to the word frequency data, the word segmentation segment with the maximum probability of appearing after the second word segmentation sequence;
controlling the intelligent device to output the determined word segmentation segment.
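One illustrative reading of this prediction step is an argmax over all observed (N + 1)-element continuations of the second word segmentation sequence; the names and toy counts below are assumptions, not the patent's implementation:

```python
def most_likely_next(seq, n1_counts):
    """Return the segment that most frequently follows seq
    in the (N+1)-element sequence counts, or None if unseen."""
    prefix = tuple(seq)
    candidates = {gram[-1]: count for gram, count in n1_counts.items()
                  if gram[:-1] == prefix}
    if not candidates:
        return None  # no observed continuation of this sequence
    return max(candidates, key=candidates.get)

# Toy counts: after "turn on", "the" was seen 5 times and "music" twice
n1_counts = {
    ("turn", "on", "the"): 5,
    ("turn", "on", "music"): 2,
}
print(most_likely_next(["turn", "on"], n1_counts))  # the
```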
Optionally, the sentence segmentation model is a binary classification model, the binary classification model is configured to predict the probability that the next word segmentation segment after the input word segmentation sequence is an end character, and the first prediction module 603 is specifically configured to: obtain the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character, and determine the probability value as the first prediction probability.
Optionally, the sentence segmentation model is a punctuation mark model, the punctuation mark model is configured to predict probability values of punctuation marks appearing after the input word segmentation sequence, and the first prediction module 603 is specifically configured to: obtain the probability value of a designated punctuation mark appearing after the first word segmentation sequence output by the punctuation mark model, and determine the probability value as the first prediction probability.
Optionally, the first prediction module 603 is specifically configured to: if probability values of a plurality of designated punctuation marks are obtained, determine the sum of the probability values of the designated punctuation marks as the first prediction probability, or determine the maximum among the probability values of the designated punctuation marks as the first prediction probability.
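The two combination rules named above (sum versus maximum over the designated punctuation marks) can be sketched as follows; the function name is illustrative, and the cap at 1.0 for the sum rule is an added safeguard not stated in the source:

```python
def first_prediction_probability(punct_probs, mode="sum"):
    """Collapse per-punctuation probabilities (e.g. for '。', '?', '!')
    into one sentence-end probability, by sum (capped at 1.0) or by max."""
    values = list(punct_probs.values())
    if mode == "sum":
        return min(sum(values), 1.0)  # cap is an assumption, not in the source
    return max(values)

probs = {"。": 0.55, "?": 0.25, "!": 0.05}
print(first_prediction_probability(probs, "sum"))  # ~0.85
print(first_prediction_probability(probs, "max"))  # 0.55
```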
Optionally, the binary classification model is trained by:
obtaining a plurality of corpus samples and a classification label of each corpus sample, wherein the classification label is used for marking whether the corpus sample is an end character;
and training the binary classification model according to the corpus samples and the classification labels of the corpus samples.
The voice signal processing apparatus provided by the embodiment of the present invention adopts the same inventive concept as the voice signal processing method described above and can obtain the same beneficial effects, so the details are not repeated here.
Based on the same inventive concept as the voice signal processing method, an embodiment of the present invention further provides an electronic device, which may specifically be a control device or a control system inside an intelligent device, or an external device communicating with the intelligent device, such as a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 7, the electronic device 70 may include a processor 701 and a memory 702.
The memory 702 may include a Read-Only Memory (ROM) and a Random Access Memory (RAM), and provides the processor with the program instructions and data stored in the memory. In the embodiment of the present invention, the memory may be used to store the program of the above voice signal processing method.
The processor 701 may be a CPU (Central Processing Unit), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a CPLD (Complex Programmable Logic Device), and implements the voice signal processing method in any of the above embodiments according to the obtained program instructions by calling the program instructions stored in the memory.
An embodiment of the present invention provides a computer-readable storage medium for storing the computer program instructions used by the above electronic device, which includes a program for executing the above voice signal processing method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
Based on the same inventive concept as the speech signal processing method, an embodiment of the present invention provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the speech signal processing method in any of the above embodiments.
The above embodiments are described in detail only to illustrate the technical solutions of the present application; they are intended to help in understanding the method of the embodiments of the present invention and should not be construed as limiting the embodiments of the present invention. Variations or substitutions readily apparent to those skilled in the art are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A speech signal processing method, comprising:
carrying out voice recognition on audio stream data acquired by intelligent equipment in real time to obtain a temporary recognition result;
performing word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments;
inputting a first word segmentation sequence consisting of the word segmentation segments into a trained sentence segmentation model, and determining, according to the output of the sentence segmentation model, a first prediction probability that a sentence can be segmented after the first word segmentation sequence; acquiring a second prediction probability that the next word segmentation segment after a second word segmentation sequence is an end character, wherein the second prediction probability is determined according to word frequency data, the word frequency data comprises the number of times that each word segmentation sequence, determined based on the corpora in a corpus, appears in each corpus, the second word segmentation sequence is a sequence formed by the last N word segmentation segments in the temporary recognition result, and N is a positive integer;
determining a third prediction probability according to the first prediction probability and the second prediction probability;
and if the third prediction probability is larger than a probability threshold, performing semantic analysis on the temporary recognition result.
2. The method according to claim 1, wherein the acquiring of the second prediction probability that the next word segmentation segment after the second word segmentation sequence is an end character specifically comprises:
acquiring the times M corresponding to the second word segmentation sequence from the word frequency data; acquiring the frequency K corresponding to a third word segmentation sequence from the word frequency data, wherein the third word segmentation sequence is a sequence obtained by adding the end character after the second word segmentation sequence; determining the second prediction probability according to the K and the M;
or,
and determining, from pre-configured probability data that the next word segmentation segment after each N-element word segmentation sequence is an end character, the probability data corresponding to the second word segmentation sequence as the second prediction probability, wherein the N-element word segmentation sequences are obtained by performing word segmentation processing on the corpora in the corpus, and the probability data is determined according to the word frequency data corresponding to the N-element word segmentation sequence and the word frequency data corresponding to the N + 1-element word segmentation sequence obtained by adding the end character after the N-element word segmentation sequence.
3. The method according to claim 1 or 2, wherein the word frequency data is obtained by:
performing word segmentation processing on each corpus in the corpus to obtain word segmentation segments corresponding to each corpus;
determining a sequence consisting of N continuous word segmentation segments in each corpus as an N-element word segmentation sequence;
determining a sequence consisting of N + 1 continuous word segmentation segments in each corpus as an N + 1-element word segmentation sequence;
and counting the occurrence frequency of each N-element word segmentation sequence and each N + 1-element word segmentation sequence in each corpus of the corpus to obtain the word frequency data.
4. The method of claim 1, further comprising:
if the third prediction probability is less than or equal to the probability threshold, determining, according to the word frequency data, the word segmentation segment with the maximum probability of appearing after the second word segmentation sequence;
controlling the intelligent device to output the determined word segmentation segment.
5. The method according to claim 1, 2 or 4,
the sentence segmentation model is a binary classification model, the binary classification model is used for predicting the probability that the next word segmentation segment after an input word segmentation sequence is an end character, and the determining, according to the output of the sentence segmentation model, of the first prediction probability that a sentence can be segmented after the first word segmentation sequence specifically comprises: obtaining the probability value, output by the binary classification model, that the next word segmentation segment after the first word segmentation sequence is an end character, and determining the probability value as the first prediction probability;
or,
the sentence segmentation model is a punctuation mark model, the punctuation mark model is used for predicting probability values of punctuation marks appearing after an input word segmentation sequence, and the determining, according to the output of the sentence segmentation model, of the first prediction probability that a sentence can be segmented after the first word segmentation sequence specifically comprises: obtaining the probability value of a designated punctuation mark appearing after the first word segmentation sequence output by the punctuation mark model, and determining the probability value as the first prediction probability.
6. The method according to claim 5, wherein the obtaining a probability value of occurrence of a designated punctuation mark after the first word segmentation sequence output by the punctuation mark model is determined as the first prediction probability, specifically comprises:
if probability values of a plurality of designated punctuation marks are obtained, determining the sum of the probability values of the designated punctuation marks as the first prediction probability, or determining the maximum among the probability values of the designated punctuation marks as the first prediction probability.
7. The method of claim 5, wherein the binary classification model is trained by:
obtaining a plurality of corpus samples and a classification label of each corpus sample, wherein the classification label is used for marking whether the corpus sample is an end character;
and training the binary classification model according to the corpus samples and the classification labels of the corpus samples.
8. A speech signal processing apparatus, comprising:
the voice recognition module is used for carrying out voice recognition on audio stream data acquired by the intelligent equipment in real time to obtain a temporary recognition result;
the word segmentation processing module is used for carrying out word segmentation processing on the temporary recognition result to obtain a plurality of word segmentation segments;
the first prediction module is used for inputting a first word segmentation sequence consisting of the word segmentation segments into a trained sentence segmentation model, and determining a first prediction probability of a sentence segmentation after the first word segmentation sequence according to the output of the sentence segmentation model;
the second prediction module is configured to acquire a second prediction probability that the next word segmentation segment after a second word segmentation sequence is an end character, wherein the second prediction probability is determined according to word frequency data, the word frequency data comprises the number of times that each word segmentation sequence, determined based on the corpora in the corpus, appears in each corpus, the second word segmentation sequence is a sequence composed of the last N word segmentation segments in the temporary recognition result, and N is a positive integer;
a determining module, configured to determine a third prediction probability according to the first prediction probability and the second prediction probability;
and the analysis module is used for performing semantic analysis on the temporary recognition result if the third prediction probability is greater than a probability threshold.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
CN201910810339.0A 2019-08-29 2019-08-29 Voice signal processing method and device, electronic equipment and storage medium Active CN112509570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810339.0A CN112509570B (en) 2019-08-29 2019-08-29 Voice signal processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910810339.0A CN112509570B (en) 2019-08-29 2019-08-29 Voice signal processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112509570A true CN112509570A (en) 2021-03-16
CN112509570B CN112509570B (en) 2024-02-13

Family

ID=74923645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810339.0A Active CN112509570B (en) 2019-08-29 2019-08-29 Voice signal processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112509570B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600646A (en) * 2022-10-19 2023-01-13 Beijing Baidu Netcom Science Technology Co., Ltd. Language model training method, device, medium and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
US20130173251A1 (en) * 2011-12-29 2013-07-04 Hon Hai Precision Industry Co., Ltd. Electronic device and natural language analysis method thereof
CN107305575A (en) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 The punctuate recognition methods of human-machine intelligence's question answering system and device
CN107578770A (en) * 2017-08-31 2018-01-12 百度在线网络技术(北京)有限公司 Networking telephone audio recognition method, device, computer equipment and storage medium
CN107679033A (en) * 2017-09-11 2018-02-09 百度在线网络技术(北京)有限公司 Text punctuate location recognition method and device
CN109783648A (en) * 2018-12-28 2019-05-21 北京声智科技有限公司 A method of ASR language model is improved using ASR recognition result
CN109829164A (en) * 2019-02-01 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for generating text
CN109871529A (en) * 2017-12-04 2019-06-11 三星电子株式会社 Language processing method and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. Mrozinski: "Automatic Sentence Segmentation of Speech for Automatic Summarization", 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings *
Wang Boli: "A Recurrent Neural Network Based Method for Sentence Segmentation of Ancient Chinese Texts", Journal of Peking University (Natural Science Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600646A (en) * 2022-10-19 2023-01-13 Beijing Baidu Netcom Science Technology Co., Ltd. Language model training method, device, medium and equipment
CN115600646B (en) * 2022-10-19 2023-10-03 Beijing Baidu Netcom Science Technology Co., Ltd. Language model training method, device, medium and equipment

Also Published As

Publication number Publication date
CN112509570B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
WO2020258661A1 (en) Speaking person separation method and apparatus based on recurrent neural network and acoustic features
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
CN107291690B (en) Punctuation adding method and device and punctuation adding device
US9123331B1 (en) Training an automatic speech recognition system using compressed word frequencies
CN109754809B (en) Voice recognition method and device, electronic equipment and storage medium
US20220076677A1 (en) Voice interaction method, device, and storage medium
CN112927679B (en) Method for adding punctuation marks in voice recognition and voice recognition device
JP2023545988A (en) Transformer transducer: One model that combines streaming and non-streaming speech recognition
CN111435592B (en) Voice recognition method and device and terminal equipment
CN112530417B (en) Voice signal processing method and device, electronic equipment and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN112201275B (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN111160003A (en) Sentence-breaking method and device
CN111489754A (en) Telephone traffic data analysis method based on intelligent voice technology
CN111859948B (en) Language identification, language model training and character prediction method and device
CN114492426B (en) Sub-word segmentation method, model training method, device and electronic equipment
CN115312034A (en) Method, device and equipment for processing voice signal based on automaton and dictionary tree
CN112509570B (en) Voice signal processing method and device, electronic equipment and storage medium
CN111414748A (en) Traffic data processing method and device
CN113470617B (en) Speech recognition method, electronic equipment and storage device
EP4295356A1 (en) Reducing streaming asr model delay with self alignment
CN114283791A (en) Speech recognition method based on high-dimensional acoustic features and model training method
CN113724698A (en) Training method, device and equipment of speech recognition model and storage medium
CN116483960B (en) Dialogue identification method, device, equipment and storage medium
CN111785259A (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant