CN111710332A - Voice processing method and device, electronic equipment and storage medium - Google Patents

Voice processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN111710332A
Authority
CN
China
Prior art keywords
voice
speech
frame
current
length
Prior art date
Legal status
Granted
Application number
CN202010612566.5A
Other languages
Chinese (zh)
Other versions
CN111710332B (en)
Inventor
曲贺
王晓瑞
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010612566.5A priority Critical patent/CN111710332B/en
Publication of CN111710332A publication Critical patent/CN111710332A/en
Application granted granted Critical
Publication of CN111710332B publication Critical patent/CN111710332B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure relates to a voice processing method, apparatus, electronic device, and storage medium. The method comprises: acquiring a voice to be recognized, and framing the voice to be recognized to obtain a plurality of voice frames to be detected; extracting the voice feature corresponding to each voice frame to be detected; recognizing each voice feature to obtain a detection result for each voice frame to be detected; and segmenting the voice to be recognized according to the detection results to obtain a plurality of target voice segments, wherein the length of each target voice segment is less than or equal to a first threshold and the sum of the lengths of adjacent target voice segments is greater than or equal to a second threshold. Because the length of each target voice segment obtained by the method falls within the specified range, the voice recognition efficiency for each target voice segment can be improved; meanwhile, because the sum of the lengths of adjacent target voice segments is greater than or equal to the second threshold, each target voice segment retains certain context information, so the accuracy of voice recognition can be improved.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence, speech recognition has been widely applied across industries. In a speech recognition system, voice endpoint detection, also known as voice activity detection (VAD), plays an important role. A speech signal usually contains a large number of non-speech segments, such as silence and various kinds of noise, which burden the speech recognition system and severely degrade its performance. Therefore, a speech recognition system typically performs endpoint detection on the speech first: given a continuously input speech signal, it outputs the starting point and the ending point of each speech segment in the signal, so that non-speech segments such as silence and noise can be filtered out and the performance of the speech recognition system is improved.
In the related art, voice endpoint detection is often performed with deep learning. Specifically, speech features are extracted from the input speech frames; the speech features are input into a voice activity detection (VAD) classification model to obtain a classification result for each speech frame; and the starting point and the ending point of the speech are determined according to the classification results. However, in the related art, when speech recognition is performed on a segment delimited only by the detected starting point and ending point, the accuracy and the efficiency of speech recognition cannot both be achieved.
Disclosure of Invention
The present disclosure provides a speech processing method, apparatus, electronic device, and storage medium, to at least solve the problems in the related art that speech recognition is not accurate enough when a speech segment is too short, and that speech recognition is not efficient enough when a speech segment is too long. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a speech processing method, including:
acquiring a voice to be recognized, and performing framing processing on the voice to be recognized to obtain a plurality of voice frames to be detected;
extracting the voice characteristics corresponding to each voice frame to be detected;
classifying and identifying the voice characteristics corresponding to each voice frame to be detected respectively to obtain the detection result of each voice frame to be detected;
and according to the detection result, segmenting the voice to be recognized to obtain a plurality of target voice segments, wherein the length of each target voice segment is smaller than or equal to a first threshold value, and the sum of the lengths of the adjacent target voice segments is larger than or equal to a second threshold value.
In one embodiment, segmenting the speech to be recognized according to the detection result to obtain a plurality of target speech segments includes:
segmenting the voice to be recognized according to the detection result to obtain a plurality of original voice segments, wherein the length of each original voice segment is smaller than or equal to a first threshold value;
and carrying out fragment fusion on the original voice fragments to obtain a plurality of target voice fragments, wherein the sum of the lengths of the adjacent target voice fragments is greater than or equal to a second threshold value.
In one embodiment, segmenting the speech to be recognized according to the detection result to obtain a plurality of original speech segments includes:
determining a first voice frame in the current original voice segment according to the detection result, and using the first voice frame as a starting point of the current original voice segment;
determining a speech frame and a non-speech frame in the current original speech segment according to the detection result from the starting point, wherein the length of the current original speech segment is the sum of the length of the speech frame and the length of the non-speech frame;
when it is detected that the length of the current original voice segment reaches a first threshold value, or when it is detected that the length of the current original voice segment does not reach the first threshold value but the length of the non-voice frames in the current original voice segment is greater than a first value that varies with the current voice frame length,
taking the last voice frame to be detected in the current original voice segment as the end point of the current original voice segment, and repeating the above steps to obtain each original voice segment.
In one embodiment, the detection result comprises a non-speech frame probability; determining a speech frame and a non-speech frame in the current original speech segment according to the detection result, comprising:
acquiring the probability of a non-speech frame of a current to-be-detected speech frame in a current original speech fragment;
acquiring the length of a current voice frame in the updated current original voice fragment, and updating a second value which is changed along with the length of the current voice frame according to the length of the current voice frame;
and comparing the probability of the non-speech frame of the current voice frame to be detected with the second value, and determining the speech classification result of the current voice frame to be detected according to the comparison result, wherein the speech classification result comprises a speech frame and a non-speech frame.
In one embodiment, determining a speech frame and a non-speech frame in a current original speech segment according to a detection result further includes:
and when the voice classification result of the current voice frame to be detected is determined to be a voice frame, updating the length of the current voice frame in the current original voice segment, and updating the first value according to the length of the current voice frame.
In one embodiment, the larger the length of the current speech frame is, the smaller the first value is; the larger the length of the current speech frame, the smaller the second value.
In one embodiment, the segment fusion of the original speech segments to obtain a plurality of target speech segments includes:
traversing each original voice segment, and merging the adjacent original voice segments when the sum of the lengths of the adjacent original voice segments is smaller than a second threshold value;
and updating the lengths of the fused original voice segments until the sum of the lengths of all the adjacent voice segments is determined to be greater than or equal to a second threshold value, so as to obtain a plurality of target voice segments.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, comprising:
the framing module is configured to acquire the voice to be recognized and perform framing processing on the voice to be recognized to obtain a plurality of voice frames to be detected;
the feature extraction module is configured to extract voice features corresponding to each voice frame to be detected;
the classification recognition module is configured to perform classification recognition on the voice features respectively corresponding to the voice frames to be detected to obtain a detection result of each voice frame to be detected;
and the voice segment generation module is configured to execute segmentation on the voice to be recognized according to the detection result to obtain a plurality of target voice segments, wherein the length of each target voice segment is smaller than or equal to a first threshold value, and the sum of the lengths of the adjacent target voice segments is larger than or equal to a second threshold value.
In one embodiment, the speech segment generating module includes:
the voice segment segmentation module is configured to segment the voice to be recognized according to the detection result to obtain a plurality of original voice segments, and the length of each original voice segment is smaller than or equal to a first threshold;
and the segment fusion module is configured to perform segment fusion on the original voice segments to obtain a plurality of target voice segments, wherein the sum of the lengths of the adjacent target voice segments is greater than or equal to a second threshold value.
In one embodiment, the speech segment segmentation module includes:
a starting point determining unit configured to determine a first speech frame in the current original speech segment according to the detection result as a starting point of the current original speech segment;
the voice frame determining unit is configured to determine a voice frame and a non-voice frame in a current original voice segment according to a detection result from a starting point, wherein the length of the current original voice segment is the sum of the length of the voice frame and the length of the non-voice frame;
the judging unit is configured to judge whether the length of the current original voice fragment reaches a first threshold value or not, or when the current original voice fragment is detected not to reach the first threshold value, judge whether the length of a non-voice frame in the current original voice fragment is larger than a first value changed along with the length of the current voice frame or not;
and the end point determining unit is configured to take the last voice frame to be detected in the current original voice segment as the end point of the current original voice segment, and so on, to obtain each original voice segment.
In one embodiment, the sound frame determination unit includes:
the acquiring unit is configured to acquire the probability of the non-speech frame of the current to-be-detected speech frame in the current original speech fragment;
the second value updating unit is configured to execute the steps of obtaining the length of the current voice frame in the updated current original voice segment and updating a second value which changes along with the length of the current voice frame according to the length of the current voice frame;
and the comparison unit is configured to compare the probability of the non-speech frame of the current voice frame to be detected with the second value, and determine a voice classification result of the current voice frame to be detected according to the comparison result, wherein the voice classification result comprises a speech frame and a non-speech frame.
In one embodiment, the sound frame determination unit further includes:
and the first value updating unit is configured to update the length of the current voice frame in the current original voice segment and update the first value according to the length of the current voice frame when the voice classification result of the current voice frame to be detected is determined to be a voice frame.
In one embodiment, the larger the length of the current speech frame is, the smaller the first value is; the larger the length of the current speech frame is, the smaller the second value is.
In one embodiment, the segment fusion module is configured to perform:
traversing each original voice segment, and merging the adjacent original voice segments when the sum of the lengths of the adjacent original voice segments is smaller than a second threshold value;
and updating the lengths of the fused original voice segments until the sum of the lengths of all the adjacent voice segments is determined to be greater than or equal to a second threshold value, so as to obtain a plurality of target voice segments.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech processing method in any embodiment of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the speech processing method described in any one of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, such that the device performs the speech processing method described in any one of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
classifying and recognizing the voice feature corresponding to each voice frame to be detected to obtain a detection result for each voice frame to be detected; and segmenting the voice to be recognized based on the detection result of each voice frame to be detected to obtain a plurality of target voice segments. Because the length of each target voice segment is within the specified length range, the voice recognition efficiency of each target voice segment can be improved even when the voice to be recognized is long; meanwhile, because the sum of the lengths of adjacent target voice segments is greater than or equal to the second threshold, each target voice segment retains certain context information even when it is short, so the accuracy of voice recognition can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating an application environment for a method of speech processing, according to an example embodiment.
FIG. 2 is a diagram illustrating an application environment for a method of speech processing according to another exemplary embodiment.
FIG. 3 is a flow diagram illustrating a method of speech processing according to an example embodiment.
FIG. 4 is a flowchart illustrating steps for generating a target speech segment in accordance with an exemplary embodiment.
FIG. 5 is a flowchart illustrating one step of generating an original speech segment in accordance with an exemplary embodiment.
FIG. 6 is a flowchart illustrating a step of obtaining speech frames and non-speech frames in accordance with an exemplary embodiment.
FIG. 7 is a flowchart illustrating a step of fusing speech segments according to an exemplary embodiment.
FIG. 8 is a flow diagram illustrating a method of speech processing according to an example embodiment.
FIG. 9 is a flowchart illustrating a step of obtaining original speech segments by segmentation, according to an exemplary embodiment.
FIG. 10 is a flowchart illustrating a step of fusing speech segments according to an exemplary embodiment.
FIG. 11 is a block diagram illustrating a speech processing apparatus according to an example embodiment.
Fig. 12 is an internal block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The speech processing method provided by the present disclosure can be applied to the application environment shown in fig. 1, in which the audio capture device 110 and the terminal 120 are interconnected. The audio capture device 110 may be a stand-alone device or a built-in component of the terminal 120. Deployed in the terminal 120 are: a preprocessing system for framing the speech to be recognized and extracting features; a trained deep learning network for classifying and recognizing the voice feature corresponding to each voice frame to be detected to obtain a detection result for each voice frame to be detected; and an executable file for segmenting the voice to be recognized according to the detection result of each voice frame to be detected. Specifically, the terminal 120 obtains the voice to be recognized from the audio capture device 110; the terminal 120 frames the voice to be recognized to obtain a plurality of voice frames to be detected; extracts the voice feature corresponding to each voice frame to be detected; classifies and recognizes the voice feature corresponding to each voice frame to be detected to obtain a detection result for each voice frame to be detected; and segments the voice to be recognized according to the detection results to obtain a plurality of target voice segments, wherein the length of each target voice segment is less than or equal to a first threshold and the sum of the lengths of adjacent target voice segments is greater than or equal to a second threshold. The audio capture device 110 may be, but is not limited to, various microphones, recording devices, and the like, and the terminal 120 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
In another exemplary embodiment, the speech processing method provided by the present disclosure can also be applied to the application environment shown in fig. 2. Wherein the terminal 210 and the server 220 interact through a network. The preprocessing system, the deep learning network, the executable file, and the like for voice processing may be deployed in the terminal 210, and may also be deployed in the server 220. For example, deployed in server 220. The user may trigger a request for voice processing through the terminal 210 to cause the server 220 to perform voice processing according to the request for voice processing. For example, for the speech recognition field, the user may input a speech to be recognized through the terminal 210; the terminal 210 sends the voice to be recognized to the server 220, and the server 220 automatically processes the voice to be recognized sent by the terminal 210 to obtain the target voice fragment. Further, the server 220 may perform voice recognition according to the obtained target voice segment, and send a recognition result obtained by the voice recognition to the terminal 210 for displaying. The terminal 210 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 220 may be implemented by an independent server or a server cluster formed by a plurality of servers.
Fig. 3 is a flowchart illustrating a voice processing method according to an exemplary embodiment. As shown in fig. 3, the method is applied to the server 220 and includes the following steps.
In step S310, a speech to be recognized is obtained, and the speech to be recognized is subjected to framing processing to obtain a plurality of frames to be detected.
In step S320, the speech features corresponding to each frame to be detected are extracted.
The voice to be recognized refers to the voice on which voice processing is to be performed. Specifically, after the voice to be recognized is obtained, framing and feature extraction are performed on it to obtain the voice features corresponding to the voice to be recognized. The framing and feature extraction of the speech to be recognized can be implemented as follows. First, the speech to be recognized is pre-emphasized by a high-pass filter. Because a speech signal is short-time stationary, it can be divided into frames by time step, where each time step is called a frame and the step corresponding to each frame can be a preset value, for example, any value between 20 ms and 30 ms. To avoid excessive variation between two adjacent frames, an overlap region may be provided between them. Each frame is then windowed to increase the continuity of its left and right ends, for example using a 25 ms window shifted every 10 ms. Next, a Fourier transform is applied to the windowed speech signal to obtain a spectrogram, and the spectrogram is filtered to make it more compact. Finally, the speech feature corresponding to each frame to be detected can be obtained by spectral or cepstral analysis.
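For ease of understanding, the following is a minimal Python sketch of the framing and feature-extraction pipeline described above (pre-emphasis, 25 ms windows shifted every 10 ms, Fourier transform, log-spectral features). It assumes the voice to be recognized is a one-dimensional NumPy array of samples at 16 kHz; the function name, the parameter values, and the choice of a plain log power spectrum instead of a mel-filterbank or cepstral front end are illustrative assumptions rather than part of the disclosed method.

```python
import numpy as np

def extract_frame_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    # Pre-emphasis with a first-order high-pass filter
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms window
    hop_len = int(sample_rate * hop_ms / 1000)       # 10 ms shift, so adjacent frames overlap
    window = np.hamming(frame_len)

    features = []
    for start in range(0, len(emphasized) - frame_len + 1, hop_len):
        frame = emphasized[start:start + frame_len] * window    # windowing
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # one spectrogram column
        features.append(np.log(power + 1e-10))                  # log compression
    return np.stack(features)                                   # shape: (n_frames, n_fft // 2 + 1)
```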
In step S330, the speech features corresponding to each frame to be detected are classified and identified to obtain a detection result of each frame to be detected.
Specifically, the trained deep learning network may be used to classify and recognize the speech feature corresponding to each frame to be detected. The deep learning network may be any network that can be used for speech feature classification, such as a recurrent neural network, a convolutional neural network, or a combination of the two. After the voice feature corresponding to each voice frame to be detected is obtained, the trained deep learning network is used to recognize it and obtain a detection result. The detection result includes the probability of each category. There may be multiple categories, including but not limited to speech, silence, and noise.
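As an illustrative sketch only, the per-frame classification step can be pictured as follows; the `predict_proba` interface and the class ordering are assumptions about the trained deep learning network, which the disclosure does not restrict to any particular implementation.

```python
import numpy as np

def classify_frames(features, model):
    # `model` is assumed to expose predict_proba(features) returning, for each
    # frame, a probability distribution over the classes (speech, silence, noise).
    probs = model.predict_proba(features)        # shape: (n_frames, n_classes)
    speech_prob = probs[:, 0]                    # assume column 0 is the speech class
    non_speech_prob = 1.0 - speech_prob          # silence and noise combined
    return speech_prob, non_speech_prob
```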
In step S340, according to the detection result, the speech to be recognized is segmented to obtain a plurality of target speech segments, where the length of each target speech segment is less than or equal to the first threshold, and the sum of the lengths of adjacent target speech segments is greater than or equal to the second threshold.
Wherein the first threshold and the second threshold may be pre-configured values. The first threshold may be characterized by time or by the number of audio frames, which is not limited herein. The first threshold and the second threshold may be equal or unequal, depending on the actual situation. Specifically, after the detection result of each frame to be detected is obtained, the category to which each frame to be detected belongs can be determined according to the detection result, and then the speech frame in the frame to be detected is located. And dividing the voice to be recognized according to the distribution of the voice frames to obtain a plurality of target voice segments with the lengths smaller than or equal to a first threshold value and the sum of the adjacent lengths larger than or equal to a second threshold value.
In the voice processing method, the detection result of each voice frame to be detected is obtained by classifying and recognizing the voice feature corresponding to each voice frame to be detected; the voice to be recognized is then segmented based on the detection result of each voice frame to be detected to obtain a plurality of target voice segments. Because the length of each target voice segment is within the specified length range, the voice recognition efficiency of each target voice segment can be improved even when the voice to be recognized is long; meanwhile, because the sum of the lengths of adjacent target voice segments is greater than or equal to the second threshold, each target voice segment retains certain context information even when it is short, so the accuracy of voice recognition can be improved.
In an exemplary embodiment, as shown in fig. 4, in step S340, according to the detection result, the speech to be recognized is segmented to obtain a plurality of target speech segments, which may be implemented by the following steps:
in step S410, the speech to be recognized is segmented according to the detection result to obtain a plurality of original speech segments, and the length of the original speech segment is smaller than or equal to the first threshold.
Specifically, after the detection result of each frame to be detected output by the deep learning network is obtained, each frame to be detected can be identified according to the detection result and its category determined, the categories including but not limited to speech frame and non-speech frame. The plurality of frames to be detected are then segmented according to these category identification results to obtain a plurality of original speech segments. Illustratively, if it is determined from the detection results that a run of consecutive frames to be detected are all speech frames, or that the proportion of speech frames in the run is greater than a certain proportion, and the length of the run does not exceed the first threshold, the run can be segmented directly as one original speech segment. When a run of consecutive frames to be detected is determined to consist of speech frames but its length is greater than the first threshold, the run can be divided into a plurality of original speech segments, each of which does not exceed the first threshold in length. Further, after a plurality of original speech segments are obtained, the first frame to be detected of each original speech segment may be used as the starting point of that segment, and the last frame to be detected of each original speech segment may be used as its ending point.
In step S420, segment fusion is performed on the original voice segments to obtain a plurality of target voice segments, wherein the sum of the lengths of adjacent target voice segments is greater than or equal to a second threshold.
Specifically, after a plurality of original voice segments are obtained, the original voice segments are fused according to the lengths of the adjacent original voice segments until the sum of the lengths of all the adjacent voice segments is greater than or equal to a second threshold value, and a target voice segment is obtained.
In this embodiment, a speech-length-penalty-based method is first used so that the length of each obtained original voice segment is within the specified length range, which can improve the voice recognition efficiency of each voice segment; then, by fusing voice segments, the sum of the lengths of adjacent voice segments is made to exceed the second threshold, so that the voice segments retain certain context information and the accuracy of voice recognition can be improved.
In an exemplary embodiment, as shown in fig. 5, in step S410, segmenting the speech to be recognized according to the detection result to obtain a plurality of original speech segments, which may be implemented by the following steps:
in step S411, a first speech frame in the current original speech segment is determined according to the detection result, and is used as a starting point of the current original speech segment.
In step S412, from the starting point, the speech frame and the non-speech frame in the current original speech segment are determined according to the detection result, and the length of the current original speech segment is the sum of the length of the speech frame and the length of the non-speech frame.
In step S413, when it is detected that the length of the current original speech segment reaches the first threshold, or when it is detected that the length of the current original speech segment does not reach the first threshold, but the length of a non-speech frame in the current original speech segment is greater than the first value varying with the length of the current speech frame, step S414 is executed.
In step S414, the last frame to be detected in the current original speech segment is used as the end point of the current original speech segment, and so on to obtain each original speech segment.
The first value is changed along with the length of the current voice frame and is used for judging whether the length of the non-voice frame in the current original voice segment meets the preset requirement or not. The first value may be obtained by a pre-configured first function. And when the length of the current voice frame changes, recalculating the first function according to the length of the current voice frame, and taking the obtained value as a first value. Specifically, the class probability of the speech feature corresponding to each frame to be detected can be output through the deep learning network. The class comprises a speech frame and a non-speech frame, and then the speech frame probability and the non-speech frame probability can be obtained. And determining whether the voice frame to be detected is a voice frame or a non-voice frame according to the voice frame probability and the non-voice frame probability. For example, when the probability of a speech frame is greater than 0.6, it is determined that the speech frame to be detected is a speech frame. For the current original speech segment, the first speech frame detected can be used as the starting point of the current original speech segment.
And from the starting point, sequentially judging the category of the voice frame to be detected according to the detection result of the voice frame to be detected. When a voice frame to be detected is detected, the length of the voice frame or the length of the non-voice frame in the current original voice segment can be updated in real time. For example, if the current voice frame to be detected is determined to be a voice frame, the length of the voice frame is updated, the length of the non-voice frame is kept unchanged, and meanwhile, a first function is calculated according to the length of the updated voice frame to obtain a first value; and if the current voice frame to be detected is determined to be a non-voice frame, updating the length of the non-voice frame, and keeping the length of the voice frame unchanged. And taking the sum of the obtained speech frame length and the obtained non-speech frame length as the length of the current original speech segment. And comparing the length of the current original voice segment with a first threshold value, and outputting the original voice segment when the length of the current original voice segment is detected to reach the first threshold value. Or, when the length of the current original voice segment is detected not to reach the first threshold value, but the length of the non-voice frame in the current original voice segment is larger than the first value which is changed along with the length of the current voice frame, the current original voice segment is also output. And taking the last voice frame to be detected of the current original voice segment as an end point. And repeating the steps until the last frame in the plurality of frames to be detected is detected to obtain a plurality of original voice fragments.
In this embodiment, the length of the current original speech segment is updated in real time according to the category of the frame to be detected by sequentially determining the category of each frame to be detected. When the length of the current original voice fragment reaches a first threshold value, or the length of the current original voice fragment does not reach the first threshold value, but the length of a non-voice frame in the current original voice fragment is larger than the first value, segmenting to obtain the current original voice fragment. On one hand, the length of the obtained original voice fragment can be in a specified length range; on the other hand, each original voice segment is allowed to contain a certain number of non-voice frames, so that the efficiency of segmenting the voice segments can be improved.
In an exemplary embodiment, as shown in fig. 6, the detection result includes a non-speech frame probability; in step S412, the speech frame and the non-speech frame in the current original speech segment are determined according to the detection result, which can be implemented by the following steps:
in step S4121, the probability of the non-speech frame of the current to-be-detected speech frame in the current original speech segment is obtained.
In step S4122, the current speech frame length in the updated current original speech segment is obtained, and the second value that varies with the current speech frame length is updated according to the current speech frame length.
In step S4123, the probability of the non-speech frame of the current to-be-detected sound frame is compared with the second value, and the speech classification result of the current to-be-detected sound frame is determined according to the comparison result, where the speech classification result includes a speech frame and a non-speech frame.
The second value varies with the current voice frame length and is used to judge whether the current frame is a speech frame or a non-speech frame. The second value may be obtained from a second function configured in advance: whenever the current voice frame length changes, the second function is recalculated according to the current voice frame length, and the resulting value is taken as the second value. Specifically, in the related art, the class probability output by the deep learning network is usually compared with a fixed threshold to determine the class of the object to be detected. However, the detection result output by the deep learning network usually carries a certain error; therefore, in this embodiment, the second function for determining the category of the frame to be detected is configured in advance, which improves the accuracy of the category determination. The value of the second function depends on the current speech frame length in the current original speech segment. Preferably, the larger the current speech frame length is, the smaller the value of the second function is. When the category of the current voice frame to be detected is judged, the current voice frame length in the current original voice segment is obtained, and the value of the second function is updated according to that length. The non-speech frame probability of the current voice frame to be detected is then compared with the value of the second function, and the category of the current voice frame to be detected is determined from the comparison result. The category of the current frame to be detected can be determined by the formula x > P(L), where x represents the non-speech frame probability of the current voice frame to be detected, L represents the current speech frame length in the current original speech segment, and P(L) represents the second function. If x is greater than P(L), the current voice frame to be detected is a non-speech frame; otherwise, it is a speech frame. These steps are repeated until the length of the current original voice segment reaches the first threshold, or until it is detected that the length of the current original voice segment does not reach the first threshold but the length of the non-voice frames in it is greater than the first value.
For the second value, an initial value may be set, and when a starting point in the current original speech segment (i.e., the first speech frame) is determined, the probability of the non-speech frame of the current speech frame to be detected is compared with the initial value, so as to determine the first speech frame in the current original speech segment.
In this embodiment, the type of the current voice frame to be detected is dynamically determined according to the second value, which is preset by the second value whose value is variable along with the length of the current voice frame in the current original voice segment, so that the recognition accuracy of the current voice frame to be detected can be improved, and the performance of voice recognition can be improved in an auxiliary manner.
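A minimal sketch of this dynamic decision rule is given below; the linear, decreasing form of the second function P(L) and the particular constants are illustrative assumptions (the disclosure only requires that the second value vary with the current speech frame length).

```python
def is_non_speech(non_speech_prob, current_speech_len, a=0.002, b=0.9, floor=0.5):
    # Second function P(L): decreases as the current speech frame length L grows;
    # the constants a, b and the lower clip are assumed values for illustration.
    threshold = max(floor, b - a * current_speech_len)
    # Decision rule x > P(L): True means the frame is classified as non-speech.
    return non_speech_prob > threshold
```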
In an exemplary embodiment, in step S412, determining a speech frame and a non-speech frame in the current original speech segment according to the detection result, further includes: and when the voice classification result of the current voice frame to be detected is determined to be a voice frame, updating the length of the current voice frame in the current original voice segment, and updating the first value according to the length of the current voice frame.
Specifically, for the current original voice segment, when it is detected that the length of the current original voice segment reaches a first threshold value, or when it is detected that the length of the current original voice segment does not reach the first threshold value but the length of the non-voice frames in it is greater than a first value, the current original voice segment is output. The first value varies dynamically with the current voice frame length in the current original voice segment. Preferably, the larger the current voice frame length is, the smaller the first value is. When the current voice frame to be detected is detected, the current voice frame length in the current original voice segment is obtained, and the first value is updated according to the current voice frame length. In this embodiment, by pre-configuring the first value that varies with the current voice frame length in the current original voice segment, and dynamically controlling the length of the non-voice frames in the current original voice segment according to the first value, the proportion of voice frames in the original voice segment can be increased, which in turn helps improve the performance of voice recognition.
In an exemplary embodiment, as shown in fig. 7, in step S420, segment fusion is performed on the original speech segments to obtain a plurality of target speech segments, which can be implemented by the following steps:
in step S421, each original speech segment is traversed, and when the sum of the lengths of the adjacent original speech segments is determined to be smaller than the second threshold, the adjacent original speech segments are merged.
In step S422, the lengths of the fused original speech segments are updated until it is determined that the sum of the lengths of all adjacent speech segments is greater than or equal to the second threshold, so as to obtain a plurality of target speech segments.
Specifically, after a plurality of original voice segments are obtained, the plurality of original voice segments are subjected to cyclic traversal, the sum of the lengths of the current original voice segment and the adjacent original voice segments thereof is calculated, and the calculated sum of the lengths is compared with a second threshold value. And if the sum of the lengths is smaller than a second threshold value, fusing the two adjacent original voice segments to obtain a fused original voice segment, and updating the length, the starting point and the ending point of the fused original voice segment. And repeatedly traversing the fused voice segments until the sum of the lengths of all the adjacent voice segments is greater than or equal to a second threshold value to obtain a plurality of target voice segments. In the embodiment, the voice segments are fused by adopting the voice segment fusion method, so that the target voice segment has certain context information, and the accuracy of voice recognition can be improved.
FIG. 8 is a flowchart illustrating a particular method of speech processing, according to an exemplary embodiment, as shown in FIG. 8, including the following steps.
In step S810, a speech to be recognized is obtained, and the speech to be recognized is subjected to framing processing to obtain a plurality of frames to be detected.
In step S820, the speech features corresponding to each frame to be detected are extracted.
In step S830, the speech features are input to the deep learning network, and the speech frame probability and the non-speech frame probability of each frame to be detected are obtained. The deep learning network may be a neural network.
In step S840, the speech to be recognized is segmented according to the probability of the non-speech frame of each frame to be detected, so as to obtain a plurality of original speech segments, where the length of each original speech segment is less than or equal to the first threshold.
As shown in fig. 9, step S840 may be implemented by the following steps.
In step S841, a first speech frame in the current original speech segment is determined as a starting point of the current original speech segment according to the probability of the non-speech frame of the speech frame to be detected.
For the starting point of the first original speech segment, the probability of each speech frame to be detected is compared with a threshold value; when N consecutive speech frames to be detected are determined to be speech frames, or the proportion of speech frames among them is greater than a specified threshold, the first of these N consecutive speech frames to be detected can be used as the starting point of the first original speech segment. Detection then proceeds sequentially from this first starting point. Determining the starting point of the first original speech segment in this way saves recognition time on the frames to be detected when the beginning of the speech to be recognized contains a long silent segment.
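For illustration, the starting point of the first original speech segment described above could be located as in the following sketch; the run length N, the per-frame threshold, and the required speech ratio are assumed values rather than values given by the disclosure.

```python
import numpy as np

def find_first_start(speech_prob, n=10, frame_threshold=0.5, min_ratio=0.8):
    # Scan for the first run of N consecutive frames in which the fraction of
    # frames classified as speech reaches the required ratio.
    is_speech = np.asarray(speech_prob) > frame_threshold
    for start in range(len(is_speech) - n + 1):
        if is_speech[start:start + n].mean() >= min_ratio:
            return start          # index of the first speech frame of the run
    return None                   # no speech found in the input
```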
In step S842, starting from the starting point, the current speech frame length in the updated current original speech segment is obtained, and the first value changing with the current speech frame length and the second value changing with the current speech frame length are updated according to the current speech frame length.
The first value may be obtained by a first function configured in advance, and the second value may be obtained by a second function configured in advance. In FIG. 9, T(L) represents the first function and P(L) represents the second function. T(L) and P(L) are functions that vary with the current speech frame length L. T(L) and P(L) may be expressed as linear functions, for example T(L) = AL + B, where A and B are constants chosen according to the actual situation, and P(L) = aL + b, where a and b are constants chosen according to the actual situation. L represents the current speech frame length, and its initial value is 0.
In step S843, the probability of the non-speech frame of the current speech frame to be detected is compared with the second value, and the category of the current speech frame to be detected in the current original speech segment is determined. The category of the current voice frame to be detected comprises a voice frame and a non-voice frame. If the speech frame is a speech frame, go to step S844; if the non-speech frame is detected, step S846 is performed.
Specifically, the category of the current frame to be detected can be determined by the formula x > P(L), where x represents the non-speech frame probability of the current voice frame to be detected. If x is greater than P(L), the current voice frame to be detected is a non-speech frame; otherwise, it is a speech frame.
In step S844, if the current frame to be detected is a speech frame, the length L of the speech frame in the current original speech segment is updated. That is, when the current voice frame to be detected is detected to be a speech frame, L is incremented by 1.
In step S845, the length of the current original speech segment is compared with the first threshold. The length of the current original speech segment is the sum of the current speech frame length and the non-speech frame length. If the length of the current original speech segment reaches the first threshold, step S848 is performed to output the length, starting point, and ending point of the current original speech segment; otherwise, step S842 is performed again to evaluate the next frame to be detected.
In step S846, if the current to-be-detected speech frame is a non-speech frame, the length S of the non-speech frame in the current original speech segment is updated. The non-speech frame length S may initially be 0; when the current speech frame to be detected is detected to be a non-speech frame, S is incremented by 1.
In step S847, the non-speech frame length S is compared with a first value. If the length S of the non-speech frame is greater than the first value, go to step S848; otherwise, step S845 is executed.
In step S848, the current original speech segment is output, along with the start point, end point, and length of the current original speech segment.
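The per-frame segmentation loop of steps S841-S848 can be summarised by the following Python sketch. The first threshold is expressed in frames, and the concrete linear forms of the second function P(L) and the first function T(L), both decreasing in the current speech frame length L, are illustrative assumptions; only the overall structure follows the steps above.

```python
def segment_speech(non_speech_prob, first_threshold=1500,
                   p=lambda L: max(0.5, 0.9 - 0.002 * L),    # second function P(L), assumed
                   t=lambda L: max(20.0, 80.0 - 0.05 * L)):  # first function T(L), assumed
    segments, i, n = [], 0, len(non_speech_prob)
    while i < n:
        # Skip leading non-speech frames; the first speech frame becomes the start point.
        while i < n and non_speech_prob[i] > p(0):
            i += 1
        if i >= n:
            break
        start, L, S = i, 0, 0          # start point, speech length L, non-speech length S
        while i < n:
            if non_speech_prob[i] > p(L):   # x > P(L): non-speech frame
                S += 1
            else:                           # otherwise: speech frame
                L += 1
            i += 1
            # Cut when the segment reaches the first threshold, or when the
            # non-speech length exceeds the first value T(L).
            if L + S >= first_threshold or S > t(L):
                break
        segments.append((start, i - 1, L + S))   # (start point, end point, length in frames)
    return segments
```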
In step S850, segment fusion is performed on the plurality of original voice segments to obtain a plurality of target voice segments, where the sum of the lengths of adjacent target voice segments is greater than or equal to a second threshold.
As shown in fig. 10, step S850 may be specifically implemented by the following steps. Assume that a plurality of original speech segments S1, S2, ..., Sn are obtained through steps S841-S848.
In step S851, it is determined whether the number of original speech segments is less than 2. If it is less than 2, the fusion process ends; otherwise, the process continues to step S852.
In step S852, each original speech segment is traversed to find adjacent original speech segments Si and S(i+1) whose combined length is smaller than the second threshold.
In step S853, it is determined whether i is less than n-1. If so, step S854 is performed; otherwise, the fusion process ends.
In step S854, the adjacent original speech segments Si and S(i+1) are merged into a new Si.
Illustratively, for the ith original speech segment Si and its adjacent segment S(i+1), if the sum of the lengths of Si and S(i+1) is smaller than the second threshold, Si and S(i+1) are merged into a new Si, and the sequence numbers of the remaining original speech segments are updated accordingly. The starting point of the new Si is the starting point of Si before merging, the end point of the new Si is the end point of S(i+1) before merging, and the length of the new Si is the sum of the lengths of Si and S(i+1) before merging. These steps are repeated until the sum of the lengths of all adjacent speech segments is greater than or equal to the second threshold or the number of speech segments is less than 2, and a plurality of target speech segments are obtained.
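The fusion loop of steps S851-S854 can be sketched as follows; the segments are the (start point, end point, length) tuples produced by the segmentation step, and the value of the second threshold (in frames) is an assumption for illustration.

```python
def fuse_segments(segments, second_threshold=300):
    segs = list(segments)
    merged = True
    while merged and len(segs) >= 2:
        merged = False
        for i in range(len(segs) - 1):
            start_i, _, len_i = segs[i]
            _, end_next, len_next = segs[i + 1]
            if len_i + len_next < second_threshold:
                # Merge Si and S(i+1) into a new Si: keep the start of Si,
                # the end of S(i+1), and add the lengths together.
                segs[i] = (start_i, end_next, len_i + len_next)
                del segs[i + 1]
                merged = True
                break            # re-traverse the updated list after each merge
    return segs
```

Used together with the segmentation sketch above, this yields segments whose individual lengths stay below the first threshold while adjacent segments jointly meet the second threshold, mirroring the properties stated in step S850.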
It should be understood that although the various steps in the flowcharts of figs. 1-10 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-10 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 11 is a block diagram illustrating a speech processing apparatus according to an exemplary embodiment. Referring to fig. 11, the apparatus includes a framing module 1101, a feature extraction module 1102, a classification recognition module 1103, and a speech segment generation module 1104.
the framing module 1101 is configured to perform acquisition of a voice to be recognized, perform framing processing on the voice to be recognized, and obtain a plurality of frames to be detected;
a feature extraction module 1102 configured to extract a speech feature corresponding to each frame to be detected;
a classification recognition module 1103 configured to perform classification recognition on the speech features respectively corresponding to each frame to be detected, so as to obtain a detection result of each frame to be detected;
the voice segment generating module 1104 is configured to perform segmentation on the voice to be recognized according to the detection result to obtain a plurality of target voice segments, where the length of each target voice segment is smaller than or equal to a first threshold, and the sum of the lengths of adjacent target voice segments is greater than or equal to a second threshold.
In an exemplary embodiment, the speech segment generation module 1104 includes: the voice segment segmentation module is configured to segment the voice to be recognized according to the detection result to obtain a plurality of original voice segments, and the length of each original voice segment is smaller than or equal to a first threshold; and the segment fusion module is configured to perform segment fusion on the original voice segments to obtain a plurality of target voice segments, wherein the sum of the lengths of the adjacent target voice segments is greater than or equal to a second threshold value.
In an exemplary embodiment, the speech segment segmentation module includes: a starting point determining unit configured to determine, according to the detection results, a first speech frame in the current original speech segment as a starting point of the current original speech segment; a speech frame determining unit configured to determine, starting from the starting point and according to the detection results, the speech frames and non-speech frames in the current original speech segment, where the length of the current original speech segment is the sum of the length of the speech frames and the length of the non-speech frames; a judging unit configured to judge whether the length of the current original speech segment reaches a first threshold, or, when it is detected that the length of the current original speech segment does not reach the first threshold, to judge whether the length of the non-speech frames in the current original speech segment is greater than a first value that varies with the length of the current speech frames; and an end point determining unit configured to, when it is detected that the length of the current original speech segment reaches the first threshold, or when it is detected that the length of the current original speech segment does not reach the first threshold but the length of the non-speech frames in the current original speech segment is greater than the first value, take the last speech frame to be detected in the current original speech segment as the end point of the current original speech segment, and so on, to obtain each original speech segment.
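A minimal Python sketch of this splitting stage follows. It treats the detection result as a per-frame non-speech probability, uses a fixed 0.5 cutoff in place of the adaptive second-value comparison described below, counts the accumulated non-speech frames of the current segment as the non-speech length, and uses one possible decreasing schedule for the first value. All of these choices, and every name (split_into_original_segments, first_threshold, base_silence, min_silence), are assumptions: the disclosure only requires that the first value decrease as the length of the current speech frames increases.

```python
def split_into_original_segments(non_speech_probs, first_threshold,
                                 base_silence=80, min_silence=20):
    """Split per-frame non-speech probabilities into original speech
    segments, each returned as a half-open (start, end) frame range."""
    def first_value(speech_len):
        # Allowed non-speech length shrinks as the segment accumulates speech frames.
        return max(min_silence, base_silence - speech_len // 10)

    segments, i, n = [], 0, len(non_speech_probs)
    while i < n:
        # Starting point: the first frame classified as a speech frame.
        while i < n and non_speech_probs[i] >= 0.5:
            i += 1
        if i >= n:
            break
        start, speech_len, silence_len = i, 0, 0
        while i < n:
            if non_speech_probs[i] < 0.5:
                speech_len += 1
            else:
                silence_len += 1
            i += 1
            # End point: the segment reached the first threshold, or the
            # non-speech length exceeded the (shrinking) first value.
            if (speech_len + silence_len >= first_threshold
                    or silence_len > first_value(speech_len)):
                break
        segments.append((start, i))
    return segments
```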
In an exemplary embodiment, the speech frame determining unit includes: an acquiring unit configured to acquire the non-speech-frame probability of the current speech frame to be detected in the current original speech segment; a second value updating unit configured to acquire the updated length of the current speech frames in the current original speech segment and to update, according to that length, a second value that varies with the length of the current speech frames; and a comparison unit configured to compare the non-speech-frame probability of the current speech frame to be detected with the second value and to determine, according to the comparison result, a speech classification result of the current speech frame to be detected, the speech classification result being either a speech frame or a non-speech frame.
In an exemplary embodiment, the speech frame determining unit further includes: a first value updating unit configured to, when the speech classification result of the current speech frame to be detected is determined to be a speech frame, update the length of the current speech frames in the current original speech segment and update the first value according to that length.
In an exemplary embodiment, the larger the length of the current speech frame is, the smaller the first value is; the larger the length of the current speech frame, the smaller the second value.
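By way of example only, the following Python sketch shows one way the per-frame decision and the adaptive second value could be realized. The linear decay schedule and every name here (classify_frame, base_second, min_second, decay) are assumptions; the disclosure only states that both the first value and the second value decrease as the length of the current speech frames increases.

```python
def classify_frame(non_speech_prob, speech_len,
                   base_second=0.9, min_second=0.5, decay=0.002):
    """Decide whether the current frame to be detected is a speech frame.

    The second value is a threshold on the non-speech probability that shrinks
    as speech_len (the current speech-frame length) grows, so a long run of
    speech makes it easier to declare the next frame non-speech and close the
    segment. Returns the classification and the updated speech-frame length.
    """
    second_value = max(min_second, base_second - decay * speech_len)
    is_speech = non_speech_prob <= second_value
    if is_speech:
        # A speech frame updates the speech length, from which the first value
        # used by the splitting loop is then re-derived.
        speech_len += 1
    return is_speech, speech_len
```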
In an exemplary embodiment, the segment fusion module is configured to: traverse each original speech segment and, when the sum of the lengths of adjacent original speech segments is determined to be smaller than the second threshold, merge the adjacent original speech segments; and update the lengths of the fused original speech segments until the sum of the lengths of every pair of adjacent speech segments is determined to be greater than or equal to the second threshold, so as to obtain the plurality of target speech segments.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 12 is a block diagram illustrating an electronic device 1200 for speech processing in accordance with an example embodiment. For example, the electronic device 1200 may be a server. Referring to fig. 12, electronic device 1200 includes a processing component 1220 that further includes one or more processors, and memory resources, represented by memory 1222, for storing instructions, such as application programs, that are executable by processing component 1220. The application programs stored in memory 1222 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1220 is configured to execute instructions to perform the methods of speech processing described above.
The electronic device 1200 may also include a power supply component 1224 configured to perform power management of the electronic device 1200, a wired or wireless network interface 1226 configured to connect the electronic device 1200 to a network, and an input output (I/O) interface 1228. The electronic device 1200 may operate based on an operating system stored in the memory 1222, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a storage medium comprising instructions, such as memory 1222 comprising instructions, executable by a processor of electronic device 1200 to perform the above-described method is also provided. The storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of speech processing, comprising:
acquiring a voice to be recognized, and performing framing processing on the voice to be recognized to obtain a plurality of voice frames to be detected;
extracting the voice characteristics corresponding to each voice frame to be detected;
classifying and identifying the voice characteristics corresponding to each voice frame to be detected respectively to obtain the detection result of each voice frame to be detected;
and segmenting the voice to be recognized according to the detection result to obtain a plurality of target voice segments, wherein the length of each target voice segment is smaller than or equal to a first threshold value, and the sum of the lengths of the adjacent target voice segments is larger than or equal to a second threshold value.
2. The speech processing method according to claim 1, wherein the segmenting the speech to be recognized according to the detection result to obtain a plurality of target speech segments comprises:
segmenting the voice to be recognized according to the detection result to obtain a plurality of original voice segments, wherein the length of each original voice segment is smaller than or equal to a first threshold value;
and carrying out fragment fusion on the original voice fragments to obtain a plurality of target voice fragments, wherein the sum of the lengths of the adjacent target voice fragments is greater than or equal to the second threshold value.
3. The speech processing method according to claim 2, wherein the segmenting the speech to be recognized according to the detection result to obtain a plurality of original speech segments comprises:
determining a first voice frame in the current original voice segment according to the detection result, wherein the first voice frame is used as a starting point of the current original voice segment;
determining a speech frame and a non-speech frame in the current original speech segment according to the detection result from the starting point, wherein the length of the current original speech segment is the sum of the length of the speech frame and the length of the non-speech frame;
when it is detected that the length of the current original voice segment reaches a first threshold value, or when it is detected that the length of the current original voice segment does not reach the first threshold value but the length of the non-voice frames in the current original voice segment is larger than a first value that varies with the length of the current voice frames,
taking the last voice frame to be detected in the current original voice segment as an end point of the current original voice segment, and so on, to obtain each original voice segment.
4. The speech processing method according to claim 3, wherein the detection result comprises a probability of a non-speech frame; determining a speech frame and a non-speech frame in the current original speech segment according to the detection result, comprising:
acquiring the probability of a non-speech frame of a current to-be-detected speech frame in the current original speech fragment;
acquiring the updated current voice frame length in the current original voice fragment, and updating a second value which is changed along with the length of the current voice frame according to the current voice frame length;
and comparing the probability of the non-speech frame of the current voice frame to be detected with the second value, and determining a speech classification result of the current voice frame to be detected according to the comparison result, wherein the speech classification result comprises a speech frame and a non-speech frame.
5. The speech processing method according to claim 4, wherein said determining speech frames and non-speech frames in the current original speech segment according to the detection result further comprises:
and when the voice classification result of the current voice frame to be detected is determined to be a voice frame, updating the length of the current voice frame in the current original voice segment, and updating the first value according to the length of the current voice frame.
6. The speech processing method according to claim 5, wherein the larger the length of the current speech frame is, the smaller the first value is; the larger the length of the current speech frame is, the smaller the second value is.
7. The speech processing method according to claim 2, wherein the segment fusing the original speech segments to obtain a plurality of target speech segments comprises:
traversing each original voice fragment, and merging the adjacent original voice fragments when the sum of the lengths of the adjacent original voice fragments is determined to be smaller than the second threshold value;
and updating the lengths of the fused original voice segments until the sum of the lengths of all the adjacent voice segments is determined to be greater than or equal to the second threshold value, so as to obtain the target voice segments.
8. A speech processing apparatus, comprising:
the frame dividing module is configured to acquire a voice to be recognized and perform frame dividing processing on the voice to be recognized to obtain a plurality of voice frames to be detected;
the feature extraction module is configured to extract voice features corresponding to each voice frame to be detected;
the classification recognition module is configured to perform classification recognition on the voice features respectively corresponding to the voice frames to be detected to obtain a detection result of each voice frame to be detected;
and the voice segment generation module is configured to segment the voice to be recognized according to the detection result to obtain a plurality of target voice segments, wherein the length of each target voice segment is smaller than or equal to a first threshold value, and the sum of the lengths of the adjacent target voice segments is larger than or equal to a second threshold value.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech processing method of any of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech processing method of any of claims 1 to 7.
CN202010612566.5A 2020-06-30 2020-06-30 Voice processing method, device, electronic equipment and storage medium Active CN111710332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010612566.5A CN111710332B (en) 2020-06-30 2020-06-30 Voice processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111710332A true CN111710332A (en) 2020-09-25
CN111710332B CN111710332B (en) 2023-07-07

Family

ID=72544718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010612566.5A Active CN111710332B (en) 2020-06-30 2020-06-30 Voice processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111710332B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763843A (en) * 2005-11-18 2006-04-26 清华大学 Pronunciation quality evaluating method for language learning machine
CN108847217A (en) * 2018-05-31 2018-11-20 平安科技(深圳)有限公司 A kind of phonetic segmentation method, apparatus, computer equipment and storage medium
CN109065031A (en) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 Voice annotation method, device and equipment
CN110390946A (en) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 A kind of audio signal processing method, device, electronic equipment and storage medium
AU2019101150A4 (en) * 2019-09-30 2019-10-31 Li, Guanchen MR Speaker Identity Recognition System Based on Deep Learning
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951229A (en) * 2021-02-07 2021-06-11 深圳市今视通数码科技有限公司 Voice wake-up method, system and storage medium for physical therapy robot
CN113763932A (en) * 2021-05-13 2021-12-07 腾讯科技(深圳)有限公司 Voice processing method and device, computer equipment and storage medium
CN113763932B (en) * 2021-05-13 2024-02-13 腾讯科技(深圳)有限公司 Speech processing method, device, computer equipment and storage medium
CN113920988A (en) * 2021-12-03 2022-01-11 深圳比特微电子科技有限公司 Voice wake-up method and device and readable storage medium

Also Published As

Publication number Publication date
CN111710332B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN107928673B (en) Audio signal processing method, audio signal processing apparatus, storage medium, and computer device
CN111710332B (en) Voice processing method, device, electronic equipment and storage medium
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
CN111179975A (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN111279414B (en) Segmentation-based feature extraction for sound scene classification
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
CN109801646B (en) Voice endpoint detection method and device based on fusion features
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN112735385B (en) Voice endpoint detection method, device, computer equipment and storage medium
CN111081223B (en) Voice recognition method, device, equipment and storage medium
CN113035202B (en) Identity recognition method and device
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN111933148A (en) Age identification method and device based on convolutional neural network and terminal
CN109065026B (en) Recording control method and device
CN110689885A (en) Machine-synthesized speech recognition method, device, storage medium and electronic equipment
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
KR102220964B1 (en) Method and device for audio recognition
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN112992175B (en) Voice distinguishing method and voice recording device thereof
CN113257238B (en) Training method of pre-training model, coding feature acquisition method and related device
KR101449856B1 (en) Method for estimating user emotion based on call speech
CN113035230A (en) Authentication model training method and device and electronic equipment
CN112185347A (en) Language identification method, language identification device, server and storage medium
KR102334580B1 (en) Apparatus and method for recognizing emotion based on user voice and graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant