CN111402908A - Voice processing method, device, electronic equipment and storage medium - Google Patents

Voice processing method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN111402908A
CN111402908A (application CN202010235282.9A)
Authority
CN
China
Prior art keywords
audio data
data
target audio
sampling
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010235282.9A
Other languages
Chinese (zh)
Inventor
李泽帅
黄远望
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010235282.9A
Publication of CN111402908A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 15/05 - Word boundary detection
    • G10L 15/26 - Speech to text systems
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Analysis-synthesis techniques using predictive techniques
    • G10L 19/16 - Vocoder architecture
    • G10L 19/167 - Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L 19/18 - Vocoders using multiple modes
    • G10L 19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • G10L 19/26 - Pre-filtering or post-filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a voice processing method, a voice processing device, an electronic device, and a storage medium. The method includes: decoding original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than a set threshold, down-sampling the decoded audio data to obtain target audio data; and sending the target audio data to a server to obtain, from the server, the text produced by speech recognition of the target audio data. Audio data with a high sampling rate and/or a high number of sampling bits is thus down-sampled before the target audio data is transmitted to the server, which reduces the amount of data transmitted and improves the data transmission rate.

Description

Voice processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, an electronic device, and a storage medium.
Background
Speech-to-text (STT) systems convert spoken words into text files for subsequent use. A common STT scheme at present is to transmit the collected audio files (in formats such as MP3, M4A, or AMR) directly to a server, which performs speech-to-text conversion on the audio data and returns the recognized text.
To ensure sound quality, the sampling rate, the number of sampling bits, and the bit rate are often raised considerably during recording. This increases the size of the audio file to be transmitted, adds to the burden of transmitting it to the server, and reduces transmission efficiency.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
An embodiment of a first aspect of the present application provides a speech processing method, including:
decoding original coded data obtained by voice sampling to obtain decoded audio data;
if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, down-sampling the decoded audio data to obtain target audio data;
and sending the target audio data to a server side so as to obtain a text obtained by voice recognition of the target audio data from the server side.
As a first possible implementation manner of the embodiment of the present application, the downsampling the decoded audio data includes:
the decoded audio data is down-sampled using a Synchronous Sample Rate Conversion (SSRC) algorithm.
As a second possible implementation manner of the embodiment of the present application, the down-sampling the decoded audio data by using a synchronous sampling rate conversion SSRC algorithm includes:
filtering a sequence of a set length in the decoded audio data with a finite impulse response (FIR) filter;
appending a target sequence of the set length to the filtered sequence of the set length to obtain the input sequence for the Fourier transform; wherein each element of the target sequence has the value zero;
performing fast Fourier transform on the input sequence to obtain a frequency domain sequence;
filtering the frequency domain sequence, and performing inverse fast Fourier transform to obtain a time domain sequence;
and resampling the time domain sequence according to a set down-sampling rate to obtain the target audio data.
As a third possible implementation manner of the embodiment of the present application, before sending the target audio data to the server, the method further includes:
and if the target audio data comprises two-channel data, removing the data of one channel from the two-channel data.
As a fourth possible implementation manner of the embodiment of the present application, the removing of the data of one channel from the two-channel data includes:
determining the data length occupied by the data of a single channel in the target audio data;
and removing from the target audio data, at every interval of that data length, one segment of data of that data length.
As a fifth possible implementation manner of the embodiment of the present application, before sending the target audio data to the server, the method further includes:
performing voice endpoint detection on the target audio data to extract a voiced part and an unvoiced part from the target audio data and remove a mute part;
wherein the energy value of the voiced parts is greater than a first energy threshold;
the energy value of the unvoiced part is greater than a second energy threshold;
the first energy threshold is greater than the second energy threshold.
As a sixth possible implementation manner of the embodiment of the present application, before sending the target audio data to the server, the method further includes:
if the bit rate of the target audio data is lower than the set bit rate, performing compression coding by adopting a linear prediction coding mode;
and if the bit rate of the target audio data is not lower than the set bit rate, performing compression coding by adopting a transform coding mode.
The voice processing method of the embodiment of the application decodes the original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than the set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Audio data with a high sampling rate and/or a high number of sampling bits is thus down-sampled before the target audio data is transmitted to the server, which reduces the amount of data transmitted and improves the data transmission rate.
An embodiment of a second aspect of the present application provides a speech processing apparatus, including:
the decoding module is used for decoding the original coded data obtained by voice sampling to obtain decoded audio data;
the down-sampling module is used for down-sampling the decoded audio data to obtain target audio data if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold;
and the sending module is used for sending the target audio data to a server so as to obtain a text obtained by voice recognition of the target audio data from the server.
The voice processing device of the embodiment of the application decodes the original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than the set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Audio data with a high sampling rate and/or a high number of sampling bits is thus down-sampled before the target audio data is transmitted to the server, which reduces the amount of data transmitted and improves the data transmission rate.
An embodiment of a third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the speech processing method described in the embodiment of the first aspect.
A fourth aspect of the present application is directed to a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech processing method described in the first aspect.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a first speech processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a second speech processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a third speech processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a fourth speech processing method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In the related art, STT already achieves a very high recognition rate on single-channel audio with a 16 kHz sampling rate and 16-bit samples. In the STT process, audio sampled above 16 kHz or with more than 16 bits per sample therefore does little to improve the recognition rate, while the larger audio files increase resource consumption during transmission.
To solve this technical problem, the application provides a speech processing method that decodes original coded data obtained by speech sampling to obtain decoded audio data; if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, down-samples the decoded audio data to obtain target audio data; and sends the target audio data to a server to obtain, from the server, the text produced by speech recognition of the target audio data. The method down-samples audio data with a high sampling rate and/or a high number of sampling bits and then transmits the down-sampled target audio data to the server, which reduces the amount of data transmitted and improves the data transmission rate.
A voice processing method, apparatus, electronic device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a first speech processing method according to an embodiment of the present application.
The embodiment of the present application is described with the voice processing method configured in a voice processing apparatus; the apparatus can be applied to any electronic device, enabling that device to perform the voice processing function.
The electronic device may be a Personal Computer (PC), a cloud device, a mobile device, and the like; the mobile device may be any hardware device with an operating system, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or an in-vehicle device.
As shown in fig. 1, the speech processing method includes the steps of:
step 101, decoding the original encoded data obtained by voice sampling to obtain decoded audio data.
The original coding data obtained by voice sampling refers to original coding data obtained by acquiring a voice signal from hardware equipment and performing analog-to-digital conversion on the voice signal.
A communication system has two broad categories of sources: analog signals and digital signals. For example, the voice signal output by a microphone is an analog signal, while text and computer data are digital signals. Compared with analog signals, digital signals are strongly resistant to interference and do not accumulate noise. Therefore, if the input is an analog signal, it must be digitized in the source coding part of the digital communication system.
After the speech signal is collected from the hardware device, digitizing it requires three steps: sampling, quantization, and encoding. Sampling discretizes the analog signal in time, replacing the continuous signal with a sequence of signal samples taken at regular intervals. Quantization approximates the originally continuous amplitudes with a finite set of values, turning the continuous amplitude of the analog signal into a finite number of discrete values at certain intervals. Encoding converts each quantized value into a group of multi-bit binary codes that represents the sample value, completing the conversion from analog signal to digital signal.
It should be noted that the audio obtained by encoding the collected raw data is stored in an audio file, in a format such as MP3, M4A, AMR, or WAV.
As a possible implementation, Pulse Code Modulation (PCM) may be used to encode the collected raw data. The main process is to sample analog signals such as voice or images at regular intervals to discretize them, round the sampled values to quantization levels, and represent the amplitude of each sampled pulse with a group of binary codes.
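As a rough illustration (not part of the patent), the following Python sketch runs these three steps on a synthetic tone; the 440 Hz frequency, 16 kHz rate, and 16-bit depth are illustrative assumptions:

```python
import numpy as np

# A minimal sketch of sampling, quantization, and encoding (PCM).
sample_rate = 16000                          # samples per second
t = np.arange(0, 1.0, 1.0 / sample_rate)     # sampling: discretize time
analog = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for the analog signal

# Quantization: map continuous amplitudes onto 2^16 discrete levels.
quantized = np.round(analog * 32767).astype(np.int16)

# Encoding: represent each sample as a group of binary codes (here,
# little-endian 16-bit two's complement, i.e. linear PCM).
pcm_bytes = quantized.tobytes()
print(len(pcm_bytes))  # 32000 bytes for 1 s of 16 kHz, 16-bit mono audio
```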
In the embodiment of the present application, after the original coded data obtained by voice sampling is obtained, the original coded data obtained by voice sampling needs to be decoded to obtain decoded audio data.
As a possible implementation manner, an audio file byte stream storing the original encoded data may be written into the input data buffer of MediaCodec; MediaCodec, acting as the consumer, asynchronously reads the byte stream from the data buffer and decodes it, finally producing the decoded audio data.
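The patent's implementation relies on Android's MediaCodec. As a language-neutral stand-in, the sketch below obtains the same end product (raw PCM plus its format parameters) from a WAV container using Python's standard wave module; the file name is hypothetical:

```python
import wave

# Hypothetical stand-in for the decode step: obtain raw PCM audio plus
# its format parameters from a container file.
with wave.open("recording.wav", "rb") as f:
    sample_rate = f.getframerate()      # e.g. 44100
    sample_width = f.getsampwidth()     # bytes per sample, e.g. 2 -> 16 bits
    channels = f.getnchannels()         # 1 = mono, 2 = stereo
    pcm = f.readframes(f.getnframes())  # decoded audio data as bytes
print(sample_rate, sample_width * 8, channels, len(pcm))
```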
Step 102, if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than the set threshold, down-sampling the decoded audio data to obtain target audio data.
The sampling rate of audio data is the number of times the sound signal is sampled per second; the higher the sampling rate, the more faithfully and naturally the sound is reproduced. Down-sampling, also known as decimation, is a multirate digital signal processing technique that lowers the sampling rate of a signal, usually to reduce the data transmission rate or the data size.
In the embodiment of the application, when the sampling rate and/or the number of sampling bits of the decoded audio data is high, the transmitted audio file is large, which increases the transmission load and lowers transmission efficiency. Therefore, in the present application, after the original encoded data obtained by voice sampling is decoded into decoded audio data, if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than the set threshold, the decoded audio data is down-sampled to obtain the target audio data. Down-sampling the decoded audio data reduces its size and thereby helps improve the data transmission rate.
For example, when it is determined that the sampling rate of the decoded audio data is greater than 16 kHz or the number of sampling bits is greater than 16 bits, the decoded audio data is down-sampled to obtain the target audio data, reducing the amount of audio data to transmit.
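Expressed as code, the threshold check of this step might look like the following sketch, with the 16 kHz / 16-bit values taken from the example above:

```python
# Down-sample only when the decoded audio exceeds the set thresholds.
MAX_RATE = 16000   # Hz, set threshold for the sampling rate
MAX_BITS = 16      # set threshold for the number of sampling bits

def needs_downsampling(sample_rate: int, bits_per_sample: int) -> bool:
    return sample_rate > MAX_RATE or bits_per_sample > MAX_BITS

print(needs_downsampling(44100, 16))  # True: rate above threshold
print(needs_downsampling(16000, 16))  # False: already within limits
```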
As a possible implementation manner, when it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than the set threshold, a Synchronous Sampling Rate Conversion (SSRC) algorithm may be used to down-sample the decoded audio data to obtain the target audio data.
It should be noted that when the SSRC algorithm is used to down-sample the decoded audio data, the sampling rates before and after conversion must be integers; the SSRC algorithm does not support conversion between arbitrary frequencies.
In the embodiment of the application, when the SSRC algorithm is used to down-sample the decoded audio data, a finite impulse response (FIR) filter is used to filter a sequence of a set length in the decoded audio data; a target sequence of the set length, in which every element is zero, is appended to the filtered sequence to form the input sequence for the Fourier transform; a fast Fourier transform is applied to the input sequence to obtain a frequency-domain sequence; the frequency-domain sequence is filtered, and an inverse fast Fourier transform is applied to obtain a time-domain sequence; and the time-domain sequence is resampled at the set down-sampling rate to obtain the target audio data.
As an example, the SSRC algorithm uses an n-point fast Fourier transform to down-sample the decoded audio data. First, the first n/2 input samples are passed through a 9-tap FIR digital filter, y(n) = a0·x(n) + a1·x(n-1) + ... + a8·x(n-8), to produce the input of the fast Fourier transform (FFT). The n/2 filtered outputs are padded with n/2 zeros, and a fast Fourier transform is applied to the resulting n data points. The frequency-domain data is windowed and filtered in the complex field, an inverse fast Fourier transform converts it back to time-domain data, the output is trimmed and envelope-processed according to the required length of the resampled data, and the resampled target audio data is finally output.
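The sketch below is a deliberately simplified version of this chain (FFT, frequency-domain filtering, inverse FFT, resampling); it omits the FIR pre-filter, windowing, and envelope processing of the actual SSRC algorithm, and it assumes the target rate divides the source rate evenly:

```python
import numpy as np

def downsample_fft(x: np.ndarray, in_rate: int, out_rate: int) -> np.ndarray:
    """Simplified sketch: frequency-domain low-pass plus decimation.

    Not the SSRC implementation from the patent; only the
    FFT -> frequency-domain filtering -> inverse FFT -> resampling chain.
    Assumes out_rate divides in_rate evenly.
    """
    factor = in_rate // out_rate
    spectrum = np.fft.rfft(x)                    # fast Fourier transform
    cutoff = len(spectrum) // factor             # keep band below new Nyquist
    spectrum[cutoff:] = 0                        # frequency-domain filtering
    filtered = np.fft.irfft(spectrum, n=len(x))  # inverse FFT -> time domain
    return filtered[::factor]                    # resample by decimation

x = np.random.randn(48000)            # 1 s of toy audio at 48 kHz
y = downsample_fft(x, 48000, 16000)   # -> 16000 samples at 16 kHz
print(len(y))
```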
Step 103, sending the target audio data to the server to obtain, from the server, the text produced by speech recognition of the target audio data.
In the embodiment of the application, after the decoded audio data is down-sampled to obtain the target audio data, the target audio data is sent to the server. The server performs speech recognition on the received target audio data to produce the corresponding text, which can then be obtained from the server.
According to the voice processing method, original coded data obtained by voice sampling is decoded to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than the set threshold, the decoded audio data is down-sampled to obtain target audio data; and the target audio data is sent to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Audio data with a high sampling rate and/or a high number of sampling bits is thus down-sampled before transmission, which reduces the amount of data transmitted and improves the data transmission rate.
On the basis of the foregoing embodiment, after the decoded audio data is down-sampled in step 102 to obtain the target audio data, if the target audio data includes two-channel data, the data of one channel needs to be removed to reduce the volume of the target audio data to be transmitted, which helps improve the data transmission rate. Referring to fig. 2, fig. 2 is a schematic flowchart of a second speech processing method according to an embodiment of the present application.
As shown in fig. 2, the speech processing method may further include the following steps:
Step 201, determining the data length occupied by the data of a single channel in the target audio data.
The data length refers to the number of bytes the data occupies.
In the embodiment of the present application, the decoded audio data whose sampling rate and/or number of sampling bits is greater than the set threshold is down-sampled, and the resulting target audio data may include two-channel data. Since two-channel data occupies twice the space of single-channel data, one of the two channels can be discarded to reduce the amount of data to transmit.
Specifically, after the target audio data is obtained, the data length occupied by the data of a single channel in the target audio data can be determined. For example, a single channel's sample may occupy 2 bytes or 1 byte.
Step 202, removing from the target audio data, at every interval of the data length, one segment of data of that data length.
In the embodiment of the application, after the data length occupied by a single channel's data in the target audio data is determined, one segment of data of that length can be removed at every interval of that length.
For example, if a single channel's sample in the target audio data occupies 2 bytes, then 2 bytes of data can be removed after every 2 bytes, leaving single-channel target audio data.
For example, the right-channel data in the target audio data can be removed (keeping the left channel) with the following formula:
f(n) = f(0) + f(1) + f(4) + f(5) + ... + f(4k) + f(4k+1);
and the left-channel data in the target audio data can be removed (keeping the right channel) with the following formula:
f(n) = f(2) + f(3) + f(6) + f(7) + ... + f(4k+2) + f(4k+3).
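In code, the removal amounts to a strided slice over the interleaved samples. The sketch below assumes 16-bit samples interleaved left-first, which is an assumption rather than something the patent specifies:

```python
import numpy as np

# Sketch of removing one channel from interleaved 16-bit stereo PCM.
# Samples interleave L0 R0 L1 R1 ..., each occupying 2 bytes, so removing
# one data-length segment per data-length interval is a stride-2 slice.
stereo = np.arange(16, dtype=np.int16)  # toy interleaved stereo samples
left = stereo[0::2]          # keep the left channel (drop the right)
right = stereo[1::2]         # or keep the right channel (drop the left)
mono_bytes = left.tobytes()  # half the size of the stereo payload
print(left, right, len(mono_bytes))
```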
According to the voice processing method, if the target audio data includes two-channel data, one channel's data can be removed by determining the data length occupied by a single channel's data in the target audio data and then removing, at every interval of that data length, one segment of data of that length. The two-channel target audio data is thus reduced to single-channel target audio data, which lowers the volume of target audio data to transmit and helps improve the data transmission rate.
In a possible case, after the decoded audio data is down-sampled to obtain the target audio data and before the target audio data is sent to the server, voice endpoint detection may be performed on the target audio data to extract the voiced part and the unvoiced part and remove the mute part. Long mute periods are thus identified and eliminated from the target audio data, saving speech channel resources without degrading service quality. The process is described in detail with reference to fig. 3, which is a flowchart illustrating a third speech processing method according to an embodiment of the present application.
As shown in fig. 3, the speech processing method may further include the following steps:
step 301, decoding the original encoded data obtained by voice sampling to obtain decoded audio data.
Step 302, if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than the set threshold, down-sampling the decoded audio data to obtain the target audio data.
In the embodiment of the present application, the implementation processes of step 301 and step 302 may refer to the implementation processes of step 101 and step 102 in the foregoing embodiment, and are not described herein again.
Step 303, performing voice endpoint detection according to the target audio data to extract a voiced part and an unvoiced part from the target audio data and remove a mute part.
Voice endpoint detection, also known as voice activity detection (VAD), is used to identify which portions of the target audio data contain speech and which do not.
In the present application, the target audio data obtained by down-sampling the decoded audio data may include a voiced part, an unvoiced part, and a mute part. To reduce the volume of the audio file when transmitting the target audio data, voice endpoint detection may be performed on the target audio data to extract the voiced part and the unvoiced part and remove the mute part.
When performing voice endpoint detection on the target audio data, the audio data may first be divided into frames. Features are then extracted from each frame, a classifier is trained on a set of frames from regions known to contain speech or silence, and each unknown frame is classified as belonging to the voiced part, the unvoiced part, or the mute part.
As one possibility, when extracting features from the target audio data, the energy of each frame may be computed. It should be noted that the energy of the voiced part is greater than a first energy threshold and the energy of the unvoiced part is greater than a second energy threshold, where the first energy threshold is greater than the second. The voiced and unvoiced parts of the target audio data can therefore be extracted by setting energy thresholds.
As a possible implementation manner, when performing voice endpoint detection on the target audio data, the audio data may be divided into frames, the energy of each frame is extracted and compared with the first energy threshold, and any frame whose energy is greater than the first energy threshold is determined to be part of the voiced part. The voiced part can thus be extracted from the target audio data.
Further, the energy of each frame remaining after the voiced part is extracted is compared with the second energy threshold, and any frame whose energy is greater than the second energy threshold is determined to be part of the unvoiced part.
In the embodiment of the application, the unvoiced part and the mute part of the target audio data can be distinguished by a short-time zero-crossing rate threshold. Once they are distinguished, the mute part can be removed; discarding the silent audio segments reduces the volume of the voice file that must be transmitted in the STT process.
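A toy version of this two-threshold, zero-crossing-rate scheme is sketched below; the frame length and all threshold values are illustrative assumptions, not values from the patent:

```python
import numpy as np

def classify_frames(x, frame_len=320, e1=0.1, e2=0.01, zcr_thr=0.25):
    """Toy energy/zero-crossing VAD sketch (thresholds are illustrative).

    Frames with energy above e1 are voiced; frames above e2 with a high
    short-time zero-crossing rate are unvoiced; the rest count as silence
    and are discarded before transmission.
    """
    kept = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[i:i + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        if energy > e1:                      # voiced part
            kept.append(frame)
        elif energy > e2 and zcr > zcr_thr:  # unvoiced part
            kept.append(frame)
        # else: mute part -> dropped
    return np.concatenate(kept) if kept else np.array([])

x = np.concatenate([np.zeros(3200),                            # silence
                    0.8 * np.sin(np.linspace(0, 200, 3200))])  # "speech"
print(len(classify_frames(x)))  # the silent frames are removed
```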
Step 304, sending the target audio data to the server to obtain, from the server, the text produced by speech recognition of the target audio data.
It should be noted that the target audio data sent to the server side in step 304 is data after the mute section is removed.
In the embodiment of the present application, the implementation process of step 304 may refer to the implementation process of step 103 in the foregoing embodiment, and is not described herein again.
The voice processing method of the embodiment of the application decodes the original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than a set threshold, down-samples the decoded audio data to obtain target audio data; performs voice endpoint detection on the target audio data to extract the voiced part and the unvoiced part and remove the mute part; and sends the target audio data to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Removing the mute part before sending the target audio data to the server reduces the amount of audio data transmitted and improves the data transmission rate.
In a possible case, after the decoded audio data is down-sampled to obtain the target audio data and before the target audio data is sent to the server, the bit rate of the target audio data can be compared with a set bit rate to choose the coding mode for compression-encoding the target audio data; the compression-encoded target audio data is then sent to the server. The process is described in detail with reference to fig. 4, which is a flowchart illustrating a fourth speech processing method according to an embodiment of the present application.
As shown in fig. 4, before the step 103, the following steps may be further included:
step 401, decoding the original encoded data obtained by voice sampling to obtain decoded audio data.
Step 402, if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than the set threshold, down-sampling the decoded audio data to obtain the target audio data.
In the embodiment of the present application, the implementation processes of step 401 and step 402 may refer to the implementation processes of step 101 and step 102 in the foregoing embodiment, and are not described herein again.
In step 403, the bit rate of the target audio data is compared with the set bit rate.
The bit rate of audio data is the amount of binary data per unit time after the analog sound signal is converted into a digital sound signal; it is an indirect measure of audio quality. The higher the bit rate, the better the audio quality, but the larger the encoded audio file; the lower the bit rate, the poorer the audio quality, but the smaller the encoded audio file.
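For instance, the uncompressed bit rate follows directly from the sampling parameters; the short sketch below works through the arithmetic for the 16 kHz / 16-bit mono example used earlier:

```python
# Bit rate of uncompressed PCM = sampling rate x bits per sample x channels.
rate_hz, bits, channels = 16000, 16, 1   # values from the earlier example
bit_rate = rate_hz * bits * channels     # 256,000 bit/s = 256 kbit/s
bytes_per_minute = bit_rate / 8 * 60     # 1,920,000 bytes, about 1.83 MiB
print(bit_rate, bytes_per_minute)
```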
It can be understood that if the target audio data were transmitted to the server without compression coding, it would occupy a very large bandwidth, and the huge data volume would put pressure on the transmission and storage of the audio data. Therefore, after the target audio data is obtained, it can be compression-encoded to reduce the volume of the audio file during data transmission, thereby reducing the amount of data transmitted.
For example, after the decoded audio data is down-sampled and compression-encoded, the audio data is sent to the server side, which reduces the data transmission amount by about 80% compared with the case of directly sending the decoded audio data.
In the embodiment of the application, after the target audio data is obtained, the bit rate of the target audio data is compared with the set bit rate to determine the mode of compression coding the target audio data.
As a possible implementation manner, after the target audio data is obtained, the target audio data may be compression-encoded by using Opus encoding.
Opus combines two coding algorithms, SILK and CELT, and has low algorithmic delay and a very high compression ratio. Both the encoding end and the decoding end use filters provided by Broadcom; during encoding, the pre-filter preserves the low-frequency part of the audio signal and attenuates the high-frequency part, improving coding efficiency.
Opus can seamlessly adjust high and low bit rates, using linear prediction coding at lower bit rates and transform coding at high bit rates inside the encoder. Therefore, in the embodiment of the present application, the bit rate of the target audio data is compared with the set bit rate to determine which encoding method is used for compression encoding.
In step 404, if the bit rate of the target audio data is lower than the set bit rate, a linear predictive coding method is adopted for compression coding.
In one possible case, if the bit rate of the target audio data is lower than the set bit rate, the target audio data is compression-encoded using linear predictive coding.
Linear predictive coding is used mainly in audio signal processing and speech processing; based on a linear predictive model, it represents the spectral envelope of a digital speech signal in compressed form.
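As an illustrative sketch (the patent does not specify the LPC variant; this uses the textbook autocorrelation method), the following code derives prediction coefficients for one frame and shows that the residual to be encoded is much smaller than the signal itself:

```python
import numpy as np

def lpc_coefficients(frame: np.ndarray, order: int = 8) -> np.ndarray:
    """Sketch of the autocorrelation method for linear predictive coding.

    Each sample is modeled as a weighted sum of the previous `order`
    samples; the weights solve the normal equations R a = r built from
    the frame's autocorrelation. Transmitting the coefficients plus the
    small prediction residual is what compresses the signal.
    """
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a, *_ = np.linalg.lstsq(R, r[1:], rcond=None)
    return a

rng = np.random.default_rng(0)
frame = np.sin(0.3 * np.arange(400)) + 0.01 * rng.standard_normal(400)
a = lpc_coefficients(frame)
predicted = np.array([a @ frame[n - len(a):n][::-1]
                      for n in range(len(a), len(frame))])
residual = frame[len(a):] - predicted
print(residual.var() / frame.var())  # residual energy << signal energy
```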
Step 405, if the bit rate of the target audio data is not lower than the set bit rate, performing compression coding by using a transform coding mode.
Transform coding does not encode the spatial-domain signal directly; instead, it maps the signal into another orthogonal vector space (the transform domain or frequency domain) to produce a set of transform coefficients, which are then encoded. It is an indirect coding method: in the time or spatial domain, the data are highly correlated and highly redundant, whereas describing them in the transform domain greatly reduces the correlation and redundancy, makes the parameters independent, and shrinks the amount of data, so quantizing and encoding in the transform domain achieves a much larger compression ratio.
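The following sketch illustrates the idea with a DCT as the orthogonal transform (an assumption; the patent does not name a specific transform): after the transform, most coefficients quantize to zero, which is where the compression comes from.

```python
import numpy as np
from scipy.fft import dct, idct  # assumes SciPy is available

# Sketch of transform coding: move a frame into an orthogonal transform
# domain (here a DCT), where the energy concentrates in few coefficients,
# then quantize so that most coefficients become zero.
frame = np.sin(0.05 * np.arange(256)) + 0.3 * np.sin(0.11 * np.arange(256))

coeffs = dct(frame, norm="ortho")          # decorrelating transform
step = 0.05
quantized = np.round(coeffs / step)        # uniform quantization
kept = np.count_nonzero(quantized)         # most coefficients quantize to 0

decoded = idct(quantized * step, norm="ortho")  # receiver side
print(kept, np.max(np.abs(decoded - frame)))   # few coeffs, small error
```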
Step 406, sending the compressed and encoded target audio data to the server side, so as to obtain a text obtained by speech recognition of the target audio data from the server side.
The voice processing method of the embodiment of the application decodes the original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than a set threshold, down-samples the decoded audio data to obtain target audio data; performs compression coding with linear predictive coding if the bit rate of the target audio data is lower than the set bit rate, and with transform coding if it is not; and sends the compression-encoded target audio data to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Compression-encoding the audio data before transmission reduces the volume of data transmitted and improves the data transmission rate.
It should be noted that, on the basis of the foregoing embodiments, the original encoded data obtained by voice sampling is decoded to obtain decoded audio data; if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, the decoded audio data is down-sampled to obtain target audio data; if the target audio data is found to include two-channel data, one channel's data is removed; and voice endpoint detection is then performed on the resulting single-channel target audio data to remove the mute part. Further, the audio data with the mute part removed is compression-encoded, and the compression-encoded audio data is sent to the server. This reduces the amount of data transmitted and improves data transmission efficiency.
In order to implement the above embodiments, the present application further provides a speech processing apparatus.
Fig. 5 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application.
As shown in fig. 5, the speech processing apparatus 500 may include: a decoding module 510, a down-sampling module 520, and a transmitting module 530.
The decoding module 510 is configured to decode original encoded data obtained by speech sampling to obtain decoded audio data.
The down-sampling module 520 is configured to down-sample the decoded audio data to obtain target audio data if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold.
The sending module 530 is configured to send the target audio data to the server, so as to obtain a text obtained by speech recognition on the target audio data from the server.
As a possible scenario, the down-sampling module 520 may further be configured to:
the decoded audio data is down-sampled using a synchronous sample rate conversion SSRC algorithm.
As another possible scenario, the down-sampling module 520 may further be configured to:
filtering a sequence of a set length in the decoded audio data with a finite impulse response (FIR) filter;
appending a target sequence of the set length, in which every element is zero, to the filtered sequence to obtain the input sequence for the Fourier transform;
performing fast Fourier transform on the input sequence to obtain a frequency domain sequence;
filtering the frequency domain sequence, and performing inverse fast Fourier transform to obtain a time domain sequence;
and re-sampling the time domain sequence according to the set down-sampling rate to obtain target audio data.
As another possible case, the speech processing apparatus 500 may further include:
and the removing module is used for removing the data of one channel from the two-channel data if the target audio data comprises two-channel data.
As another possible scenario, the removing module may be further configured to:
determining the data length occupied by the data of a single channel in the target audio data;
and removing from the target audio data, at every interval of that data length, one segment of data of that data length.
As another possible case, the speech processing apparatus 500 may further include:
the detection module is used for carrying out voice endpoint detection according to the target audio data so as to extract a voiced part and an unvoiced part from the target audio data and remove a mute part;
wherein the energy value of the voiced parts is greater than a first energy threshold;
the energy value of the unvoiced part is greater than a second energy threshold;
the first energy threshold is greater than the second energy threshold.
As another possible case, the speech processing apparatus 500 may further include:
the compression coding module is used for performing compression coding by adopting a linear prediction coding mode if the bit rate of the target audio data is lower than a set bit rate; and if the bit rate of the target audio data is not lower than the set bit rate, performing compression coding by adopting a transform coding mode.
It should be noted that the foregoing explanation of the embodiment of the speech processing method is also applicable to the speech processing apparatus of the embodiment, and is not repeated here.
The voice processing device of the embodiment of the application decodes the original coded data obtained by voice sampling to obtain decoded audio data; if the sampling rate and/or the number of sampling bits of the decoded audio data is determined to be greater than the set threshold, down-samples the decoded audio data to obtain target audio data; and sends the target audio data to the server to obtain, from the server, the text produced by speech recognition of the target audio data. Audio data with a high sampling rate and/or a high number of sampling bits is thus down-sampled before transmission, which reduces the amount of data transmitted and improves the data transmission rate.
In order to implement the foregoing embodiments, the present application also proposes an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the electronic device implements the speech processing method in the foregoing embodiments.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method as in the above embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method of speech processing, the method comprising:
decoding original coded data obtained by voice sampling to obtain decoded audio data;
if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold, down-sampling the decoded audio data to obtain target audio data;
and sending the target audio data to a server side so as to obtain a text obtained by voice recognition of the target audio data from the server side.
2. The speech processing method of claim 1, wherein the downsampling the decoded audio data comprises:
the decoded audio data is down-sampled using a Synchronous Sample Rate Conversion (SSRC) algorithm.
3. The speech processing method of claim 2 wherein the down-sampling the decoded audio data using a Synchronous Sample Rate Conversion (SSRC) algorithm comprises:
filtering a sequence of a set length in the decoded audio data with a finite impulse response (FIR) filter;
appending a target sequence of the set length to the filtered sequence of the set length to obtain the input sequence for the Fourier transform; wherein each element of the target sequence has the value zero;
performing fast Fourier transform on the input sequence to obtain a frequency domain sequence;
filtering the frequency domain sequence, and performing inverse fast Fourier transform to obtain a time domain sequence;
and resampling the time domain sequence according to a set down-sampling rate to obtain the target audio data.
4. The speech processing method according to claim 1, wherein before sending the target audio data to the server, the method further comprises:
and if the target audio data comprises two-channel data, removing the data of one channel from the two-channel data.
5. The speech processing method of claim 4, wherein said removing the data of one channel from the two-channel data comprises:
determining the data length occupied by the data of a single channel in the target audio data;
and removing from the target audio data, at every interval of that data length, one segment of data of that data length.
6. The speech processing method according to claim 1, wherein before sending the target audio data to the server, the method further comprises:
performing voice endpoint detection on the target audio data to extract a voiced part and an unvoiced part from the target audio data and remove a mute part;
wherein the energy value of the voiced parts is greater than a first energy threshold;
the energy value of the unvoiced part is greater than a second energy threshold;
the first energy threshold is greater than the second energy threshold.
7. The speech processing method according to any one of claims 1 to 6, wherein before sending the target audio data to the server, the method further comprises:
if the bit rate of the target audio data is lower than the set bit rate, performing compression coding by adopting a linear prediction coding mode;
and if the bit rate of the target audio data is not lower than the set bit rate, performing compression coding by adopting a transform coding mode.
8. A speech processing apparatus, comprising:
the decoding module is used for decoding the original coded data obtained by voice sampling to obtain decoded audio data;
the down-sampling module is used for down-sampling the decoded audio data to obtain target audio data if it is determined that the sampling rate and/or the number of sampling bits of the decoded audio data is greater than a set threshold;
and the sending module is used for sending the target audio data to a server so as to obtain a text obtained by voice recognition of the target audio data from the server.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech processing method according to any of claims 1-7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method according to any one of claims 1 to 7.
CN202010235282.9A 2020-03-30 2020-03-30 Voice processing method, device, electronic equipment and storage medium Pending CN111402908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010235282.9A CN111402908A (en) 2020-03-30 2020-03-30 Voice processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111402908A true CN111402908A (en) 2020-07-10

Family

ID=71431364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010235282.9A Pending CN111402908A (en) 2020-03-30 2020-03-30 Voice processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111402908A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060158356A1 (en) * 2005-01-19 2006-07-20 Samsung Electronics Co., Ltd. Method and apparatus for encoding and decoding digital signals
CN101714379A (en) * 2008-10-08 2010-05-26 安凯(广州)软件技术有限公司 Audio resampling method
CN103035238A (en) * 2012-11-27 2013-04-10 中国科学院自动化研究所 Encoding method and decoding method of voice frequency data
CN103915097A (en) * 2013-01-04 2014-07-09 ***通信集团公司 Voice signal processing method, device and system
CN104123943A (en) * 2013-04-28 2014-10-29 安凯(广州)微电子技术有限公司 Audio signal resampling method and apparatus
CN106605263A (en) * 2014-07-29 2017-04-26 奥兰吉公司 Determining a budget for LPD/FD transition frame encoding
CN107135301A (en) * 2016-02-29 2017-09-05 宇龙计算机通信科技(深圳)有限公司 A kind of audio data processing method and device
CN109801642A (en) * 2018-12-18 2019-05-24 百度在线网络技术(北京)有限公司 Down-sampled method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG Zhiyong: "MATLAB Speech Signal Analysis and Synthesis, 2nd Edition", Beihang University Press, article by LIANG Jing et al., pages: 117-118 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111855210A (en) * 2020-07-30 2020-10-30 安徽大学 Motor bearing fault diagnosis method and device based on signal enhancement and compression edge calculation
CN112634857A (en) * 2020-12-15 2021-04-09 京东数字科技控股股份有限公司 Voice synthesis method and device, electronic equipment and computer readable medium
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
CN113192524A (en) * 2021-04-28 2021-07-30 北京达佳互联信息技术有限公司 Audio signal processing method and device
CN113192524B (en) * 2021-04-28 2023-08-18 北京达佳互联信息技术有限公司 Audio signal processing method and device
WO2022267754A1 (en) * 2021-06-22 2022-12-29 腾讯科技(深圳)有限公司 Speech coding method and apparatus, speech decoding method and apparatus, computer device, and storage medium
CN113393851A (en) * 2021-06-23 2021-09-14 紫优科技(深圳)有限公司 Method, system, electronic device and storage medium for transmitting voice
CN113689865A (en) * 2021-08-24 2021-11-23 广东优碧胜科技有限公司 Sampling rate switching method and device, electronic equipment and voice system
CN113782043A (en) * 2021-09-06 2021-12-10 北京捷通华声科技股份有限公司 Voice acquisition method and device, electronic equipment and computer readable storage medium
CN113835689A (en) * 2021-09-14 2021-12-24 深圳市长龙铁路电子工程有限公司 Decoding algorithm verification method and device, electronic equipment and storage medium
WO2024001405A1 (en) * 2022-07-01 2024-01-04 哲库科技(上海)有限公司 Audio processing method and apparatus, and chip, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN111402908A (en) Voice processing method, device, electronic equipment and storage medium
TWI488177B (en) Linear prediction based coding scheme using spectral domain noise shaping
KR100859881B1 (en) Coding of audio signals
KR101019398B1 (en) Processing of excitation in audio coding and decoding
CN112767954B (en) Audio encoding and decoding method, device, medium and electronic equipment
AU2010256191B2 (en) Compression coding and decoding method, coder, decoder and coding device
JP2001094433A (en) Sub-band coding and decoding medium
EP2041745A1 (en) Adaptive encoding and decoding methods and apparatuses
KR20120125513A (en) Encoder for audio signal including generic audio and speech frames
US10229688B2 (en) Data compression apparatus, computer-readable storage medium having stored therein data compression program, data compression system, data compression method, data decompression apparatus, data compression/decompression apparatus, and data structure of compressed data
CN101518083A (en) Method, medium, and system encoding and/or decoding audio signals by using bandwidth extension and stereo coding
CN102150202A (en) Method and apparatus to encode and decode an audio/speech signal
KR20100089772A (en) Method of coding/decoding audio signal and apparatus for enabling the method
KR20150096494A (en) Generation of a comfort noise with high spectro-temporal resolution in discontinuous transmission of audio signals
KR102251833B1 (en) Method and apparatus for encoding/decoding audio signal
KR20060036724A (en) Method and apparatus for encoding/decoding audio signal
RU2752520C1 (en) Controlling the frequency band in encoders and decoders
US11176954B2 (en) Encoding and decoding of multichannel or stereo audio signals
RU2321168C2 (en) Method for transforming an arbitrarily changing signal
JP2006023658A (en) Audio signal encoding apparatus and audio signal encoding method
Luo et al. Wideband audio over narrowband based on digital watermarking
Mohdar et al. Audio compression testing tool for multimedia applications
Saleh et al. A comparative study of different compression laws of speech encoding and regeneration technique
KR20220005379A (en) Apparatus and method for encoding/decoding audio that is robust against coding distortion in transition section
KR100587613B1 (en) Audio signal coding device and coding method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200710)