CN113450809B - Voice data processing method, system and medium - Google Patents

Voice data processing method, system and medium

Info

Publication number
CN113450809B
Authority
CN
China
Prior art keywords: coding, rate, audio frame, code rate, current audio
Legal status: Active
Application number: CN202110999948.2A
Other languages: Chinese (zh)
Other versions: CN113450809A
Inventors: ***, Zhu Yong (朱勇), Wang Yao (王尧), Ye Dongxiang (叶东翔)
Current Assignee: Barrot Wireless Co Ltd
Original Assignee: Barrot Wireless Co Ltd
Application filed by Barrot Wireless Co Ltd
Priority to CN202110999948.2A
Publication of CN113450809A
Application granted
Publication of CN113450809B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - ... using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/18 - Vocoders using multiple modes
    • G10L19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)

Abstract

The application discloses a voice data processing method, system, and medium, belonging to the technical field of Bluetooth audio data processing. The method comprises: analyzing the current audio frame to be encoded and determining the coding rate corresponding to that frame, where the coding rate is one of a full rate, a medium rate, and a low rate, the full rate being greater than the medium rate and the medium rate greater than the low rate; setting the number of retransmissions according to the coding rate, the coding rate being positively correlated with the retransmission count; and encoding the current audio frame at that rate and transmitting the encoding result. The method selects the coding rate according to the characteristics of each audio frame and sets a different retransmission count accordingly. Audio frames containing no valid speech are retransmitted fewer times, or not at all, which avoids wasting audio transmission bandwidth, reduces power consumption, and lowers latency.

Description

Voice data processing method, system and medium
Technical Field
The present application relates to the field of bluetooth audio data processing technologies, and in particular, to a method, a system, and a medium for processing voice data.
Background
In the latest Bluetooth Low Energy Audio (LE Audio) specification, the Connected Isochronous Stream (CIS), a point-to-point isochronous transmission link, is introduced to achieve low-latency audio transmission. When a CIS link is established, the corresponding Quality of Service (QoS) parameters are configured. These include the Flush Timeout (FT), the isochronous interval (ISO Interval), the maximum Number of Sub-Events per interval (NSE), and the number of packets allowed per interval (Burst Number, BN). Together these parameters determine the maximum number of retransmissions of each packet, and they cannot be modified during transmission.
When a data packet has reached its maximum number of retransmissions without being correctly received, the packet (including any audio it carries) is discarded, which causes audible stuttering at the receiving end.
The patent "Data transmission method, device, equipment, system and medium" (CN202080001621.5) proposes a method to alleviate the stuttering, briefly as follows: first, the receiving end monitors the packet loss rate and sends indication information to the transmitting end when the loss rate exceeds a set threshold; after receiving the indication, the transmitting end reduces its audio coding rate and generates multiple copies of the same audio frame. The rate reduction makes each frame smaller, so one data packet can carry several copies of the same frame. Although the overall bit rate is unchanged, the transmitting end sends each frame more times, which is equivalent to increasing the number of retransmissions and thereby reduces the probability of frame loss.
That method can effectively reduce the probability of subsequent stuttering, but applied to voice communication it has the following drawbacks. In a voice call, either party is speaking only about 35% of the time and is silent the rest of the time, so a lost packet sometimes contains no valid speech, and whether or not it is retransmitted has little effect on voice quality; the method nevertheless retransmits it, wasting bandwidth while contributing little to quality. In addition, data packets without valid speech are still retransmitted repeatedly, wasting valuable bandwidth, increasing power consumption, and adding delay.
Disclosure of Invention
The application provides a voice data processing method, system, and medium, aiming to solve the problems in the prior art that, when audio data is retransmitted after audio stuttering, audio frames are not distinguished: frames without valid speech are still retransmitted, wasting audio transmission bandwidth and increasing power consumption and delay.
In one aspect of the present application, a voice data processing method is provided, comprising: analyzing the current audio frame to be encoded and determining the coding rate corresponding to it, where the coding rate is one of a full rate, a medium rate, and a low rate, the full rate being greater than the medium rate and the medium rate greater than the low rate; setting the number of retransmissions according to the coding rate, the coding rate being positively correlated with the retransmission count; and encoding the current audio frame at that rate and transmitting the encoding result.
Optionally, analyzing the current audio frame to determine its coding rate includes: detecting valid speech in the current audio frame; if the valid speech contained in the frame meets a preset requirement, setting the coding rate to the full rate; if it does not, detecting the effective bandwidth of the frame; if the effective bandwidth is within a preset bandwidth range, setting the coding rate to the medium rate; otherwise, setting it to the low rate.
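The decision tree above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the bandwidth window, the retransmission counts, and all names are assumptions.

```python
from enum import Enum

class Rate(Enum):
    LOW = 1
    MEDIUM = 2
    FULL = 3

def select_rate(has_valid_speech: bool, effective_bw_hz: float,
                bw_lo_hz: float = 4000.0, bw_hi_hz: float = 12000.0) -> Rate:
    """Choose a coding rate for one frame per the optional embodiment above."""
    if has_valid_speech:
        return Rate.FULL                      # valid speech: encode at full rate
    if bw_lo_hz <= effective_bw_hz <= bw_hi_hz:
        return Rate.MEDIUM                    # narrow but energetic band
    return Rate.LOW                           # e.g. a silent frame

# Retransmission count positively correlated with the rate (assumed values).
RETRANSMISSIONS = {Rate.FULL: 4, Rate.MEDIUM: 2, Rate.LOW: 0}
```

A silent frame thus gets the low rate and, under these assumed values, no retransmissions at all.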
Optionally, detecting the valid speech contained in the current audio frame includes one or more of: detecting the pitch-present flag of the current audio frame, where if the flag equals a preset value the valid speech in the frame meets the preset requirement; detecting the first pitch lag of the current audio frame, where if the difference between the first pitch lag and the second pitch lag of the previous audio frame exceeds a preset threshold the valid speech meets the preset requirement; and/or detecting the speech-band energy entropy of the current audio frame, where if the energy entropy is smaller than a first preset entropy threshold the valid speech meets the preset requirement.
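Combining the three criteria above as alternatives gives a sketch like the following; the thresholds (15 for the pitch-lag difference, 0.6 for the entropy, matching values suggested later in the description) are illustrative defaults, not mandated values.

```python
def has_valid_speech(pitch_present: int, pitch_curr: int, pitch_last: int,
                     band_energy_entropy: float,
                     pitch_diff_threshold: int = 15,
                     entropy_threshold: float = 0.6) -> bool:
    """A frame counts as valid speech if any one criterion fires."""
    if pitch_present == 1:                                   # pitch flag set
        return True
    if abs(pitch_curr - pitch_last) > pitch_diff_threshold:  # fast pitch change
        return True
    return band_energy_entropy < entropy_threshold           # energy concentrated
```

A frame with the pitch flag cleared, a stable pitch lag, and a flat spectrum would therefore be classified as not containing valid speech.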
Optionally, the voice data processing method of the present application further includes: if the valid speech contained in the current audio frame does not meet the preset requirement, detecting the speech-band energy entropy of the frame, and if the energy entropy is smaller than a second preset entropy threshold, setting the coding rate to the medium rate.
Optionally, the voice data processing method of the present application further includes: the audio receiving end decodes the received encoding result and analyzes the decoding result. If the decoding result has errors and the coding rate of the result is the full rate, a retransmission request is sent to the audio transmitting end; if the decoding result has errors and the coding rate is the medium rate, a retransmission request is likewise sent; and if the decoding result has errors and the coding rate is the low rate, the packet loss concealment (PLC) service is started to conceal the error.
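The receiver-side policy just described can be summarised in a few lines; the string labels are placeholders for whatever signalling the link layer actually uses.

```python
def receiver_action(decode_error: bool, rate: str) -> str:
    """Map a decode outcome and coding rate to the receiver's reaction."""
    if not decode_error:
        return "accept"
    if rate in ("full", "medium"):
        return "request_retransmission"   # the frame likely carries speech
    return "plc_conceal"                  # low rate: conceal locally, no retransmit
```

The low-rate branch is what saves bandwidth: errors in frames without valid speech are concealed locally instead of triggering a retransmission.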
Optionally, encoding the current audio frame at the coding rate and transmitting the result includes: if the audio transmitting end receives a retransmission request, judging the coding rate of the current audio frame; if the frame was encoded at the full rate, reducing the full rate, reassembling the frame data of the encoding result according to the reduced, updated coding rate, and retransmitting the reassembled result until the audio receiving end successfully receives it or the retransmission count reaches a first retransmission limit; and if the coding rate of the current audio frame is the medium rate, retransmitting the frame data of the encoding result directly until the audio receiving end successfully receives it or the retransmission count reaches a second retransmission limit.
Optionally, the reassembly process comprises: parsing the frame data of the encoding result produced at the full rate, and writing the frame data into the bitstream corresponding to the updated coding rate in a preset order to obtain the reassembled result.
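As a very rough sketch of reassembling a full-rate payload into a smaller bitstream, one could keep a leading fraction of the bytes proportional to the reduced rate. A real codec would instead parse and rewrite individual bitstream fields in the preset order; this byte-truncation stand-in only illustrates the size bookkeeping.

```python
def repack_frame(full_rate_payload: bytes, updated_rate_bps: int,
                 full_rate_bps: int) -> bytes:
    """Illustrative repack: keep the leading fraction of the payload
    proportional to the reduced rate (not a real bitstream rewrite)."""
    if not 0 < updated_rate_bps <= full_rate_bps:
        raise ValueError("updated rate must be positive and <= full rate")
    keep = max(1, len(full_rate_payload) * updated_rate_bps // full_rate_bps)
    return full_rate_payload[:keep]
```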
Optionally, the coding rate corresponding to the audio receiving end is judged; if it is the full rate, the PLC service is started to conceal the error.
Optionally, if the coding rate at the audio receiving end is the full rate, starting the PLC service to conceal the error further comprises: judging whether the decoding result of the previous audio frame has errors; if it does, the audio receiving end sends a retransmission request; and if it does not, the audio receiving end starts the PLC service to conceal the error.
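The full-rate refinement above, conceal an isolated bad frame but escalate when two consecutive frames fail, can be sketched as:

```python
def full_rate_receiver_action(decode_error: bool,
                              prev_frame_had_error: bool) -> str:
    """Full-rate receiver policy from the optional embodiment above."""
    if not decode_error:
        return "accept"
    if prev_frame_had_error:              # two consecutive bad frames
        return "request_retransmission"
    return "plc_conceal"                  # isolated error: conceal locally
```

This reflects the common PLC limitation that concealment works well for a single lost frame but degrades quickly over consecutive losses.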
In another aspect of the present application, a voice data processing system is provided, comprising: a rate determination module that analyzes the current audio frame to be encoded and determines its coding rate, where the coding rate is one of a full rate, a medium rate, and a low rate, the full rate being greater than the medium rate and the medium rate greater than the low rate; a retransmission determination module that sets the retransmission count according to the coding rate, the coding rate being positively correlated with the retransmission count; and an encoding and transmission module that encodes the current audio frame at the coding rate and transmits the encoding result.
The beneficial effects of the application are as follows: the voice data processing method detects the audio frame to be encoded, selects the coding rate according to the characteristics of the frame, and sets a corresponding retransmission count. When the audio frame contains valid speech, it is encoded at a higher rate and a higher retransmission count is set; when it does not, it is encoded at a lower rate and a lower retransmission count is set. The retransmission of audio frames without valid speech is thus reduced, or even skipped entirely, which avoids wasting audio transmission bandwidth, reduces power consumption, and lowers latency.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 shows an example of retransmission of voice data;
FIG. 2 illustrates an exemplary application scenario of the speech data processing method of the present application;
FIG. 3 illustrates an embodiment of a speech data processing method of the present application;
FIG. 4 illustrates an example of a speech data processing method of the present application;
FIG. 5 illustrates an embodiment of a speech data processing system of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments derived by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of steps or elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
In the latest Bluetooth Low Energy Audio (LE Audio) specification, the Connected Isochronous Stream (CIS), a point-to-point isochronous transmission link, is introduced to achieve low-latency audio transmission. When a CIS link is established, the corresponding Quality of Service (QoS) parameters are configured. These include the Flush Timeout (FT), the isochronous interval (ISO Interval), the maximum Number of Sub-Events per interval (NSE), and the number of packets allowed per interval (Burst Number, BN). Together these parameters determine the maximum number of retransmissions of each packet, and they cannot be modified during transmission.
In Bluetooth audio transmission, the wireless environment is complex and variable. For example, during a call or while listening to music in a public place, interference degrades the CIS link quality and Bluetooth transmissions fail; lost packets must then be retransmitted. Fig. 1 shows an example of voice data retransmission. As shown in Fig. 1: when the Bluetooth receiving end detects that a packet is correct, it returns an ACK signal to the transmitting end to indicate correct reception; when it detects a packet error, it returns a NACK to indicate that no correct packet was received. After receiving an ACK, the transmitting end sends the next packet according to the configured schedule; after receiving a NACK, it retransmits the failed packet, the number of retransmissions depending on the system design. If the transmitting end receives neither an ACK nor a NACK within the specified time (a timeout), it also retransmits the audio packet.
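The ACK/NACK behaviour above can be sketched as a send loop. The callback and its return strings are illustrative stand-ins; in a real controller this is signalled at the link layer, and a `None` reply models the timeout case.

```python
from typing import Callable, Optional

def transmit_packet(send: Callable[[], Optional[str]],
                    max_retransmissions: int) -> bool:
    """Send one packet; retry on NACK or on timeout (send() returning None)."""
    for _attempt in range(max_retransmissions + 1):
        if send() == "ACK":
            return True        # receiver confirmed correct reception
        # NACK or timeout: fall through and retransmit
    return False               # retries exhausted: packet is discarded
```

When the loop returns `False`, the packet is dropped, which is exactly the situation that causes the stuttering discussed below.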
When a data packet has reached its maximum number of retransmissions without being correctly received, the packet (including any audio it carries) is discarded, which causes audible stuttering at the receiving end.
Although the prior art effectively reduces the probability of subsequent stuttering, applying it to voice communication has drawbacks: in a voice call, either party speaks only about 35% of the time and is silent the rest, so a lost packet sometimes contains no valid speech, and retransmitting it has little effect on voice quality; yet retransmission still occurs, wasting bandwidth while contributing little to quality. In addition, data packets without valid speech are still repeatedly retransmitted, wasting valuable bandwidth, increasing power consumption, and adding delay.
The application provides a voice data processing method, system, and medium that address these problems: in the prior method, audio frames are not distinguished, frames without valid speech are still retransmitted, audio transmission bandwidth is wasted, and power consumption and delay increase. First, the audio frame currently being encoded is analyzed, encoded at a suitably selected coding rate, and a corresponding retransmission count is set. If the frame contains valid speech, a higher coding rate and a higher retransmission count are set; if it does not (corresponding to the silent state during a phone call), it is encoded at a lower rate with a lower retransmission count. Retransmission of invalid audio frames is thereby reduced, bandwidth waste during audio transmission is avoided, power consumption drops, and delay is reduced.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 shows a typical application scenario of the speech data processing method of the present application.
As shown in Fig. 2, the mobile phone and Bluetooth headset at the near end are connected wirelessly via Bluetooth, as are those at the far end. The near-end and far-end phones are connected through a 2G/3G/4G/5G wireless network and/or the public switched telephone network (PSTN). The description below takes the near end as its main example; the same idea applies to the far end. The near-end phone acts as the Bluetooth transmitting end, encoding the voice signal and transmitting the encoding result; the far-end phone acts as the Bluetooth receiving end, receiving and decoding the encoding result sent by the near-end phone.
Fig. 3 shows an embodiment of the speech data processing method of the present application.
In the embodiment shown in Fig. 3, the voice data processing method of the present application includes: step S301, analyzing the current audio frame to be encoded and determining its coding rate, where the coding rate is one of a full rate, a medium rate, and a low rate, the full rate being greater than the medium rate and the medium rate greater than the low rate.
In this embodiment, situations such as silence during a voice call mean that the current audio frame may carry different amounts of data. The current audio frame is therefore analyzed and a suitable coding rate is selected for subsequent encoding, so that voice quality requirements are met without wasting bit rate. When the frame contains more audio data, a high rate is used to preserve sound quality; when it contains less, a low rate is used to avoid waste. The coding rate is divided into a full rate, a medium rate, and a low rate.
Specifically, the full rate corresponds to the reference rate of the audio encoding process and is mainly used for frames that contain audio data, for example frames containing valid speech. The specific value of the medium rate can be set according to the type and model of the audio encoder, for example half of the full rate; the medium rate corresponds to frames containing less audio data. The low rate is likewise set according to the encoder, for example one third of the full rate, and corresponds to frames, such as silent frames, that carry almost no audio data. By analyzing the current audio frame and encoding it at the corresponding rate, bit-rate waste is avoided while sound quality is preserved.
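Using the example fractions above, and assuming a hypothetical 64 kbit/s reference rate purely for illustration, the three rates relate as follows:

```python
FULL_RATE_BPS = 64000                  # assumed reference rate, for illustration
MEDIUM_RATE_BPS = FULL_RATE_BPS // 2   # "half of the full rate"
LOW_RATE_BPS = FULL_RATE_BPS // 3      # "one third of the full rate"
```

Any actual values would be chosen per the encoder's type and model, as the text notes.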
Optionally, analyzing the current audio frame to determine its coding rate includes: detecting valid speech in the current audio frame; if the valid speech contained in the frame meets a preset requirement, setting the coding rate to the full rate; if it does not, detecting the effective bandwidth of the frame; if the effective bandwidth is within a preset bandwidth range, setting the coding rate to the medium rate; otherwise, setting it to the low rate.
In this optional embodiment, when determining the coding rate of the current audio frame, valid speech in the frame is detected. If the valid speech meets the preset requirement, the corresponding application scenario is that the parties in a call are talking, and the coding rate is set to the full rate to guarantee call quality. If the valid speech does not meet the preset requirement, the frame contains only a small amount of audio data; to avoid wasting bit rate, the coding rate is set to the medium rate when the effective bandwidth lies within the preset range, and to the low rate otherwise. The low-rate case corresponds to the scenario in which neither party is speaking, i.e. the silent state of a voice call. Setting the coding rate according to the condition of each frame avoids bit-rate waste while maintaining sound quality.
Specifically, bandwidth judgment mainly covers two situations. The first is detection of full bandwidth: by the valid-speech judgment, a full-bandwidth frame with valid speech is encoded at the full rate, while a full-bandwidth frame corresponding to silence is encoded at the low rate. The second is detection of non-full bandwidth, e.g. 1/3 or 1/4 of the full bandwidth, which indicates strong energy and usually the presence of speech. Although the effective bandwidth is narrow, speech concentrates its main energy in the low-frequency part, so the frame must still be encoded at the medium rate to preserve sound quality. For example, at a 48 kHz sampling rate the full bandwidth is 24 kHz; if the effective bandwidth is 8 kHz, the narrow bandwidth carries less information and the medium rate is used. The preset bandwidth range for the non-full-bandwidth case can be chosen according to the audio bandwidth characteristics and the required sound quality, for example medium-rate encoding for effective bandwidths from 1/6 to 1/2 of the full bandwidth; the specific range depends on the actual situation and is not limited in this application.
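The worked numbers above (24 kHz full bandwidth at 48 kHz sampling, an 8 kHz effective bandwidth falling in the 1/6 to 1/2 example window) can be checked with a small classifier; the function name and labels are illustrative.

```python
def bandwidth_class(effective_bw_hz: float,
                    sample_rate_hz: float = 48000.0) -> str:
    """Classify effective bandwidth against the full (Nyquist) bandwidth,
    using the example 1/6..1/2 window from the text for the medium rate."""
    full_bw_hz = sample_rate_hz / 2.0        # 24 kHz at 48 kHz sampling
    ratio = effective_bw_hz / full_bw_hz
    if ratio >= 1.0:
        return "full"
    if 1.0 / 6.0 <= ratio <= 0.5:
        return "non_full_medium"             # e.g. an 8 kHz effective bandwidth
    return "outside_range"
```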
Optionally, detecting the valid speech contained in the current audio frame includes one or more of: detecting the pitch-present flag of the current audio frame, where if the flag equals a preset value the valid speech in the frame meets the preset requirement; detecting the first pitch lag of the current audio frame, where if the difference between the first pitch lag and the second pitch lag of the previous audio frame exceeds a preset threshold the valid speech meets the preset requirement; and/or detecting the speech-band energy entropy of the current audio frame, where if the energy entropy is smaller than a first preset entropy threshold the valid speech meets the preset requirement.
In this alternative embodiment, several judgment methods may be used to detect whether the current audio frame contains valid speech. The pitch-present flag Pitch_present is obtained during encoding of the current frame; if the flag equals the preset value, i.e. Pitch_present is 1, a pitch exists in the frame and the preset requirement for valid speech is met, corresponding to the scenario in which a user is speaking during the call. Alternatively, the first pitch lag Pitch_curr of the current frame is detected and compared with the second pitch lag Pitch_last of the previous frame; if their difference exceeds the preset threshold, the pitch lag is changing quickly and the preset requirement for valid speech is met, again corresponding to a user speaking during the call, so the current frame contains valid speech.
Specifically, the preset threshold on the difference between the first and second pitch lag may be set between 10 and 20; preferably, the threshold is 15. The threshold can be chosen according to the type and model of the actual audio encoder, and its specific value is not limited in this application.
Note that the pitch lag described above corresponds to the pitch-lag index in the audio codec. For example, the LC3 codec computes the pitch-lag index at a 12.8 kHz sampling rate, giving an index range of 32 to 228. Converted to actual time, this corresponds to 2.5 ms to 17.8125 ms; for example, (32 / 12800) × 1000 = 2.5 ms.
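The index-to-time conversion above is a one-liner; the helper name is ours.

```python
def pitch_lag_index_to_ms(index: int, sample_rate_hz: float = 12800.0) -> float:
    """Convert a pitch-lag index (samples at 12.8 kHz) to milliseconds."""
    return index * 1000.0 / sample_rate_hz
```

The endpoints of the stated range check out: index 32 gives 2.5 ms and index 228 gives 17.8125 ms.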
In this optional embodiment, whether the current audio frame contains valid speech may also be judged from the speech-band energy entropy of the frame. The energy entropy of the speech band of the current frame is computed and compared with the first preset entropy threshold; when the entropy is smaller than that threshold, the valid speech in the frame meets the preset requirement. The band energy entropy can be computed by existing methods and is not described further in this application; preferably, the first preset entropy threshold is set to 0.6.
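The text leaves the entropy computation to existing methods; one common formulation, shown here as an assumption rather than the patent's definition, normalizes the Shannon entropy of the per-band energy distribution so that a flat spectrum scores 1 and a single-band spectrum scores 0. This matches the direction of the threshold test above: voiced speech concentrates energy and yields low values.

```python
import math

def band_energy_entropy(band_energies: list) -> float:
    """Normalized energy entropy over frequency bands, in [0, 1]. Values near 0
    mean energy concentrated in few bands, as is typical of voiced speech."""
    total = sum(band_energies)
    n = len(band_energies)
    if total <= 0.0 or n < 2:
        return 1.0                      # degenerate input: treat as flat
    probs = [e / total for e in band_energies]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)              # divide by the maximum possible entropy
```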
Specifically, whether the current audio frame contains valid speech may also be determined as follows. Acquire the attack flag from the encoding process of the current audio frame; if the flag equals the preset value 1, the energy of the current audio frame has changed abruptly, corresponding to the start of speech in an actual conversation. A prediction gain Pred_gain from the encoding process may also be used: if Pred_gain exceeds a preset threshold, the current audio frame contains valid speech, where a usable range for the Pred_gain threshold is 1.4 to 1.6. The specific threshold may be chosen reasonably according to the type and model of the actual audio encoder and the encoding requirements, and is not specifically limited in this application.
It should be noted that, among the above methods for determining valid speech in the current audio frame, a single method may be used in a specific operation, or several methods may be applied simultaneously to improve the accuracy of the determination.
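Combined, the indicators above can be sketched as a single predicate. Every parameter name is illustrative rather than a real codec API; the thresholds 15, 0.6 and 1.5 follow the values quoted in the text (1.5 sits inside the 1.4 to 1.6 Pred_gain range), and the frame counts as valid speech if any indicator fires:

```python
# Sketch of the valid-speech decision described above: the frame counts as
# valid speech if any indicator fires. All names are illustrative; the
# thresholds 15, 0.6 and 1.5 come from the values quoted in the text.

def contains_valid_speech(pitch_present, pitch_curr, pitch_last,
                          band_entropy, attack_flag, pred_gain,
                          pitch_diff_threshold=15,
                          entropy_threshold=0.6,
                          pred_gain_threshold=1.5):
    if pitch_present == 1:                                   # pitch presence flag set
        return True
    if abs(pitch_curr - pitch_last) > pitch_diff_threshold:  # fast pitch change
        return True
    if band_entropy < entropy_threshold:                     # low entropy: voiced frame
        return True
    if attack_flag == 1:                                     # sudden energy jump
        return True
    if pred_gain > pred_gain_threshold:                      # high prediction gain
        return True
    return False
```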
Optionally, the voice data processing method of the present application further includes: if the valid speech contained in the current audio frame does not meet the preset requirement, detecting the speech-band energy entropy of the current audio frame, and if that entropy lies within a second preset energy-entropy range, setting the coding rate to the medium rate.
In this optional embodiment, the full rate is set once the valid speech in the current audio frame is judged to meet the preset requirement. If it does not, the method checks whether the speech-band energy entropy of the current audio frame lies within the preset energy-entropy range, and if so sets the coding rate to the medium rate.
Specifically, the lower limit of the preset energy-entropy range is the first preset energy-entropy threshold, with a value of 0.6, and the upper limit has a value of 1.0; if the speech-band energy entropy of the current audio frame lies between 0.6 and 1.0, the coding rate is set to the medium rate. The range may be chosen reasonably according to the type and model of the actual audio encoder, the required audio quality, and so on, and is not specifically limited here.
In an actual application scenario, when the speech-band energy entropy of the audio is greater than 1.0, the audio is usually silence; when the entropy lies between 0.6 and 1.0, the audio is most likely unvoiced, and according to speech coding theory the medium rate is sufficient for the required sound quality; when the entropy is below 0.6, the audio is usually voiced, and full-rate coding is required to meet the sound-quality requirement.
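The three entropy bands above map directly onto the three coding rates. A minimal sketch, assuming the 0.6 and 1.0 boundaries quoted in the text (the function and constant names are ours):

```python
# Minimal sketch of the entropy-to-rate mapping described above,
# using the 0.6 and 1.0 boundaries quoted in the text.

FULL, MEDIUM, LOW = "full", "medium", "low"

def rate_from_band_entropy(entropy, voiced_limit=0.6, silence_limit=1.0):
    if entropy < voiced_limit:       # voiced: needs full-rate coding
        return FULL
    if entropy <= silence_limit:     # unvoiced: medium rate suffices
        return MEDIUM
    return LOW                       # above 1.0: usually silence
```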
In the specific embodiment shown in fig. 3, the voice data processing method of the present application includes a process S302 of setting retransmission times according to a coding rate, where the coding rate is positively correlated with the retransmission times.
In this embodiment, after the coding rate of the current audio frame is determined from the frame's characteristics, the audio codecs negotiate with each other and determine the retransmission times according to a predetermined rule, where the coding rate is positively correlated with the retransmission times: the higher the coding rate, the more retransmissions, i.e. the retransmission times for the full rate are greater than or equal to those for the medium rate, which in turn are greater than or equal to those for the low rate.
Specifically, in the speech data processing method, a current audio frame coded at the full rate contains valid speech that meets the preset requirement, i.e. a large amount of valid speech; losing it in transmission would harm voice and sound quality and cause delay and stuttering. A larger retransmission count is therefore set for such a frame, so that when transmission fails the frame can be retransmitted several times, preventing the loss of valid speech data from degrading call and sound quality. When the current audio frame is coded at the medium rate, the valid speech it contains does not meet the preset requirement, i.e. there is less of it, but losing the frame data would still affect the sound quality, so a smaller retransmission count is set. When the current audio frame is coded at the low rate, it contains almost no valid speech; in the corresponding application scenario both parties of the call are silent, and the frame is effectively useless data, so the retransmission count is set very low, or retransmission is skipped entirely. On one hand, even if such a frame is lost, the integrity of the valid audio data is unaffected and the sound-quality requirement in silence is low; on the other hand, a low retransmission count, or no retransmission at all, further avoids wasting audio-transmission bandwidth, reduces power consumption, and reduces delay.
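The positive correlation described above can be captured in a small lookup table. The counts 4 and 2 are the example values given later in the text for the first and second retransmission-times requirements; 0 for the low rate reflects the "even no retransmission" option and is our assumption:

```python
# Illustrative rate-to-retransmission mapping. The counts 4 and 2 are the
# example first/second retransmission-times values given later in the text;
# 0 for the low rate reflects the "even no retransmission" option and is
# our assumption, not a value fixed by the text.

RETRANSMISSION_TIMES = {"full": 4, "medium": 2, "low": 0}

def retransmission_times(coding_rate):
    return RETRANSMISSION_TIMES[coding_rate]
```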
Optionally, the voice data processing method of the present application further includes: the audio receiving end decodes the received coding result and analyzes the decoding result, wherein if the decoding result has errors and the coding rate corresponding to the coding result is a full code rate, a retransmission request is sent to the audio transmitting end; if the decoding result has errors and the coding rate corresponding to the coding result is a medium code rate, sending a retransmission request to an audio transmitting end; and if the decoding result has errors and the coding code rate corresponding to the coding result is a low code rate, starting the PLC service to hide the errors.
In this alternative embodiment, different retransmission operations are performed after a decoding error, depending on the received encoding result. When a result coded at the full rate is in error, the error is discovered when the modem of the audio receiver decodes the current code stream. A full-rate result contains much valid speech, so to avoid the loss of that speech degrading the sound quality and other aspects of the call, the audio receiving end sends a retransmission request to the audio transmitting end, which retransmits the encoding result of the current audio frame. When a result coded at the medium rate is in error, the audio receiving end likewise sends a retransmission request and the transmitting end retransmits the encoding result of the current frame. When a result coded at the low rate is in error, it contains almost no valid speech, corresponding to both parties of the call being silent in the application scenario; even if the frame is lost, call quality is unaffected, and retransmitting it repeatedly would waste audio-transmission bandwidth. Therefore, when the decoding result is in error and the corresponding coding rate is the low rate, the PLC service is started to conceal the error and no retransmission is performed.
Optionally, if there is an error in the decoding result and the coding rate corresponding to the coding result is a full code rate, sending a retransmission request to the audio transmitting end, further comprising: judging the coding rate of an audio receiving end; and if the coding code rate of the audio receiving end is the full code rate, starting the PLC service and hiding the error.
In this alternative embodiment, the encoding result of the current audio frame is decoded and the decoding result analyzed. When an error occurs in the decoding result of a current audio frame coded at the full rate, the coding rate of the audio receiving end is checked. If the receiving end is itself coding at the full rate, it is currently encoding valid speech; in the corresponding call scenario both parties are speaking at once, for example talking over each other. In that case, for the erroneous current audio frame, the audio receiver no longer sends a retransmission request but directly starts the PLC service for error concealment. Although the audio frame containing valid speech is not retransmitted, the local speaker will hardly notice the loss of the other party's voice while speaking; the lost frame is therefore concealed with PLC rather than requested again. This reduces the computation of the audio transmitting end, and hence its power consumption, without affecting the call.
Optionally, if the coding rate of the audio receiving end is the full rate, starting the PLC service to conceal the error further includes: judging whether the previous audio frame has an error; if the previous audio frame has an error, the audio receiving end sends a retransmission request; and if the previous audio frame has no error, the audio receiving end starts the PLC service to conceal the error.
In this optional embodiment, when the coding rate of the audio receiving end is the full rate and the PLC service is about to be started, the method further checks whether the previous audio frame was in error. If it was, the audio receiving end sends a retransmission request and does not perform PLC error concealment; if it was not, the receiving end starts the PLC service to conceal the error.
Specifically, a PLC algorithm usually compensates the current frame on the basis of correct historical frames; if a historical frame is itself in error, the PLC generally starts ramping down and the sound quality drops rapidly. To avoid the error expansion caused by continuous PLC, when the decoding result of the previous audio frame is found to be in error, PLC error concealment is not performed; instead, the audio receiving end sends a retransmission request so that the audio frame is retransmitted, thereby preserving the sound quality.
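Taken together, the receiver-side rules in this and the preceding paragraphs can be sketched as one decision function; the names and string labels are ours, not part of any codec API:

```python
# Sketch of the receiver-side decision described above. Returns what the
# audio receiving end does with the current frame: acknowledge it, request
# a retransmission, or conceal the error with the PLC service. All names
# and string labels are illustrative.

def receiver_action(decode_error, frame_rate, local_encode_rate,
                    prev_frame_error):
    if not decode_error:
        return "ok"
    if frame_rate == "low":
        return "plc"              # almost no speech: conceal, never retransmit
    if frame_rate == "medium":
        return "retransmit"
    # Full-rate frame in error:
    if local_encode_rate == "full":
        # Both sides are speaking. Conceal with PLC, unless the previous
        # frame was also bad, in which case retransmit to stop PLC error
        # propagation.
        return "retransmit" if prev_frame_error else "plc"
    return "retransmit"
```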
In the embodiment shown in fig. 3, the speech data processing method of the present application further includes a process S303, where the current audio frame is encoded according to the encoding rate, and the encoding result is transmitted.
In this embodiment, the audio transmitting end encodes the current audio frame at the determined coding rate and transmits it. If a retransmission request from the Bluetooth receiving end is received, the current audio frame needs to be reassembled and retransmitted.
Optionally, encoding the current audio frame according to the encoding rate, and transmitting the encoding result, including: if the audio transmitting end receives the retransmission request, the coding rate corresponding to the current audio frame is judged; if the coding rate corresponding to the current audio frame is the full code rate, reducing the full code rate, recombining the frame data corresponding to the coding result according to the reduced updated coding rate and retransmitting the recombined result until an audio receiving end successfully receives the recombined result or the retransmission times meet the first retransmission time requirement; and if the coding rate corresponding to the current audio frame is the middle code rate, retransmitting the frame data corresponding to the coding result until the audio receiving end successfully receives the retransmission result or the retransmission times meet the requirement of the second retransmission times.
In this alternative embodiment, after the audio transmitting end encodes the current audio frame, the encoding result is sent to the audio receiving end, where it is decoded and the decoding result analyzed. If the decoding result is in error and the coding rate of the result is the full rate, a retransmission request is sent to the transmitting end; the transmitting end reduces the full rate, reassembles the frame data of the encoding result at the reduced, updated coding rate, and retransmits the reassembled result until the receiving end receives it successfully or the retransmission times meet the first retransmission-times requirement. If the decoding result is in error and the coding rate of the result is the medium rate, a retransmission request is likewise sent to the transmitting end, which retransmits the frame data of the encoding result until the receiving end receives it successfully or the retransmission times meet the second retransmission-times requirement.
Specifically, the first retransmission number is greater than the second retransmission number: more retransmissions are set for a current audio frame containing more valid speech, so that when the decoding result of the frame is in error, the additional retransmissions prevent the valid speech from being lost. The first and second retransmission numbers may be chosen reasonably according to the actual device models of the audio transmitting and receiving ends and the requirements on audio quality and bandwidth during transmission; for example, the first retransmission number may be set to 4 and the second to 2.
Specifically, after the audio transmitting end has sent the current audio frame coded at the full rate and then receives a retransmission request, it reduces the current full rate, for example to a certain fraction of it such as 1/2. The encoding result of the current audio frame is reassembled at the updated coding rate, and the reassembled result is sent. Because the coding rate at retransmission is lower than at the first encoding, the success rate of the audio transmission improves and stuttering is avoided. When the audio frame is transmitted successfully, or the retransmission times meet the first retransmission-times requirement, processing of the current audio frame ends and encoding of the next audio frame begins.
In this optional embodiment, if the audio transmitting end receives a retransmission request from the audio receiving end and the current audio frame was coded at the medium rate, which is already lower than the full rate, the stored medium-rate encoding result is retransmitted directly until the audio receiving end successfully receives the retransmission result or the retransmission times meet the second retransmission-times requirement.
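The sender-side handling of a retransmission request, covering both the full-rate and medium-rate cases above, might look like the following; the `send` and `reassemble` callables, the limits 4 and 2, and all names are illustrative:

```python
# Sketch of the sender side after a retransmission request. `send` returns
# True on success; `reassemble` recombines a full-rate payload at the
# reduced rate (e.g. 1/2 of full rate). The limits mirror the example
# first/second retransmission-times values (4 and 2); all names are ours.

def handle_retransmission_request(rate, payload, send, reassemble,
                                  first_limit=4, second_limit=2):
    if rate == "full":
        data, limit = reassemble(payload), first_limit   # lower-rate copy
    elif rate == "medium":
        data, limit = payload, second_limit              # stored result as-is
    else:
        return False                                     # low rate: never retransmitted
    for _ in range(limit):
        if send(data):
            return True                                  # receiver got it
    return False                                         # give up after the limit
```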
Optionally, the process of recombining comprises: and analyzing the frame data corresponding to the coding result obtained by the full-code-rate coding, and writing the frame data into the code stream corresponding to the updated coding rate according to a preset sequence to obtain a recombination result.
In this optional embodiment, if analysis at the audio receiving end shows that the decoding result is in error, a retransmission operation is required. After the audio transmitting end receives the retransmission request, it reassembles the encoding result of the current audio frame: the frame data corresponding to the result of full-rate coding is parsed and written, in order, into the code stream corresponding to the reduced, updated coding rate to obtain the reassembled result, which is then retransmitted.
Specifically, after the retransmission request is received, the updated coding rate may be obtained according to a preselected change rule, for example 1/2 of the full rate. The result of full-rate coding is parsed and written into the code stream corresponding to the updated coding rate in the following order: side information, arithmetically coded TNS information, arithmetically coded spectrum information, and then residual information or LSB (Least Significant Bit) information; the remaining data is discarded once the available bits are used up, finally yielding the reassembled result.
In a specific example, if the coding rate is updated to 1/2 of the full rate, the side information and the arithmetically coded TNS information in the encoding result are written into the code stream corresponding to 1/2 of the full rate. Because of the limited code-stream size, only the part of the arithmetically coded spectrum information carrying the important low- and mid-frequency content is written, part of the mid- and high-frequency information is discarded, and the residual information and LSB information are discarded as well. The LSB information is the least significant bit of the quantized spectral coefficients; it contributes little to overall sound quality, so discarding it reduces the code rate at little cost in quality. Although discarding part of the information during reassembly lowers the sound quality somewhat, the result is still sufficient for an actual call and sounds more natural than PLC output.
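The priority-ordered repacking described above can be sketched as filling a smaller bit budget section by section. The section names and the byte-level granularity are simplifications of the real arithmetic-coded bitstream:

```python
# Sketch of the reassembly described above: write the sections of the
# full-rate result into a smaller budget in priority order (side info,
# TNS, spectrum from low to high frequency, then residual/LSB), and drop
# whatever no longer fits. Byte-level truncation is a simplification of
# the real arithmetic-coded bitstream.

def reassemble(sections, target_bytes):
    out = bytearray()
    for name, payload in sections:
        room = target_bytes - len(out)
        if room <= 0:
            break                    # budget used up: discard the rest
        out += payload[:room]        # truncate the section crossing the budget
    return bytes(out)

# Halving the budget keeps side info and TNS intact, truncates the spectrum
# to its leading (low/mid-frequency) part, and drops the residual/LSB data.
```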
The voice data processing method of the present application examines the audio frame to be coded, selects a coding rate matching the frame's characteristics, and sets a corresponding retransmission count. When the valid speech in the audio frame meets the preset requirement, i.e. the frame contains much valid speech, a higher coding rate is used and a larger retransmission count is set; when it does not, i.e. the frame contains little valid speech, a lower coding rate is used and a smaller retransmission count is set. Retransmissions are thus reduced, or skipped entirely, for audio frames without valid speech, which avoids wasting audio-transmission bandwidth, reduces power consumption, and reduces delay.
Fig. 4 shows an example of the speech data processing method of the present application.
As shown in fig. 4, during a call a Bluetooth device acts as both transmitter and receiver; for example, a Bluetooth headset acts as a transmitter to acquire, encode and transmit the human voice to a mobile phone, and as a receiver to receive and decode the voice signal from the phone and play it back through its speaker.
Firstly, initializing, negotiating parameters, and determining the type of a coder/decoder, the sampling rate, the code rate, the size of a transmission packet and the retransmission times through parameter negotiation between Bluetooth devices; the bluetooth device here may be a mobile phone and a bluetooth headset. And determining the corresponding relation between the coding code rate and the retransmission times according to a preset rule.
In the transmitting and processing part, when an audio frame is encoded, the Bluetooth transmitting module is responsible for judging whether the current audio frame contains valid speech, by acquiring various parameters from the encoding process such as the pitch presence flag, and for determining which coding rate to use for the frame. When a retransmission request is received, the audio frame is reassembled according to the coding rate used at the first encoding, as described above and not repeated here.
In the receiving and processing section, at the audio receiving end, the decoding result of the received encoded frame data is analyzed, and a retransmission request is sent or a PLC service is started for error concealment according to whether the decoding result data has an error and the corresponding encoding code rate, or the data is successfully received and directly processed for the next audio frame, where the specific process is described above and is not repeated here.
FIG. 5 illustrates an embodiment of a speech data processing system of the present application.
In the embodiment shown in fig. 5, the speech data processing system of the present application includes: a code rate determining module 501, configured to analyze a current audio frame and determine a coding code rate corresponding to the current audio frame, where the coding code rate includes a full code rate, a medium code rate, and a low code rate, where the full code rate is greater than the medium code rate, and the medium code rate is greater than the low code rate; a retransmission number determining module 502, configured to set a retransmission number according to a coding rate, where the coding rate is positively correlated with the retransmission number; and an encoding and transmitting module 503, which encodes the current audio frame according to the encoding rate and transmits the encoding result.
In a particular embodiment of the present application, a computer-readable storage medium stores computer instructions, wherein the computer instructions are operable to perform the voice data processing method described in any one of the embodiments. Wherein the storage medium may be directly in hardware, in a software module executed by a processor, or in a combination of the two.
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
The Processor may be a Central Processing Unit (CPU), other general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other Programmable logic devices, discrete Gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one embodiment of the present application, a computer device includes a processor and a memory, the memory storing computer instructions, wherein: the processor operates the computer instructions to perform the voice data processing method described in any of the embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are merely examples, which are not intended to limit the scope of the present disclosure, and all equivalent structural changes made by using the contents of the specification and the drawings, or any other related technical fields, are also included in the scope of the present disclosure.

Claims (9)

1. A method for processing voice data, comprising:
analyzing a current audio frame to be coded, and determining a coding rate corresponding to the current audio frame, wherein the coding rate comprises a full code rate, a medium code rate and a low code rate, the full code rate is greater than the medium code rate, and the medium code rate is greater than the low code rate;
setting retransmission times according to the coding rate, wherein the coding rate is positively correlated with the retransmission times;
coding the current audio frame according to the coding rate, and transmitting a coding result;
and the audio receiving end decodes the received coding result and analyzes the decoding result, wherein,
if the decoding result has errors and the coding rate corresponding to the coding result is the full code rate, sending a retransmission request to an audio transmitting terminal, wherein the retransmission request comprises:
judging the coding rate corresponding to the audio receiving end;
and if the coding code rate corresponding to the audio receiving end is the full code rate, starting a PLC service to hide the error.
2. The method of claim 1, wherein the analyzing the current audio frame to be encoded to determine the coding rate corresponding to the current audio frame comprises:
detecting valid speech in the current audio frame;
if the effective voice contained in the current audio frame meets the preset requirement, setting the coding code rate as the full code rate;
if the effective voice contained in the current audio frame does not meet the preset requirement, detecting the effective bandwidth corresponding to the current audio frame;
if the effective bandwidth is in a preset bandwidth range, setting the coding code rate as the medium code rate;
otherwise, setting the coding rate as the low code rate.
3. The method of claim 2, wherein the detecting the valid speech in the current audio frame comprises:
detecting a fundamental tone existence flag corresponding to the current audio frame, wherein if the fundamental tone existence flag is a preset value, effective voice contained in the current audio frame meets the preset requirement;
detecting a first pitch delay corresponding to the current audio frame, wherein if a difference value between the first pitch delay and a second pitch delay corresponding to a previous audio frame is greater than a preset threshold, effective voice contained in the current audio frame meets the preset requirement; and/or
And detecting the energy entropy of the voice frequency band corresponding to the current audio frame, wherein if the energy entropy of the voice frequency band is smaller than a first preset energy entropy threshold, the effective voice contained in the current audio frame meets the preset requirement.
4. The voice data processing method according to claim 2, further comprising:
if the effective voice contained in the current audio frame does not meet the preset requirement, detecting the voice frequency band energy entropy corresponding to the current audio frame, and if the voice frequency band energy entropy is smaller than a second preset energy entropy threshold, setting the coding code rate as the medium code rate.
5. The speech data processing method according to claim 1, wherein the audio receiving end decodes the received encoding result and analyzes the decoding result, and further comprising:
if the decoding result has errors and the coding rate corresponding to the coding result is the medium code rate, sending a retransmission request to the audio transmitting terminal;
and if the decoding result has errors and the coding rate corresponding to the coding result is the low coding rate, starting the PLC service to hide the errors.
6. The method of claim 1, wherein the encoding the current audio frame according to the encoding rate and transmitting the encoding result comprises:
if the audio transmitting end receives a retransmission request, the coding rate corresponding to the current audio frame is judged;
if the coding rate corresponding to the current audio frame is the full code rate, reducing the full code rate, recombining the frame data corresponding to the coding result according to the reduced updated coding rate, and retransmitting the recombination result until an audio receiving end successfully receives the recombination result or the retransmission times meet the requirement of first retransmission times;
and if the coding rate corresponding to the current audio frame is the medium coding rate, retransmitting the frame data corresponding to the coding result until the audio receiving end successfully receives the coding result or the retransmission times meet the requirement of second retransmission times.
7. The method according to claim 6, wherein the recombining comprises:
and analyzing the frame data corresponding to the coding result obtained by the full-code-rate coding, and writing the frame data into the code stream corresponding to the updated coding rate according to a preset sequence to obtain the recombination result.
8. The method of claim 1, wherein if the coding rate corresponding to the audio receiving end is the full code rate, starting a PLC service to conceal the error further comprises:
judging whether the decoding result corresponding to the previous audio frame contains an error;
if the decoding result corresponding to the previous audio frame contains an error, sending, by the audio receiving end, a retransmission request;
and if the decoding result corresponding to the previous audio frame contains no error, starting, by the audio receiving end, the PLC service to conceal the error.
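Claim 8's full-rate decision can be sketched as a single predicate on the previous frame's decoding state. The return-value names are illustrative; the rationale comments reflect common PLC practice rather than text from the patent.

```python
def on_full_rate_decode_error(previous_frame_had_error: bool) -> str:
    """For a full-rate frame that failed to decode, check the previous
    frame before choosing between retransmission and PLC (claim 8)."""
    if previous_frame_had_error:
        # Two consecutive bad frames: concealment quality would degrade,
        # so ask the transmitting end for a resend instead.
        return "send_retransmission_request"
    # Isolated error: PLC can interpolate from the intact neighbours.
    return "start_plc_concealment"
```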
9. A speech data processing system, comprising:
a code rate determining module, configured to analyze a current audio frame to be encoded and determine a coding rate corresponding to the current audio frame, wherein the coding rate is one of a full code rate, a medium code rate and a low code rate, the full code rate being greater than the medium code rate and the medium code rate being greater than the low code rate;
a retransmission count determining module, configured to set a retransmission count according to the coding rate, wherein the coding rate is positively correlated with the retransmission count;
an encoding and transmission module, configured to encode the current audio frame according to the coding rate and transmit the encoding result; and
a decoding and analysis module, configured to decode the received encoding result at the audio receiving end and analyze the decoding result, wherein,
if the decoding result contains an error and the coding rate corresponding to the encoding result is the full code rate, a retransmission request is sent to the audio transmitting end, wherein,
the coding rate corresponding to the audio receiving end is judged;
and if the coding rate corresponding to the audio receiving end is the full code rate, a PLC service is started to conceal the error.
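The retransmission-count module of claim 9 only requires that the count grow with the coding rate. A minimal sketch, with hypothetical counts (the zero for the low rate reflects claim 5, where low-rate errors go straight to PLC with no retransmission):

```python
FULL_RATE, MID_RATE, LOW_RATE = 64_000, 32_000, 16_000  # bits/s (example values)

def retransmission_count(coding_rate: int) -> int:
    """Set the retransmission count from the coding rate (claim 9).
    The concrete counts are illustrative; only the positive
    correlation with the rate is specified by the patent."""
    counts = {FULL_RATE: 3, MID_RATE: 2, LOW_RATE: 0}
    return counts[coding_rate]
```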
CN202110999948.2A 2021-08-30 2021-08-30 Voice data processing method, system and medium Active CN113450809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110999948.2A CN113450809B (en) 2021-08-30 2021-08-30 Voice data processing method, system and medium


Publications (2)

Publication Number Publication Date
CN113450809A CN113450809A (en) 2021-09-28
CN113450809B true CN113450809B (en) 2021-11-30

Family

ID=77819009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110999948.2A Active CN113450809B (en) 2021-08-30 2021-08-30 Voice data processing method, system and medium

Country Status (1)

Country Link
CN (1) CN113450809B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1435817A (en) * 2002-01-29 2003-08-13 富士通株式会社 Voice coding converting method and device
CN103283150A (en) * 2011-05-27 2013-09-04 富士通株式会社 Apparatus and method for joint decoding, method, apparatus and receiver for necessity judgment
CN106533635A (en) * 2015-09-10 2017-03-22 中兴通讯股份有限公司 Data processing method and device
WO2018039179A1 (en) * 2016-08-22 2018-03-01 Intel IP Corporation Enhanced key frame protection on cellular networks
CN112289328A (en) * 2020-10-28 2021-01-29 北京百瑞互联技术有限公司 Method and system for determining audio coding rate
CN112599140A (en) * 2020-12-23 2021-04-02 北京百瑞互联技术有限公司 Method, device and storage medium for optimizing speech coding rate and operand
CN113259710A (en) * 2021-06-22 2021-08-13 北京百瑞互联技术有限公司 Method, system, and medium for improving audio transmission stuck

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732321B2 (en) * 2001-03-27 2004-05-04 Motorola, Inc. Method, apparatus, and article of manufacture for error detection and channel management in a communication system
KR100656982B1 (en) * 2004-12-21 2006-12-13 한국전자통신연구원 Apparatus and method for decoding of portable internet mobile system
JP5164714B2 (en) * 2008-07-24 2013-03-21 キヤノン株式会社 Transmitting apparatus and method, program
JP6301094B2 (en) * 2013-09-26 2018-03-28 株式会社Nttドコモ User terminal and wireless communication method
CN108633014B (en) * 2017-03-22 2021-02-23 华为技术有限公司 Data sending method, terminal equipment and network equipment
CN109672532B (en) * 2019-02-19 2021-04-13 中国电子科技集团公司第三十研究所 Hybrid automatic retransmission method for continuous variable quantum key distribution


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Error resilience for key frames in distributed video coding with rate-distortion optimized mode decision; Hsin-Fang Wu et al.; 2014 IEEE International Symposium on Circuits and Systems (ISCAS); 2014-12-31; full text *
Research on the current status of frame-loss concealment techniques for mobile audio coding; Xiang Kai et al.; 《小型微型计算机***》; 2015-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN113450809A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
JP2518765B2 (en) Speech coding communication system and device thereof
US20090080423A1 (en) Systems and methods for adaptively adjusting codec rates for communication networks
JP3264822B2 (en) Mobile communication equipment
US9456296B2 (en) Control of the transmission of a voice signal over a bluetooth® radio link
US7573907B2 (en) Discontinuous transmission of speech signals
KR100470596B1 (en) A method, communication system, mobile station and network element for transmitting background noise information in data transmission in data frames
US8327211B2 (en) Voice activity detection (VAD) dependent retransmission scheme for wireless communication systems
CN113259710B (en) Method, system, and medium for improving audio transmission stuck
CN110689899A (en) Dynamic adjustment method and system for Bluetooth audio
CN113365129B (en) Bluetooth audio data processing method, transmitter, receiver and transceiving equipment
WO2023197809A1 (en) High-frequency audio signal encoding and decoding method and related apparatuses
US20070129022A1 (en) Method for adjusting mobile communication activity based on voicing quality
JPH10190498A (en) Improved method generating comfortable noise during non-contiguous transmission
JP2011515705A (en) Method and means for encoding background noise information
CN113450809B (en) Voice data processing method, system and medium
US6947887B2 (en) Low speed speech encoding method based on Internet protocol
US20050078615A1 (en) Method and device for duplex communication
JPH09219649A (en) Variable rate encoding system
US20100185441A1 (en) Error Concealment
JP2001142488A (en) Voice recognition communication system
JP3734696B2 (en) Silent compression speech coding / decoding device
US8055980B2 (en) Error processing of user information received by a communication network
JP2002252644A (en) Apparatus and method for communicating voice packet
JPH07283757A (en) Sound data communication equipment
JP5135001B2 (en) Wireless communication apparatus, wireless communication method, and wireless communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: A1009, floor 9, block a, No. 9, Shangdi Third Street, Haidian District, Beijing 100085

Patentee after: Beijing Bairui Internet Technology Co.,Ltd.

Address before: A1009, floor 9, block a, No. 9, Shangdi Third Street, Haidian District, Beijing 100085

Patentee before: BARROT WIRELESS Co.,Ltd.
