WO2019000178A1 - Frame loss compensation method and device - Google Patents

Frame loss compensation method and device

Info

Publication number
WO2019000178A1
WO2019000178A1 (application PCT/CN2017/090035, CN 2017090035 W)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
information
future
historical
current
Prior art date
Application number
PCT/CN2017/090035
Other languages
French (fr)
Chinese (zh)
Inventor
高振东
肖建良
刘泽新
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2017/090035 (published as WO2019000178A1)
Priority to CN201780046044.XA (published as CN109496333A)
Publication of WO2019000178A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 1/00: Arrangements for detecting or preventing errors in the information received

Definitions

  • the present application relates to the field of voice processing technologies, and in particular, to a frame loss compensation method and device.
  • PS packet switching
  • VoIP Internet Protocol based voice
  • the vocoder has a Packet Loss Concealment (PLC) function and can estimate the code stream information of the currently lost frame from the good frame (history frame) information preceding the loss.
  • the code stream information of the currently lost frame includes formant spectrum information, pitch frequency, fractional pitch, adaptive codebook gain, fixed codebook gain, or energy.
  • PLC Packet Loss Concealment
  • the present application provides a frame loss compensation method and device, which can improve the accuracy of frame loss compensation.
  • the embodiment of the present application provides a frame loss compensation method, including:
  • Receiving a voice code stream sequence, and acquiring historical frame information and future frame information in the sequence, where the voice code stream sequence includes frame information of multiple voice frames, the multiple voice frames including at least one history frame, at least one current frame, and at least one future frame; the at least one history frame is located before the at least one current frame in the time domain, and the at least one future frame is located after the at least one current frame in the time domain; the historical frame information is the frame information of the at least one history frame, and the future frame information is the frame information of the at least one future frame. The frame information of the at least one current frame is then estimated according to the historical frame information and the future frame information, thereby improving the accuracy of the frame loss compensation.
  • the type or state of the voice frames in the voice code stream sequence may be determined, including: determining whether there is a good frame before the at least one current frame, whether that previous good frame is a silence frame, whether there is a valid future frame, and so on. For different types or states of the voice frames in the sequence, different compensation measures are taken for the current frame, so that the recovered signal is closer to the original signal and a better frame loss compensation effect is achieved.
  • the voice code stream sequence can be stored in a buffer, such as an AJB buffer. Then, the frame information of the voice code stream sequence in the buffer is decoded to obtain the decoded history frame information and the undecoded future frame information in the buffer.
  • the historical frame information includes formant spectrum information of the historical frame
  • the future frame information includes formant spectrum information of the future frame.
  • the formant spectrum information of the at least one current frame may be determined according to the formant spectrum information of the historical frame and the formant spectrum information of the future frame.
  • the formant spectrum information characterizes the excitation response of the vocal tract during utterance.
  • before determining the formant spectrum information of the at least one current frame from the formant spectrum information of the historical frame and the future frame, the frame state in the voice code stream sequence may be judged, including: how many frames are lost, whether there are future good frames, whether there are good frames before the current frame, and so on. Then, according to the frame state in the voice code stream sequence, different methods are used to calculate the formant spectrum information of the currently lost frame.
  • the historical frame information includes the pitch value of the historical frame
  • the future frame information includes the pitch value of the future frame.
  • the pitch value of at least one current frame may be determined based on the pitch value of the historical frame and the pitch value of the future frame.
  • the pitch value is the pitch frequency of the vocal cord vibration during utterance
  • the pitch period is the reciprocal of the pitch frequency.
  • the frame state of the voice code stream sequence may be determined, including: determining how many frames are lost, whether there are future good frames, whether there are good frames before the current frame, and so on; then, according to the frame state of the voice code stream sequence, different methods are used to calculate the pitch value of the currently lost frame.
  • the spectral tilt of the at least one current frame is determined according to the magnitude of the time domain signal obtained by decoding the historical frame, and the frame type of the at least one current frame is determined according to the spectral tilt of the at least one current frame.
  • For example, the time domain signal is the time domain representation of the decoded frame information.
  • a pitch change state of a plurality of subframes in at least one current frame may be acquired; and a frame type of at least one current frame is determined according to a pitch change state of the plurality of subframes.
  • voiced sound is produced mainly by vocal cord vibration and therefore has a pitch; because the vocal cord vibration changes relatively slowly, the pitch also changes relatively slowly.
  • each sub-frame has a pitch, so the pitch is used to determine the frame type of at least one current frame.
  • the frame type of at least one current frame is determined, and at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame is determined according to the frame type.
  • the current frame includes a pitched speech component and a noise speech component; the adaptive codebook gain is the energy gain of the pitch portion, and the fixed codebook gain is the energy gain of the noise portion.
  • the adaptive codebook gain of the at least one current frame is determined according to the adaptive codebook gain and pitch period of a historical frame and the energy gain of the at least one current frame, and the average of the fixed codebook gains of multiple historical frames is taken as the fixed codebook gain of the at least one current frame.
  • the energy gain of the at least one current frame is determined according to a time domain signal size in the decoded historical frame information and a length of each subframe in the historical frame.
  • the energy gain of the current frame includes the energy gain of the current frame in the voiced sound or the energy gain of the current frame in the unvoiced sound.
  • the embodiment of the present application provides a frame loss compensation apparatus, where the apparatus is configured to implement the method and functions performed by the user equipment in the foregoing first aspect. The apparatus is implemented by hardware and/or software, and the hardware/software includes units corresponding to the foregoing functions.
  • the present application provides a frame loss compensation device, including a vocoder, a memory, and a communication bus, the memory being coupled to the vocoder via the communication bus; the communication bus implements connection and communication between the vocoder and the memory, and the vocoder executes a program stored in the memory to implement the steps of the frame loss compensation method provided in the first aspect.
  • Yet another aspect of the present application provides a computer readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the methods described in the above aspects.
  • Yet another aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the methods described in the various aspects above.
  • FIG. 1 is a schematic structural diagram of a frame loss compensation system according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a sequence of voice sequences provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a frame loss compensation method according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another frame loss compensation method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a frame loss compensation apparatus according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a frame loss compensation apparatus according to an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a frame loss compensation system according to an embodiment of the present application.
  • the system can be applied to PS voice calls (including but not limited to VoLTE, VoWiFi and VoIP) scenarios, and the system can also be applied to Circuit Switched (CS) calls with increased buffering.
  • the system includes a base station and a receiving device; the receiving device may be a device that provides voice and/or data connectivity to the user, i.e., user equipment; it may be connected to a computing device such as a laptop or desktop computer, or it may be a standalone device such as a Personal Digital Assistant (PDA).
  • PDA Personal Digital Assistant
  • a receiving device may also be referred to as a system, subscriber unit, subscriber station, mobile station, remote station, access point, remote terminal, access terminal, user terminal, user agent, or user device.
  • a base station may be an access point, a NodeB, an evolved NodeB (eNB), or a 5G base station, that is, a device in an access network that communicates with wireless terminals over one or more sectors via an air interface.
  • eNB evolved NodeB
  • IP Internet Protocol
  • the base station can act as a signal relay device between the wireless terminal and the rest of the access network, which can include an Internet Protocol (IP) network.
  • the base station can also coordinate the management of the attributes of the air interface.
  • Both the base station and the user equipment in this embodiment may adopt the frame loss compensation method mentioned in the following embodiments, and include a corresponding frame loss compensation device to implement frame loss compensation for the voice signal.
  • when either device receives the voice code stream sequence from the peer device, it can decode the sequence to obtain the decoded frame information, compensate for lost frames, and perform subsequent decoding.
  • FIG. 2 is a schematic diagram of a voice code stream sequence provided by an embodiment of the present application.
  • a history frame: a normal voice signal (that is, a voice frame) received before time T
  • a current frame: at least one frame at time T
  • a future frame: at least one frame after time T
  • the time T is a unit of time or a point in time.
  • the history frame includes an N-1 frame, an N-2 frame, an N-3 frame, an N-4 frame, and the like
  • the current frame is a lost frame, including: an N frame and an N+1 frame
  • the future frame includes an N+2 frame.
  • the lost frames (dropped frames) involved in frame loss compensation in this embodiment may include frames lost in transmission, damaged frames, frames that cannot be received or decoded correctly, or frames that are unavailable for a specific reason.
  • FIG. 3 is a schematic flowchart of a frame loss compensation method according to an embodiment of the present application. As shown in FIG. 3, the method in this embodiment of the present application includes:
  • the sequence of voice code streams includes frame information of multiple voice frames, where the multiple voice frames include at least one history frame, at least one a current frame and at least one future frame, the at least one historical frame being located before the at least one current frame in the time domain, the at least one future frame being located after the at least one current frame in the time domain, the historical frame information being the Frame information of at least one history frame, the future frame information being frame information of the at least one future frame.
  • the at least one current frame may be a frame loss caused by various reasons.
  • the voice code stream sequence may be stored in a buffer in memory, such as an AJB buffer.
  • the frame information of the voice code stream sequence in the buffer is then decoded to obtain the decoded history frame information and the undecoded future frame information in the buffer.
  • the decoded history frame is a voice analog signal
  • the historical frame before decoding is a voice digital signal.
  • the future frame is not decoded, and the frame loss compensation device or system can parse the future frame to obtain partially valid frame information, such as formant spectrum information and pitch value.
  • information or data of a plurality of frames including a history frame, a current frame, and a future frame are buffered in the buffer.
  • the type or state of the voice frames in the voice code stream sequence may be determined, including: determining whether there is a good frame before the frame loss (i.e., a normal frame that can be used for compensation), whether the good frame before the loss is a silence frame, whether there is a valid future frame, and so on. For different types or states of the voice frames in the voice code stream sequence, different compensation measures are taken for the current frame in S302, so that the recovered signal is closer to the original signal and a better frame loss compensation effect is achieved.
  • FIG. 4 is a schematic flowchart of another frame loss compensation method according to an embodiment of the present application. The method includes:
  • S401. Determine whether the current frame is a dropped frame or a bad frame.
  • S402. If the current frame is a good frame, decode the good frame.
  • S403. If the current frame is a bad frame or a dropped frame, determine whether the good frame before the current frame is a Silence Insertion Descriptor (SID).
  • S404. If the good frame before the current frame is a silent frame, the silent frame is directly decoded.
  • S405. If the good frame before the current frame is not a silent frame, determine whether there is a valid future frame after the current frame.
  • S406. If there is no valid future frame after the current frame, the current frame is compensated according to the historical frame information.
  • S407. If there is a valid future frame, the current frame loss is compensated according to the historical frame information and the future frame information. The specific implementation manner of this step is described in detail below.
  • S408. After compensating the frame information of the current frame, decoding the compensated current frame.
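The S401-S408 branching above can be sketched as a small decision function; the helper name and return labels are illustrative, not patent terminology:

```python
def plc_decide(is_bad, prev_good_is_sid, has_valid_future):
    """Mirror the S401-S408 branching for one current frame."""
    if not is_bad:                 # S401/S402: a good frame is decoded directly
        return "decode"
    if prev_good_is_sid:           # S403/S404: previous good frame is a SID (silence) frame
        return "decode_silence"
    if not has_valid_future:       # S405/S406: compensate from history only
        return "compensate_from_history"
    return "compensate_from_history_and_future"   # S407: use history and future frames
```

A frame that survives all three checks reaches S407, the case described in detail below.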
  • Step S407 is described in detail below: the currently lost frame can be compensated by jointly considering the historical frame information and the future frame information.
  • the specific method is as follows:
  • the historical frame information includes formant spectrum information of the historical frame
  • the future frame information includes formant spectrum information of the future frame.
  • the formant spectrum information of the at least one current frame may be determined according to the formant spectrum information of the historical frame and the formant spectrum information of the future frame.
  • the formant spectrum information characterizes the excitation response of the vocal tract during utterance and includes the Immittance Spectral Frequency (ISF).
  • ISF Immittance Spectral Frequency
  • An ISF vector is used to represent the formant spectrum information.
  • Suppose frames N-2 and N-1 in the voice code stream sequence are good frames, frames N and N+1 are lost, and future frames N+2 and N+3 exist. A first-order polynomial fit is used to calculate the formant spectrum information of frame N:
  • ISF_i(N-1) = a + b × (N-1)
  • ISF_i(N+2) = a + b × (N+2)
  • The formant spectrum information of frames N-1 and N+2 is represented by the formant spectrum information of multiple points; the formant spectrum information is processed by a filter, and each point represents a coefficient pair of the filter.
  • ISF_i(N-1) is the formant spectrum information corresponding to the i-th point of frame N-1, and ISF_i(N+2) is the formant spectrum information corresponding to the i-th point of frame N+2. Solving these two equations for a and b and substituting gives the formant spectrum information of the lost frame:
  • ISF_i(N) = a + b × N
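The two-point first-order fit above can be sketched per ISF dimension; the function name and calling convention are hypothetical:

```python
def interpolate_isf(isf_prev, isf_future, n_prev, n_future, n_lost):
    """Per ISF dimension i, solve ISF_i(n) = a + b*n from the last good frame
    (index n_prev) and the future good frame (index n_future), then evaluate
    at the lost frame index n_lost."""
    out = []
    for p, f in zip(isf_prev, isf_future):
        b = (f - p) / (n_future - n_prev)   # slope from the two known frames
        a = p - b * n_prev                  # intercept
        out.append(a + b * n_lost)
    return out
```

For the example above (frames N-1 and N+2 good, frame N lost), this would be called as interpolate_isf(isf_nm1, isf_np2, N - 1, N + 2, N).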
  • before compensation, the frame state in the voice code stream sequence may be judged, including: how many frames are lost, whether there are future good frames, whether there are good frames before the current frame, and so on. Then, according to the frame state, different methods are used to calculate the formant spectrum information of the currently lost frame:
  • if the two frames before the currently lost frame are good frames and there is no future good frame, the prior-art first-order polynomial is used to fit the two frames before the loss;
  • if the frame immediately before the currently lost frame is a good frame, one or more frames are lost, and there is no future good frame, the previous frame is fitted together with ISF_mean(i);
  • if three or more frames are lost and there are future good frames, a first-order polynomial is used to fit ISF_mean(i) with the future good frames;
  • otherwise, a first-order polynomial is used to fit the good frame before the current loss with the future good frame; this case has been described in detail above.
  • ISF_mean(i) is calculated from the following quantities:
  • past_ISF_q(i) is the formant spectrum information corresponding to the i-th point of the frame immediately before the currently lost frame
  • a preset constant weights the combination
  • ISF_const_mean(i) is the average of the formant spectrum information over a period of time.
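Where no future good frame exists, the text fits the previous frame together with ISF_mean(i). A hedged sketch of one plausible weighting, assuming a smoothing constant ALPHA whose value the text does not give:

```python
ALPHA = 0.9  # assumed smoothing constant; the patent does not give its value

def isf_fallback(past_isf, isf_const_mean, alpha=ALPHA):
    """Hedged sketch of the history-only case: pull the previous frame's ISF
    vector (past_ISF_q) toward a long-term average (ISF_const_mean). The exact
    combination in the patent is not reproduced; this weighted mean is a
    common PLC choice and is an assumption here."""
    return [alpha * p + (1.0 - alpha) * m for p, m in zip(past_isf, isf_const_mean)]
```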
  • the historical frame information includes a pitch value of the historical frame
  • the future frame information includes a pitch value of the future frame.
  • the pitch value of the at least one current frame may be determined according to a pitch value of the historical frame and a pitch value of the future frame.
  • the pitch frequency is the frequency of vocal cord vibration during utterance and is the reciprocal of the pitch period
  • here, the pitch value refers to the pitch period.
  • each frame in the voice code stream sequence includes four pitch values, one per subframe.
  • Suppose frames N-2 and N-1 are good frames, frames N and N+1 are lost, and frames N+2 and N+3 exist; a second-order polynomial fit is applied using the pitch values of frames N-1 and N+2.
  • The pitch values pitch_1(N-1), pitch_2(N-1), pitch_3(N-1), pitch_4(N-1) of frame N-1 and the pitch values pitch_1(N+2), pitch_2(N+2), pitch_3(N+2), pitch_4(N+2) of frame N+2 are known.
  • pitch represents a pitch value
  • N represents a frame number
  • the subscript indicates the position of the subframe within the frame; each subframe corresponds to one pitch value.
  • the second-order polynomial is as follows:
  • y = a_0 + a_1·x + a_2·x²
  • a_0, a_1, and a_2 are the coefficients of the fitting curve, which may be initialized according to engineering design experience. According to the principle of minimum squared deviation, the following matrix equation is obtained:
  • x_i is the time point of the i-th subframe in frame N-1 or frame N+2
  • y_i is the pitch value of that subframe
  • the pitch_1(N-1) time point is defined as 4×(N-1)+1
  • the pitch_2(N-1) time point is defined as 4×(N-1)+2
  • and so on, up to the pitch_4(N+2) time point, which is defined as 4×(N+2)+4.
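The second-order least-squares fit over the eight known subframe pitch values can be sketched in pure Python, assuming the subframe time points defined above (subframe k of frame n at time 4·n+k+1); the function name is illustrative:

```python
def predict_lost_pitch(n_prev, n_future, pitch_prev, pitch_future, n_lost):
    """Fit y = a0 + a1*x + a2*x^2 to the 8 known subframe pitch values by
    least squares, then evaluate at the lost frame's subframe time points.
    Subframe k (0-based) of frame n is placed at time 4*n + k + 1."""
    xs = [4 * n_prev + k + 1 for k in range(4)] + [4 * n_future + k + 1 for k in range(4)]
    ys = list(pitch_prev) + list(pitch_future)
    # Normal equations of the least-squares problem: M @ [a0, a1, a2] = v
    s = [sum(x ** k for x in xs) for k in range(5)]
    v = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]

    def det3(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det3(m)
    coef = []
    for col in range(3):  # Cramer's rule: substitute v into one column at a time
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = v[r]
        coef.append(det3(mc) / d)
    a0, a1, a2 = coef
    return [a0 + a1 * t + a2 * t * t for t in (4 * n_lost + k + 1 for k in range(4))]
```

With frames N-1 and N+2 good and frames N and N+1 lost, calling this once per lost frame yields the four compensated subframe pitch values of each.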
  • the frame state of the voice code stream sequence may be determined, including: determining how many frames are lost, whether there are future good frames, whether there are good frames before the current loss, and so on; then, according to the frame state of the voice code stream sequence, different methods are used to calculate the pitch value of the currently lost frame.
  • the second-order polynomial is used to fit the pitch values of the current frame loss by using the good frame before the current frame loss.
  • the currently lost frame includes a pitched speech component and a noise speech component. If four or more frames are lost, the pitch energy is reduced, and only the pitch value of the noise is compensated.
  • the three frames before the current frame loss are good frames, 1-3 frames are lost, and there are future good frames.
  • the second-order polynomial is used to fit the pitch values of the current frame loss by using the good frame and the future good frame before the current frame loss. This situation has been introduced above.
  • S303. Determine the frame type of the current frame.
  • the frame type includes unvoiced and voiced.
  • the vocal characteristics of voiced and unvoiced voices are quite different.
  • the frame type of the current frame is different, and the frame loss compensation strategy used is different.
  • the difference between voiced and unvoiced is that the voiced signal has significant periodicity due to vocal cord vibration.
  • Periodic detection can employ algorithms such as zero-crossing rate, correlation, spectral tilt, or pitch change rate. Among them, the zero-crossing rate and correlation calculation are widely used in the prior art and will not be described. The following describes the frame state of a speech signal by spectral tilt and pitch change rate.
  • the spectral tilt of the at least one current frame may be determined according to the magnitude of the time domain signal obtained by decoding the historical frame, and the frame type of the at least one current frame is then determined according to that spectral tilt. The pitch frequency of a voiced speech signal is below 500 Hz, so a periodic signal can be identified from the spectral tilt, calculated as follows:
  • tilt is the magnitude of the spectral tilt of the current frame
  • s is the magnitude of the analog time domain signal obtained by decoding the historical frame
  • i is the time point of the time domain signal in the time direction.
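The spectral-tilt equation itself is not reproduced in this text, so the sketch below uses a common spectral-tilt measure, the normalized first-lag autocorrelation of the decoded time-domain signal, as a stated assumption:

```python
def spectral_tilt(s):
    """Normalized first-lag autocorrelation of the decoded time-domain signal
    s: near 1 for low-frequency (voiced-like) content, near or below 0 for
    flat/unvoiced content. This measure is an assumption; the patent's exact
    tilt formula is not reproduced in the text."""
    num = sum(s[i] * s[i - 1] for i in range(1, len(s)))
    den = sum(x * x for x in s)
    return num / den if den else 0.0
```

A slowly varying (low-frequency) signal gives a tilt close to 1; a rapidly alternating one gives a negative tilt.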
  • a pitch change state of the plurality of subframes in the at least one current frame may be acquired; and a frame type of the at least one current frame is determined according to a pitch change state of the multiple subframes.
  • voiced sound is produced mainly by vocal cord vibration and therefore has a pitch; because the vocal cord vibration changes relatively slowly, the pitch also changes relatively slowly.
  • Each sub-frame has a pitch, so the pitch is used to determine the frame type of the current dropped frame.
  • pitch change is the pitch change state of 4 subframes in a frame
  • pitch(i) is the pitch value of the i-th subframe
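The pitch-change equation is likewise not reproduced here; a hedged sketch using the mean absolute subframe-to-subframe pitch difference as the change measure:

```python
def pitch_change(pitches):
    """Mean absolute pitch difference between consecutive subframes of a frame.
    A small value suggests slowly varying vocal-cord vibration, i.e. a voiced
    frame; the exact formula in the patent is not reproduced, so this measure
    is illustrative."""
    diffs = [abs(pitches[i] - pitches[i - 1]) for i in range(1, len(pitches))]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Comparing the result against a threshold then classifies the frame as voiced (small change) or unvoiced.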
  • a frame type of the at least one current frame may be determined, and at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame is determined according to the frame type.
  • the current frame includes a pitched speech frame and a noisy speech frame, the adaptive codebook gain is the energy gain of the pitch portion, and the fixed codebook gain is the energy gain of the noise portion.
  • if the frame type is voiced, the adaptive codebook gain of the at least one current frame is determined according to the adaptive codebook gain and pitch period of a historical frame and the energy gain of the at least one current frame, and the average of the fixed codebook gains of multiple historical frames is used as the fixed codebook gain of the at least one current frame.
  • a history frame may be the latest history frame before the current frame.
  • if the frame type is unvoiced, the fixed codebook gain of the at least one current frame is determined according to the fixed codebook gain and pitch period of a historical frame and the energy gain of the at least one current frame, and the average of the adaptive codebook gains of multiple historical frames is used as the adaptive codebook gain of the at least one current frame.
  • energy adjustment is applied to the current lost frame: if the frame state of the current lost frame is voiced, the adaptive codebook gain of the current lost frame is determined from g_p(n-1), G_voice, and T_c, and the fixed codebook gain of the current lost frame is g_c = median5(g_c(n-1), ..., g_c(n-5)), where g_p(n-1) is the adaptive codebook gain of the most recent history frame
  • G_voice is the energy gain of the current lost frame in voiced sound
  • T_c is the pitch period of the most recent history frame
  • median5(g_c(n-1), ..., g_c(n-5)) is the median of the fixed codebook gains of the last five historical frames.
  • if the frame state of the current lost frame is unvoiced, the adaptive codebook gain of the current lost frame is g_p = median5(g_p(n-1), ..., g_p(n-5)), and the fixed codebook gain of the current lost frame is determined from g_c(n-1), G_noise, and T_c
  • median5(g_p(n-1), ..., g_p(n-5)) is the median of the adaptive codebook gains of the last five historical frames
  • g_c(n-1) is the fixed codebook gain of the most recent history frame, G_noise is the energy gain of the current lost frame in unvoiced sound, and T_c is the pitch period of the most recent history frame.
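The median5 selection can be sketched as follows; conceal_gains is an illustrative wrapper, and the refinement of the dominant gain by G_voice/G_noise and T_c is not reproduced here:

```python
def median5(values):
    """Median of the last five values (the text's median5 operator)."""
    tail = sorted(values[-5:])
    return tail[len(tail) // 2]

def conceal_gains(is_voiced, g_p_hist, g_c_hist):
    """Illustrative wrapper: the gain of the dominant component starts from the
    most recent history frame (its refinement by G_voice/G_noise and T_c is not
    shown), and the other gain is replaced by median5 of its history."""
    if is_voiced:
        return g_p_hist[-1], median5(g_c_hist)   # (g_p seed, concealed g_c)
    return median5(g_p_hist), g_c_hist[-1]       # (concealed g_p, g_c seed)
```

The median is preferred over a mean here because one outlier gain in the history would otherwise skew the concealed value.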
  • the energy gain of the at least one current frame may be determined according to the time domain signal size in the decoded historical frame information and the length of each subframe in the historical frame.
  • the energy gain of the current frame includes the energy gain of the current lost frame in voiced sound or in unvoiced sound. Since the future frame has not been decoded at the current time, the energy gain of the current lost frame can only be determined from the history frame information. When the previous good frame is voiced, the energy gain G_voice of the current lost frame is calculated as follows:
  • S is the time domain signal size obtained by decoding the previous good frame of the current frame
  • the length is 4·L_subfr
  • L subfr represents the length of one subframe
  • T c represents the pitch period of the previous good frame
  • i is the time point of the time domain signal along the time direction. To prevent G_voice from becoming too large or too small, which would make the energy of the restored frame unpredictable, G_voice is limited to [0, 2].
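Limiting G_voice to [0, 2] as the text specifies is a simple clamp:

```python
def clamp_energy_gain(g, lo=0.0, hi=2.0):
    """Limit the energy gain to [0, 2], as the text specifies for G_voice, so
    the restored frame's energy cannot blow up or collapse."""
    return max(lo, min(hi, g))
```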
  • when the previous good frame is unvoiced, the energy gain of the current frame is calculated as follows:
  • S is the time domain signal size obtained by decoding the previous good frame of the current frame
  • L subfr is the length of one subframe
  • i is the time point of the time domain signal in the time direction.
  • In this embodiment, the historical frame information and the future frame information in the voice code stream sequence are first obtained, and then the formant spectrum information, pitch value, fixed codebook gain, adaptive codebook gain, energy, and so on of the currently lost frame in the voice signal are estimated according to the historical frame information and the future frame information, improving the accuracy of frame loss compensation.
  • FIG. 5 is a schematic structural diagram of a frame loss compensation apparatus according to an embodiment of the present disclosure.
  • the device may be a vocoder and may include, for example, a receiving module 501, an obtaining module 502, and a processing module 503. A detailed description of these modules follows.
  • the receiving module 501 is configured to receive a voice code stream sequence.
  • the obtaining module 502 is configured to acquire historical frame information and future frame information in the voice code stream sequence, where the sequence includes frame information of multiple voice frames, the multiple voice frames including at least one history frame, at least one current frame, and at least one future frame; the at least one history frame is located before the at least one current frame in the time domain, and the at least one future frame is located after the at least one current frame in the time domain; the historical frame information is the frame information of the at least one history frame, and the future frame information is the frame information of the at least one future frame;
  • the processing module 503 is configured to estimate frame information of the at least one current frame according to the historical frame information and the future frame information.
  • the voice code stream sequence is stored in a buffer
  • the processing module 503 is specifically configured to: decode the frame information of the multiple voice frames of the voice code stream sequence in the buffer to obtain the decoded historical frame information, and obtain the undecoded future frame information from the buffer.
  • the historical frame information includes formant spectrum information of the at least one historical frame, and the future frame information includes formant spectrum information of the at least one future frame;
  • the processing module 503 is specifically configured to: determine formant spectrum information of the at least one current frame according to formant spectrum information of the historical frame and formant spectrum information of the future frame.
  • the historical frame information includes a pitch value of the at least one historical frame, and the future frame information includes a pitch value of the at least one future frame;
  • the processing module 503 is specifically configured to: determine a pitch value of the at least one current frame according to a pitch value of the at least one historical frame and a pitch value of the at least one future frame.
  • the historical frame information includes the energy of the at least one historical frame, and the future frame information includes the energy of the at least one future frame;
  • the processing module 503 is specifically configured to: determine, according to the energy of the at least one historical frame and the energy of the at least one future frame, the energy of the at least one current frame.
  • the processing module 503 is specifically configured to: determine a frame type of the at least one current frame, where the frame type includes unvoiced or voiced sound;
  • the processing module 503 is further configured to determine the size of the spectral tilt of the at least one current frame;
  • the processing module 503 is further configured to acquire a pitch change state of the multiple subframes in the at least one current frame.
  • the processing module 503 is specifically configured to: if the frame type is voiced, determine the adaptive codebook gain of the at least one current frame according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and use the average of the fixed codebook gains of multiple historical frames as the fixed codebook gain of the at least one current frame.
  • the processing module 503 is specifically configured to: if the frame type is unvoiced, determine the fixed codebook gain of the at least one current frame according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and use the average of the adaptive codebook gains of multiple historical frames as the adaptive codebook gain of the at least one current frame.
  • the processing module 503 is further configured to determine the energy gain of the at least one current frame according to the size of the time domain signal in the decoded historical frame information and the length of each subframe in the historical frame.
  • each module may also perform the methods and functions performed in the foregoing embodiments corresponding to the corresponding descriptions of the method embodiments shown in FIG.
  • FIG. 6 is a schematic structural diagram of a frame loss compensation device according to the present application.
  • the apparatus may include at least one vocoder 601, such as an Adaptive Multi-Rate Wideband (AMR-WB) vocoder, at least one communication interface 602, at least one memory 603, and at least one communication bus 604.
  • the communication bus 604 is used to implement connection communication between these components.
  • the communication interface 602 of the device in the embodiment of the present application is used for signaling or data communication with other node devices.
  • the memory 603 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the memory 603 can optionally also be at least one storage device located remotely from the vocoder 601.
  • a set of program code is stored in the memory 603, and the memory 603 may further be used to store temporary data such as intermediate operation data of the vocoder 601.
  • the vocoder 601 executes the program code in the memory 603 to implement the method mentioned in the previous embodiments; reference may be made to the description of the previous embodiments for details. Further, the vocoder 601 may cooperate with the memory 603 and the communication interface 602 to perform the operations of the receiving device in the foregoing application embodiments.
  • the vocoder 601 may specifically include a processor that executes the program code, such as a central processing unit (CPU) or a digital signal processor (DSP) or the like.
  • the communication interface 602 can be used to receive a voice code stream sequence.
  • alternatively, the memory 603 may store no program code, and the vocoder 601 may include a hardware processor that does not need to execute program code, such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a hardware accelerator formed by an integrated circuit. In this case, the memory 603 may be used only for storing temporary data such as intermediate operation data of the vocoder 601.
  • the functions of the method may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • when implemented in software, the method may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another; for example, the computer instructions can be transferred from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Abstract

A frame loss compensation method and device, comprising: receiving a voice code stream sequence; acquiring historical frame information and future frame information in the voice code stream sequence, wherein the voice code stream sequence comprises frame information of multiple voice frames, which comprise at least one historical frame, at least one current frame, and at least one future frame; the at least one historical frame is located before the at least one current frame in the time domain, and the at least one future frame is located after the at least one current frame in the time domain; the historical frame information is the frame information of the at least one historical frame, and the future frame information is the frame information of the at least one future frame; and estimating the frame information of the at least one current frame on the basis of the historical frame information and the future frame information, so as to increase the accuracy of frame loss compensation.

Description

Frame loss compensation method and device

Technical field

The present application relates to the field of voice processing technologies, and in particular, to a frame loss compensation method and device.

Background

In packet-switched (PS) voice calls, such as VoLTE (Voice over LTE), VoWiFi (Voice over WiFi), and VoIP (Voice over Internet Protocol) scenarios, voice data does not have exclusive bandwidth resources; bandwidth preemption and data blocking may occur, which can cause delay jitter and frame loss, resulting in choppy or stuttering speech. To reduce the speech interruption and stuttering caused by delay jitter, an Adaptive Jitter Buffer (AJB) is usually adopted in the call scheme to reduce the effect of delay jitter within a certain time range.

In prior-art solutions, the vocoder itself has a Packet Loss Concealment (PLC) function, which can estimate the code stream information of the currently lost frame from the information of the good frames (historical frames) preceding it. The code stream information of the currently lost frame includes formant spectrum information, pitch frequency, fractional pitch, adaptive codebook gain, fixed codebook gain, and energy. However, since actual speech changes rapidly, and the phoneme, glottis, vocal tract, and oral cavity information involved in the pronunciation of each word changes constantly, frame loss compensation using only historical frames is not accurate enough.
Summary

The present application provides a frame loss compensation method and device, which can improve the accuracy of frame loss compensation.

In a first aspect, an embodiment of the present application provides a frame loss compensation method, including:

first receiving a voice code stream sequence; acquiring historical frame information and future frame information in the voice code stream sequence, where the voice code stream sequence includes frame information of multiple voice frames, the multiple voice frames include at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame is located before the at least one current frame in the time domain, the at least one future frame is located after the at least one current frame in the time domain, the historical frame information is frame information of the at least one historical frame, and the future frame information is frame information of the at least one future frame; and estimating the frame information of the at least one current frame according to the historical frame information and the future frame information, thereby improving the accuracy of frame loss compensation.
In a possible design, before the frame information of the at least one current frame is estimated, the type or state of the voice frames in the voice code stream sequence may be determined, including: whether there is a good frame before the at least one current frame, whether the good frame before the at least one current frame is a silence frame, whether there is a valid future frame, and so on. For the different types or states of the voice frames in the voice code stream sequence, different compensation measures are taken for the current frame, so that the recovered signal is closer to the original signal and a better frame loss compensation effect is achieved.
In a possible design, after the receiving device receives the voice code stream sequence through an interface, the voice code stream sequence may be stored in a buffer, for example, the buffer of the AJB. The frame information of the voice code stream sequence in the buffer is then decoded to obtain the decoded historical frame information, and the undecoded future frame information is obtained from the buffer.
In another possible design, the historical frame information includes the formant spectrum information of the historical frame, and the future frame information includes the formant spectrum information of the future frame. The formant spectrum information of the at least one current frame may be determined according to the formant spectrum information of the historical frame and the formant spectrum information of the future frame. For example, the formant spectrum information is the excitation response of the vocal tract during utterance.

In another possible design, before the formant spectrum information of the at least one current frame is determined according to the formant spectrum information of the historical frame and that of the future frame, the frame states in the voice code stream sequence may be judged, including: how many frames are lost, whether there are future good frames, whether there are good frames before the current frame, and so on. Different methods are then used to calculate the formant spectrum information of the currently lost frame according to the frame states in the voice code stream sequence.
In another possible design, the historical frame information includes the pitch value of the historical frame, and the future frame information includes the pitch value of the future frame. The pitch value of the at least one current frame may be determined according to the pitch value of the historical frame and the pitch value of the future frame. For example, the pitch value is the pitch frequency of the vocal cord vibration during utterance, and the pitch period is the reciprocal of the pitch frequency.

In another possible design, before the pitch value of the at least one current frame is determined according to the pitch values of the historical frame and the future frame, the frame states of the voice code stream sequence may be judged, including: how many frames are lost, whether there are future good frames, whether there are good frames before the current frame, and so on. Different methods are then used to calculate the pitch value of the currently lost frame according to the frame states of the voice code stream sequence.
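The pitch estimation between historical and future frames can be sketched as follows. This is a minimal illustration that assumes simple linear interpolation between the nearest historical and future pitch values; the actual weighting depends on the frame states discussed above, and all names are hypothetical.

```python
def interpolate_pitch(pitch_hist, pitch_future, lost_index, n_lost):
    """Estimate the pitch of the lost_index-th (0-based) of n_lost
    consecutive lost frames by linear interpolation between the last
    historical pitch and the first future pitch (illustrative)."""
    w = (lost_index + 1) / (n_lost + 1)  # fractional position of the lost frame
    return (1.0 - w) * pitch_hist + w * pitch_future
```

For two lost frames between pitches 100 and 130, the sketch yields 110 and 120, so the estimated pitch trajectory continues smoothly across the gap instead of freezing at the last historical value.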
In another possible design, the size of the spectral tilt of the at least one current frame is determined according to the magnitude of the time domain signal obtained by decoding the historical frame, and the frame type of the at least one current frame is determined according to the size of the spectral tilt. For example, the time domain signal is the time domain representation of the decoded frame information.
In another possible design, the pitch change state of multiple subframes in the at least one current frame may be acquired, and the frame type of the at least one current frame is determined according to the pitch change state of the multiple subframes. Voiced sound is mainly produced by vocal cord vibration: a pitch exists, the vocal cord vibration changes relatively slowly, and the pitch therefore changes relatively slowly. For example, each subframe has a pitch, so the pitch is used to determine the frame type of the at least one current frame.
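The observation that voiced pitch varies slowly across subframes suggests a simple classifier. The following is a hedged sketch: the 15% relative-change threshold is an assumption for illustration, not a value given in this application.

```python
def classify_frame(subframe_pitches, max_rel_change=0.15):
    """Label a frame 'voiced' when the pitch changes slowly across its
    subframes, and 'unvoiced' otherwise. The threshold is illustrative."""
    for prev, cur in zip(subframe_pitches, subframe_pitches[1:]):
        # A missing pitch or an abrupt jump indicates an unvoiced frame.
        if prev <= 0 or abs(cur - prev) / prev > max_rel_change:
            return "unvoiced"
    return "voiced"
```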
In another possible design, the frame type of the at least one current frame is determined, and at least one of the adaptive codebook gain and the fixed codebook gain of the at least one current frame is determined according to the frame type. The current frame includes a pitch (periodic) speech component and a noise component; the adaptive codebook gain is the energy gain of the pitch part, and the fixed codebook gain is the energy gain of the noise part.
In another possible design, if the frame type is voiced, the adaptive codebook gain of the at least one current frame is determined according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and the average of the fixed codebook gains of multiple historical frames is used as the fixed codebook gain of the at least one current frame.

In another possible design, if the frame type is unvoiced, the fixed codebook gain of the at least one current frame is determined according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and the average of the adaptive codebook gains of multiple historical frames is used as the adaptive codebook gain of the at least one current frame.
In another possible design, the energy gain of the at least one current frame is determined according to the magnitude of the time domain signal in the decoded historical frame information and the length of each subframe in the historical frame. The energy gain of the current frame includes the energy gain of the current frame for voiced sound or the energy gain of the current frame for unvoiced sound.
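The voiced/unvoiced gain rules above can be sketched as a small dispatch. This illustration applies the energy gain as a simple multiplier on the most recent gain of the dominant excitation and omits the pitch-period dependence; both simplifications are assumptions made for readability, not the exact computation of this application.

```python
def estimate_gains(frame_type, hist_adaptive, hist_fixed, energy_gain):
    """Estimate (adaptive_gain, fixed_gain) for a lost frame.

    Voiced: derive the adaptive (pitch) gain from the last historical
    gain and the energy gain, and average the historical fixed (noise)
    gains. Unvoiced: the mirrored rule."""
    mean = lambda gains: sum(gains) / len(gains)
    if frame_type == "voiced":
        return hist_adaptive[-1] * energy_gain, mean(hist_fixed)
    return mean(hist_adaptive), hist_fixed[-1] * energy_gain
```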
In a second aspect, an embodiment of the present application provides a frame loss compensation apparatus, where the apparatus is configured to implement the method and functions performed by the user equipment in the first aspect. It is implemented by hardware/software, and the hardware/software includes units corresponding to the foregoing functions.

In a third aspect, the present application provides a frame loss compensation device, including a vocoder, a memory, and a communication bus, where the memory is coupled to the vocoder through the communication bus. The communication bus is used to implement connection and communication between the vocoder and the memory, and the vocoder executes a program stored in the memory to implement the steps of the frame loss compensation method provided in the first aspect.

In yet another aspect, the present application provides a computer readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the above aspects.

In yet another aspect, the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods described in the above aspects.
Brief description of the drawings

To describe the technical solutions in the embodiments of the present application or the background more clearly, the accompanying drawings required in the embodiments or the background are described below.

FIG. 1 is a schematic architectural diagram of a frame loss compensation system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a voice code stream sequence according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of a frame loss compensation method according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of another frame loss compensation method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a frame loss compensation apparatus according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a frame loss compensation device according to an embodiment of the present application.
Detailed description

The embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
Referring to FIG. 1, FIG. 1 is a schematic architectural diagram of a frame loss compensation system according to an embodiment of the present application. The system can be applied to PS voice call scenarios (including but not limited to VoLTE, VoWiFi, and VoIP); with increased buffering, the system can also be applied to Circuit Switched (CS) calls. The system includes a base station and a receiving device. The receiving device may be a device that provides a voice and/or data connection to a user, i.e., user equipment; it may be connected to a computing device such as a laptop or desktop computer, or it may be a standalone device such as a Personal Digital Assistant (PDA). The receiving device may also be referred to as a system, subscriber unit, subscriber station, mobile station, remote station, access point, remote terminal, access terminal, user terminal, user agent, or user device. The base station may be an access point, a Node B, an evolved NodeB (eNB), or a 5G base station, i.e., a device in an access network that communicates with wireless terminals over the air interface through one or more sectors. By converting received air interface frames into IP (Internet Protocol) packets, the base station can act as a signal relay between the wireless terminal and the rest of the access network, which may include an Internet Protocol network. The base station can also coordinate the management of the attributes of the air interface. Both the base station and the user equipment in this embodiment may adopt the frame loss compensation method mentioned in the following embodiments and include a corresponding frame loss compensation apparatus, so as to implement frame loss compensation for voice signals. After any device receives a voice code stream sequence from a peer device, it can decode the voice code stream sequence to obtain decoded frame information, compensate for lost frames, and perform subsequent decoding.
FIG. 2 is a schematic diagram of a voice code stream sequence according to an embodiment of the present application. As shown in FIG. 2, at the current time T, a common voice signal, i.e., a sequence of voice frames, is divided into three types of frames: historical frames (at least one frame before time T), current frames (at least one frame at time T), and future frames (at least one frame after time T), where time T is a time unit or time point. For example, the historical frames include frame N-1, frame N-2, frame N-3, frame N-4, and so on; the current frames are lost frames, including frame N and frame N+1; and the future frames include frame N+2, frame N+3, frame N+4, and so on, where N is a positive integer greater than 4. The lost frames involved in frame loss compensation in this embodiment may include frames lost in transmission, damaged frames, frames that cannot be correctly received or decoded, or frames that cannot be used for a specific reason.
FIG. 3 is a schematic flowchart of a frame loss compensation method according to an embodiment of the present application. As shown in FIG. 3, the method in this embodiment of the present application includes:
S301: Acquire historical frame information and future frame information in the voice code stream sequence, where the voice code stream sequence includes frame information of multiple voice frames, the multiple voice frames include at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame is located before the at least one current frame in the time domain, the at least one future frame is located after the at least one current frame in the time domain, the historical frame information is frame information of the at least one historical frame, and the future frame information is frame information of the at least one future frame. The at least one current frame may be a frame lost for various reasons.

In a specific implementation, after the receiving device receives the voice code stream sequence through an interface, the voice code stream sequence may be stored in the buffer of a memory, for example, the buffer of the AJB. The frame information of the voice code stream sequence in the buffer is then decoded to obtain the decoded historical frame information and the undecoded future frame information in the buffer. For example, the decoded historical frame is a voice analog signal, while the historical frame before decoding is a voice digital signal. The future frames are not decoded; the frame loss compensation apparatus or system can parse the future frames to obtain partially valid frame information, such as formant spectrum information and pitch values. The information or data of multiple frames, including the historical frames, the current frames, and the future frames, is buffered in the buffer.
S302: Compensate the frame information of the at least one current frame according to the historical frame information and the future frame information.

In a specific implementation, before S302 is performed, the type or state of the voice frames in the voice code stream sequence may be determined, including: whether there is a good frame (i.e., a normal frame that can be used for compensation) before the lost frame, whether the good frame before the lost frame is a silence frame, whether there is a valid future frame, and so on. For the different types or states of the voice frames in the voice code stream sequence, different compensation measures are taken for the current frame in S302, so that the recovered signal is closer to the original signal and a better frame loss compensation effect is achieved. The specific frame loss compensation method is shown in FIG. 4, which is a schematic flowchart of another frame loss compensation method according to an embodiment of the present application. The method includes:
S401: Determine whether the current frame is a lost frame or a bad frame. S402: If the current frame is a good frame, decode the good frame. S403: If the current frame is a bad frame or a lost frame, determine whether the good frame before the current frame is a silence frame (Silence Insertion Descriptor, SID). S404: If the good frame before the current frame is a silence frame, decode the silence frame directly. S405: If the good frame before the current frame is not a silence frame, determine whether there is a valid future frame after the current frame. S406: If there is no valid future frame after the current frame, compensate the current frame according to the historical frame information. S407: If there is a valid future frame, compensate the currently lost frame according to the historical frame information and the future frame information; the specific implementation of this step is described in detail below. S408: After the frame information of the current frame is compensated, decode the compensated current frame.
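The S401-S408 branches can be condensed into a small dispatch function; this is a sketch in which the predicate names stand in for the state checks described above and are hypothetical.

```python
def choose_action(frame_is_good, prev_good_is_sid, has_valid_future):
    """Return the concealment action for one frame, mirroring S401-S408."""
    if frame_is_good:
        return "decode"                           # S402: good frame
    if prev_good_is_sid:
        return "decode_sid"                       # S404: silence before loss
    if not has_valid_future:
        return "compensate_from_history"          # S406: history-only PLC
    return "compensate_from_history_and_future"   # S407: bidirectional PLC
```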
This embodiment describes step S407 in detail. The historical frame information and the future frame information can be considered together to compensate the currently lost frame, as follows:

In an embodiment, the historical frame information includes the formant spectrum information of the historical frame, and the future frame information includes the formant spectrum information of the future frame. In the solution of this embodiment, the formant spectrum information of the at least one current frame may be determined according to the formant spectrum information of the historical frame and that of the future frame. The formant spectrum information is the excitation response of the vocal tract during utterance and includes Immittance Spectral Frequency (ISF) information; an ISF vector is used below to represent the formant spectrum information.

For example, in the voice code stream sequence, frames N-2 and N-1 are good frames, frames N and N+1 are lost, and the future frames N+2 and N+3 exist; a first-order polynomial fit is used to calculate the formant spectrum information of frame N:
ISF_i(N-1) = a + b × (N-1);

ISF_i(N+2) = a + b × (N+2);
where the formant spectrum information of frames N-1 and N+2 is represented by the formant spectrum information at multiple points; the formant spectrum information has been processed by a filter, and each point represents a coefficient pair of the filter. ISF_i(N-1) is the formant spectrum information corresponding to the i-th point of frame N-1, and ISF_i(N+2) is the formant spectrum information corresponding to the i-th point of frame N+2. From the above two formulas:
b = (ISF_i(N+2) − ISF_i(N-1)) / 3;
a = ISF_i(N-1) − b × (N-1);
Substituting the above a and b into ISF_i(N) = a + b·N, where ISF_i(N) is the formant spectrum information corresponding to the i-th point of frame N, yields:
ISF_i(N) = ISF_i(N-1) + (ISF_i(N+2) − ISF_i(N-1)) / 3 = (2 × ISF_i(N-1) + ISF_i(N+2)) / 3
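As a sketch of this first-order fit, the per-point interpolation can be expressed in a few lines of code; the 16-dimensional ISF vectors and their sample values are illustrative assumptions, not taken from the embodiment:

```python
def interpolate_isf(isf_prev, isf_future):
    # Fit ISF_i = a + b*x through (N-1, ISF_i(N-1)) and (N+2, ISF_i(N+2));
    # evaluating at x = N gives (2*ISF_i(N-1) + ISF_i(N+2)) / 3 per point.
    return [(2.0 * p + f) / 3.0 for p, f in zip(isf_prev, isf_future)]

# Illustrative 16-point ISF vectors for good frames N-1 and N+2
isf_n_minus_1 = [100.0 * (i + 1) for i in range(16)]
isf_n_plus_2 = [100.0 * (i + 1) + 30.0 for i in range(16)]
isf_n = interpolate_isf(isf_n_minus_1, isf_n_plus_2)  # estimated ISF of lost frame N
```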
Optionally, before the formant spectrum information of the at least one current frame is determined according to the formant spectrum information of the historical frame and that of the future frame, the frame states in the speech code stream sequence may be judged, including: how many frames are lost, whether a future good frame exists, whether there is a good frame before the current frame, and so on. Then, according to the frame states in the speech code stream sequence, different methods are used to calculate the formant spectrum information of the current lost frame.
For example, if the two frames before the current lost frame are good frames, 1-2 frames are lost, and there is no future good frame, the prior-art first-order polynomial is fitted using the two frames before the current lost frame. If the frame immediately before the current lost frame is a good frame, one or more frames are lost, and there is no future good frame, the fit uses the frame before the current lost frame and ISF_mean(i). If 3 or more frames are lost and a future good frame exists, a first-order polynomial is fitted using ISF_mean(i) and the future good frame. If the three frames before the current lost frame are good frames, 1-2 frames are lost, and a future good frame exists, a first-order polynomial is fitted using the good frames before the current lost frame and the future good frame; this case has been described in detail above. ISF_mean(i) is calculated as follows:
ISF_mean(i) = β × ISF_const_mean(i) + (1−β) × ISF_adaptive_mean(i), i = 0, …, 15;
where
[equation image: definition of ISF_adaptive_mean(i), the adaptive mean formed from the ISF vectors of recent frames]
past_ISF_q(i) is the formant spectrum information corresponding to the i-th point of the frame immediately before the current lost frame, and β is a preset constant. ISF_const_mean(i) is the average of the formant spectrum information over a period of time.
In addition, after ISF_mean(i) is calculated, the formant spectrum information ISF_q(i) of the current lost frame is calculated from the formant spectrum information of the frame immediately before the current lost frame and ISF_mean(i), as follows:
ISF_q(i) = α × past_ISF_q(i) + (1−α) × ISF_mean(i), i = 0, …, 15, where α is a preset constant.
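The smoothing above can be sketched as follows; the numeric values of α and β are assumptions for illustration, since the embodiment only states that they are preset constants:

```python
ALPHA = 0.9   # preset constant α (assumed value for illustration)
BETA = 0.75   # preset constant β (assumed value for illustration)

def conceal_isf(past_isf_q, isf_const_mean, isf_adaptive_mean):
    # ISF_mean(i) = β*ISF_const_mean(i) + (1-β)*ISF_adaptive_mean(i)
    # ISF_q(i)    = α*past_ISF_q(i)     + (1-α)*ISF_mean(i)
    isf_q = []
    for i in range(len(past_isf_q)):
        isf_mean = BETA * isf_const_mean[i] + (1.0 - BETA) * isf_adaptive_mean[i]
        isf_q.append(ALPHA * past_isf_q[i] + (1.0 - ALPHA) * isf_mean)
    return isf_q
```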
In another embodiment, the historical frame information includes the pitch values of the historical frame, and the future frame information includes the pitch values of the future frame. In the solution of this embodiment, the pitch values of the at least one current frame may be determined according to the pitch values of the historical frame and the pitch values of the future frame. For example, the pitch value is the pitch frequency of vocal-cord vibration during phonation, which is the reciprocal of the pitch period; alternatively, the pitch value is the pitch period.
For example, the speech code stream sequence includes four pitch values per frame. Suppose frames N-2 and N-1 are good frames, frames N and N+1 are missing, and frames N+2 and N+3 exist; a second-order polynomial is fitted from frames N-1 and N+2 to calculate the pitch values of frame N. The pitch values pitch_1(N-1), pitch_2(N-1), pitch_3(N-1), pitch_4(N-1) of frame N-1 and pitch_1(N+2), pitch_2(N+2), pitch_3(N+2), pitch_4(N+2) of frame N+2 are known. Here, pitch denotes a pitch value, N denotes a frame number, and the subscript denotes the position of the subframe within the frame; each subframe corresponds to one pitch value. The second-order polynomial is as follows:
Y = a_0 + a_1·x + a_2·x², where a_0, a_1, and a_2 are the coefficients of the fitted curve and may be preset according to engineering design experience. According to the principle of minimizing the sum of squared deviations, the following system of normal equations is obtained:
n·a_0 + a_1·Σx_i + a_2·Σx_i² = Σy_i
a_0·Σx_i + a_1·Σx_i² + a_2·Σx_i³ = Σx_i·y_i
a_0·Σx_i² + a_1·Σx_i³ + a_2·Σx_i⁴ = Σx_i²·y_i
where n is the total number of subframes in frames N-1 and N+2 (n = 8), x_i is the time point of the i-th subframe in frames N-1 and N+2, and y_i is the pitch value corresponding to the time point of the i-th subframe in frames N-1 and N+2. The time point of pitch_1(N-1) is defined as 4·(N-1)+1, the time point of pitch_2(N-1) as 4·(N-1)+2, and so on up to the time point of pitch_4(N+2), defined as 4·(N+2)+4. The pairs 4·(N-1)+1 and pitch_1(N-1), 4·(N-1)+2 and pitch_2(N-1), …, 4·(N+2)+4 and pitch_4(N+2) are then substituted as (x_i, y_i) to solve for the coefficients. Finally, 4·N+1, …, 4·N+4 are each substituted as the variable x into Y = a_0 + a_1·x + a_2·x² with the known coefficients, and the calculated Y values are taken as the pitch values pitch_1(N), pitch_2(N), pitch_3(N), pitch_4(N) of frame N.
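The least-squares fit and evaluation can be sketched as follows, solving the 3×3 normal equations by Cramer's rule in pure Python; the helper names and the sample pitch values are illustrative assumptions:

```python
def fit_quadratic(xs, ys):
    # Build and solve the normal equations for Y = a0 + a1*x + a2*x^2
    n = len(xs)
    sx = sum(xs); sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs); sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys); sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    a = [[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]]
    b = [sy, sxy, sx2y]
    det = lambda m: (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                   - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                   + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(a)
    coeffs = []
    for col in range(3):           # Cramer's rule for a0, a1, a2
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        coeffs.append(det(m) / d)
    return coeffs

def predict_lost_pitches(pitch_prev, pitch_future, n):
    # Subframe k (0-based) of frame M sits at time point 4*M + k + 1
    xs = [4 * (n - 1) + k + 1 for k in range(4)] + [4 * (n + 2) + k + 1 for k in range(4)]
    a0, a1, a2 = fit_quadratic(xs, list(pitch_prev) + list(pitch_future))
    return [a0 + a1 * t + a2 * t * t for t in (4 * n + k + 1 for k in range(4))]
```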
Optionally, before the pitch values of the at least one current frame are determined according to the pitch values of the historical frame and the pitch values of the future frame, the frame states of the speech code stream sequence may be judged, including: how many frames are lost, whether a future good frame exists, whether there is a good frame before the current lost frame, and so on. Then, according to the frame states of the speech code stream sequence, different methods are used to calculate the pitch values of the current lost frame.
For example, if the two frames before the current lost frame are good frames, 1-3 frames are lost, and there is no future good frame, a second-order polynomial is fitted to the pitch values of the current lost frame using the good frames before it. The current lost frames include pitched speech frames and noise speech frames; if 4 or more frames are lost, the pitch energy is reduced and only the pitch values of the noise are compensated. If the three frames before the current lost frame are good frames, 1-3 frames are lost, and a future good frame exists, a second-order polynomial is fitted to the pitch values of the current lost frame using the good frames before it and the future good frame; this case has been introduced above.
S303: Determine the frame type of the current frame. The frame types include unvoiced and voiced. The phonation characteristics of voiced and unvoiced speech differ considerably, so different frame types of the current frame call for different frame loss compensation strategies. The difference between voiced and unvoiced speech is that a voiced signal has obvious periodicity, which is caused by vocal-cord vibration. Periodicity detection may use algorithms such as zero-crossing rate, correlation, spectral tilt, or pitch change rate. The zero-crossing rate and correlation calculations are widely used in the prior art and are not described here. Determining the frame state of the speech signal by spectral tilt and by pitch change rate is introduced below.
In one embodiment, the magnitude of the spectral tilt of the at least one current frame may be determined according to the magnitude of the time-domain signal obtained by decoding the historical frames, and the frame type of the at least one current frame is then determined according to the magnitude of the spectral tilt. The pitch frequency of a voiced speech signal is below 500 Hz, so a periodic signal can be identified from the spectral tilt. The spectral tilt is calculated as follows:
tilt = r_1 / r_0;
where
r_0 = Σ_i s(i)·s(i), r_1 = Σ_i s(i)·s(i+1);
Here, tilt is the magnitude of the spectral tilt of the current frame, s is the magnitude of the simulated time-domain signal obtained by decoding the historical frames, and i is the time point of the time-domain signal in the time direction. For a speech coding sequence within a period of time, the value of r_0 is fixed. Since unvoiced speech has white-noise-like characteristics, the value of r_1 is relatively small; if the value of tilt is less than a preset threshold, the frame type of the at least one current frame is determined to be unvoiced. For voiced speech, the value of r_1 is relatively large; if the value of tilt is not less than the preset threshold, the frame type of the at least one current frame is determined to be voiced.
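A minimal sketch of this voicing decision follows; the threshold value and the two test signals are assumptions for illustration:

```python
def spectral_tilt(s):
    # tilt = r1 / r0, with r0 = sum of s(i)*s(i) and r1 = sum of s(i)*s(i+1)
    r0 = sum(v * v for v in s)
    r1 = sum(s[i] * s[i + 1] for i in range(len(s) - 1))
    return r1 / r0 if r0 else 0.0

def frame_type_from_tilt(s, threshold=0.5):
    # tilt below the preset threshold -> unvoiced; otherwise -> voiced
    return "unvoiced" if spectral_tilt(s) < threshold else "voiced"

voiced_like = [0, 1, 2, 3, 2, 1, 0, -1, -2, -3, -2, -1] * 10  # slow, periodic
noise_like = [1, -1] * 60                                     # rapidly alternating
```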
In another embodiment, the pitch change state of a plurality of subframes in the at least one current frame may be acquired, and the frame type of the at least one current frame is determined according to the pitch change state of the plurality of subframes. Voiced speech is mainly produced by vocal-cord vibration, so a pitch exists; the vocal-cord vibration changes relatively slowly, and the pitch therefore changes relatively slowly. Each subframe has one pitch value, so the pitch is used to judge the frame type of the current lost frame.
pitchchange = Σ_i |pitch(i+1) − pitch(i)|, i = 0, 1, 2;
where pitchchange is the pitch change state of the 4 subframes within a frame, pitch(i) is the pitch value of the i-th subframe, and pitch(i+1) is the pitch value of the (i+1)-th subframe. If the change in pitchchange is small, the at least one current frame is judged to be voiced; if the change is large, the at least one current frame is judged to be unvoiced. This judgment can be made by comparing the change with a preset threshold: when the threshold is reached, the speech signal is determined to be unvoiced; otherwise, the speech signal is determined to be voiced. If the judgment is to be extended across frames, the range can be increased to i = 0, 1, …, 7, so that the pitch changes of the 8 subframes within two frames can be judged.
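This subframe-level check can be sketched as follows; the threshold is an assumed illustrative value, and the pitch sequences are invented examples:

```python
def pitch_change(pitches):
    # Accumulate |pitch(i+1) - pitch(i)| over consecutive subframes
    return sum(abs(pitches[i + 1] - pitches[i]) for i in range(len(pitches) - 1))

def frame_type_from_pitch(pitches, threshold=10.0):
    # Small variation -> voiced (steady vocal-cord vibration); large -> unvoiced
    return "voiced" if pitch_change(pitches) < threshold else "unvoiced"
```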
S304: Adjust the energy of the at least one current frame.
In a specific implementation, the frame type of the at least one current frame may be determined, and at least one of the adaptive codebook gain and the fixed codebook gain of the at least one current frame is determined according to the frame type. The current frame includes a pitched speech frame and a noise speech frame; the adaptive codebook gain is the energy gain of the pitch portion, and the fixed codebook gain is the energy gain of the noise portion.
In one embodiment, if the frame type is voiced, the adaptive codebook gain of the at least one current frame is determined according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and the average of the fixed codebook gains of a plurality of historical frames is taken as the fixed codebook gain of the at least one current frame. The one historical frame may be the most recent historical frame before the current frame.
In another embodiment, if the frame type is unvoiced, the fixed codebook gain of the at least one current frame is determined according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and the average of the adaptive codebook gains of a plurality of historical frames is taken as the adaptive codebook gain of the at least one current frame.
[Corrected under Rule 91, 02.08.2017]
For example, if the number of currently lost frames does not exceed 3, the current lost frame is enhanced in the energy adjustment, including: if the frame state of the current lost frame is voiced, the adaptive codebook gain of the current lost frame is
[equation image: g_p computed from g_p(n-1), the energy gain G_voice, and the pitch period T_c]
and the fixed codebook gain of the current lost frame is g_c = median5(g_c(n-1), …, g_c(n-5)), where g_p(n-1) is the adaptive codebook gain of the most recent historical frame, G_voice is the energy gain of the current lost frame for voiced speech, T_c is the pitch period of the most recent historical frame, and median5(g_c(n-1), …, g_c(n-5)) is the average of the fixed codebook gains of the five most recent historical frames. If the frame state of the current lost frame is unvoiced, the adaptive codebook gain of the current lost frame is g_p = median5(g_p(n-1), …, g_p(n-5)), and the fixed codebook gain of the current lost frame is
[equation image: g_c computed from g_c(n-1), the energy gain G_noise, and the pitch period T_c]
where median5(g_p(n-1), …, g_p(n-5)) is the average of the adaptive codebook gains of the five most recent historical frames, g_c(n-1) is the fixed codebook gain of the most recent historical frame, G_noise is the energy gain of the current lost frame for unvoiced speech, and T_c is the pitch period of the most recent historical frame.
For another example: if the number of currently lost frames exceeds 3, the current lost frame is attenuated in the energy adjustment, including: the adaptive codebook gain of the current lost frame g_p = P_p(state) × median5(g_p(n-1), …, g_p(n-5)), and the fixed codebook gain of the current lost frame g_c = P_c(state) × median5(g_c(n-1), …, g_c(n-5)), where median5(g_p(n-1), …, g_p(n-5)) is the average of the adaptive codebook gains of the five most recent historical frames and median5(g_c(n-1), …, g_c(n-5)) is the average of the fixed codebook gains of the five most recent historical frames. For example, P_p(state) is an attenuation factor (P_p(1) = 0.98, P_p(2) = 0.96, P_p(3) = 0.75, P_p(4) = 0.23, P_p(5) = 0.05, P_p(6) = 0.01), and P_c(state) is an attenuation factor (P_c(1) = 0.98, P_c(2) = 0.98, P_c(3) = 0.98, P_c(4) = 0.98, P_c(5) = 0.98, P_c(6) = 0.70), state = {0, 1, 2, 3, 4, 5, 6}.
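The attenuation rule can be sketched as follows. The factor tables mirror the example values above; following the text's description, median5 is implemented here as the average of the five most recent gains:

```python
P_P = {1: 0.98, 2: 0.96, 3: 0.75, 4: 0.23, 5: 0.05, 6: 0.01}  # adaptive-gain factors
P_C = {1: 0.98, 2: 0.98, 3: 0.98, 4: 0.98, 5: 0.98, 6: 0.70}  # fixed-gain factors

def attenuated_gains(gp_hist, gc_hist, state):
    # g_p = P_p(state) * median5(g_p(n-1..n-5)); g_c likewise with P_c(state).
    # Per the text, median5 is taken as the average of the last five gains.
    median5 = lambda gains: sum(gains[:5]) / 5.0
    return P_P[state] * median5(gp_hist), P_C[state] * median5(gc_hist)
```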
The energy gain of the at least one current frame may be determined according to the magnitude of the time-domain signal in the decoded historical frame information and the length of each subframe in the historical frame. The energy gain of the current frame includes the energy gain of the current lost frame for voiced speech or the energy gain of the current lost frame for unvoiced speech. Since the future frame has not been decoded at the current moment, the energy gain of the current lost frame can only be determined from the historical frame information. When the previous good frame is voiced, the energy gain of the current lost frame is calculated as follows:
[equation image: G_voice computed from the decoded time-domain signal S of the previous good frame over the pitch period T_c]
where S is the time-domain signal obtained by decoding the good frame preceding the current frame, with length 4·L_subfr; L_subfr denotes the length of one subframe, T_c denotes the pitch period of the previous good frame, and i is the time point of the time-domain signal in the time direction. To prevent an overly large or small G_voice from making the energy of the recovered frame unpredictable, G_voice is limited to [0, 2].
When the previous good frame is unvoiced or noise, the energy gain of the current frame is calculated as follows:
[equation image: G_noise computed from the decoded time-domain signal S of the previous good frame over the subframe length L_subfr]
where S denotes the time-domain signal obtained by decoding the good frame preceding the current frame, L_subfr denotes the length of one subframe, and i is the time point of the time-domain signal in the time direction. To prevent an overly large or small G_noise from making the energy of the recovered frame unpredictable, G_noise is limited to [0, 2].
In summary, in the embodiments of the present invention, the historical frame information and the future frame information in the speech code stream sequence are first acquired, and then the formant spectrum information, pitch values, fixed codebook gain, adaptive codebook gain, energy, and so on of the current lost frame in the speech signal are estimated according to the historical frame information and the future frame information. By using the historical frame information and the future frame information together to compensate for the lost frame, the accuracy of frame loss compensation is improved.
The method of the embodiments of the present application has been described in detail above; the apparatus of the embodiments of the present application is provided below.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a frame loss compensation apparatus provided by an embodiment of the present application. The apparatus may be a vocoder and may include, for example, a receiving module 501, an obtaining module 502, and a processing module 503, which are described in detail as follows.
The receiving module 501 is configured to receive a speech code stream sequence.
The obtaining module 502 is configured to acquire historical frame information and future frame information in the speech code stream sequence, where the speech code stream sequence includes frame information of a plurality of speech frames, the plurality of speech frames include at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame precedes the at least one current frame in the time domain, the at least one future frame follows the at least one current frame in the time domain, the historical frame information is the frame information of the at least one historical frame, and the future frame information is the frame information of the at least one future frame.
The processing module 503 is configured to estimate the frame information of the at least one current frame according to the historical frame information and the future frame information.
The speech code stream sequence is stored in a buffer.
Optionally, the processing module 503 is specifically configured to: decode the frame information of the plurality of speech frames of the speech code stream sequence in the buffer to obtain the decoded historical frame information; and acquire the undecoded future frame information from the buffer.
The historical frame information includes the formant spectrum information of the at least one historical frame, and the future frame information includes the formant spectrum information of the at least one future frame.
Optionally, the processing module 503 is specifically configured to determine the formant spectrum information of the at least one current frame according to the formant spectrum information of the historical frame and the formant spectrum information of the future frame.
The historical frame information includes the pitch values of the at least one historical frame, and the future frame information includes the pitch values of the at least one future frame.
Optionally, the processing module 503 is specifically configured to determine the pitch values of the at least one current frame according to the pitch values of the at least one historical frame and the pitch values of the at least one future frame.
The historical frame information includes the energy of the at least one historical frame, and the future frame information includes the energy of the at least one future frame.
Optionally, the processing module 503 is specifically configured to determine the energy of the at least one current frame according to the energy of the at least one historical frame and the energy of the at least one future frame.
Optionally, the processing module 503 is specifically configured to: determine the frame type of the at least one current frame, where the frame type includes unvoiced or voiced; and determine at least one of the adaptive codebook gain and the fixed codebook gain of the at least one current frame according to the frame type.
Optionally, the processing module 503 is further configured to: determine the magnitude of the spectral tilt of the at least one current frame; and determine the frame type of the at least one current frame according to the magnitude of the spectral tilt of the at least one current frame.
Optionally, the processing module 503 is further configured to: acquire the pitch change state of a plurality of subframes in the at least one current frame; and determine the frame type of the at least one current frame according to the pitch change state of the plurality of subframes.
Optionally, the processing module 503 is specifically configured to: if the frame type is voiced, determine the adaptive codebook gain of the at least one current frame according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and take the average of the fixed codebook gains of a plurality of historical frames as the fixed codebook gain of the at least one current frame.
Optionally, the processing module 503 is specifically configured to: if the frame type is unvoiced, determine the fixed codebook gain of the at least one current frame according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and take the average of the adaptive codebook gains of a plurality of historical frames as the adaptive codebook gain of the at least one current frame.
Optionally, the processing module 503 is further configured to determine the energy gain of the at least one current frame according to the magnitude of the time-domain signal in the decoded historical frame information and the length of each subframe in the historical frame.
It should be noted that, for the specific function implementation of each module, reference may also be made to the corresponding description of the method embodiment shown in FIG. 3, performing the methods and functions performed in the foregoing embodiments.
Please continue to refer to FIG. 6, which is a schematic structural diagram of a frame loss compensation device proposed by the present application. The device may include: at least one vocoder 601, for example an Adaptive Multi-Rate Wideband (AMR-WB) vocoder; at least one communication interface 602; at least one memory 603; and at least one communication bus 604. The communication bus 604 is used to implement connection and communication between these components. The communication interface 602 of the device in the embodiments of the present application is used for signaling or data communication with other node devices. The memory 603 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, the memory 603 may also be at least one storage device located away from the vocoder 601. The memory 603 stores a set of program code and may further be used to store temporary data such as intermediate operation data of the vocoder 601. The vocoder 601 executes the program code in the memory 603 to implement the methods mentioned in the previous embodiments; for details, refer to the description of the previous embodiments. Further, the vocoder 601 may also cooperate with the memory 603 and the communication interface 602 to perform the operations of the receiving device in the above application embodiments. The vocoder 601 may specifically include a processor that executes the program code, such as a central processing unit (CPU) or a digital signal processor (DSP). For example, the communication interface 602 may be used to receive the speech code stream sequence.
It can be understood that the memory 603 may store no program code, in which case the vocoder 601 may include a hardware processor that does not need to execute program code, such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a hardware accelerator formed by an integrated circuit. In this case, the memory 603 may be used only to store temporary data such as intermediate operation data of the vocoder 601.
In the above embodiments, the functions of the method may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer or its internal processor, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Claims (23)

  1. A frame loss compensation method, characterized in that the method comprises:
    receiving a speech bitstream sequence;
    obtaining historical frame information and future frame information from the speech bitstream sequence, wherein the speech bitstream sequence comprises frame information of a plurality of speech frames, the plurality of speech frames comprising at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame preceding the at least one current frame in the time domain, the at least one future frame following the at least one current frame in the time domain, the historical frame information being the frame information of the at least one historical frame, and the future frame information being the frame information of the at least one future frame; and
    estimating the frame information of the at least one current frame according to the historical frame information and the future frame information.
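By way of illustration only (not part of the claims), the estimation step can be sketched as a weighted interpolation between the parameters of the nearest decoded historical frame and the nearest buffered future frame. The equal weighting, the parameter names, and the parameter set shown here are assumptions; the claims do not mandate any particular combination rule.

```python
def estimate_lost_frame(history, future, alpha=0.5):
    """Estimate the parameters of a lost current frame by interpolating
    between a historical frame and a future frame. `alpha` weights the
    historical frame (an illustrative choice, not from the claims)."""
    estimated = {}
    for key in history:
        h, f = history[key], future[key]
        if isinstance(h, list):  # vector parameters, e.g. formant spectrum coefficients
            estimated[key] = [alpha * a + (1 - alpha) * b for a, b in zip(h, f)]
        else:                    # scalar parameters, e.g. pitch value or frame energy
            estimated[key] = alpha * h + (1 - alpha) * f
    return estimated

# Hypothetical per-frame parameters for a decoded historical frame and a
# buffered future frame surrounding the lost frame.
hist = {"formant_lsf": [0.1, 0.3, 0.5], "pitch": 60.0, "energy": 1000.0}
fut = {"formant_lsf": [0.2, 0.4, 0.6], "pitch": 64.0, "energy": 800.0}
est = estimate_lost_frame(hist, fut)
```

The same interpolation pattern covers the formant spectrum information, pitch values, and energy addressed by the dependent claims below.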
  2. The method according to claim 1, characterized by further comprising: storing the speech bitstream sequence in a buffer;
    wherein the obtaining historical frame information and future frame information from the speech bitstream sequence comprises:
    decoding the frame information of the plurality of speech frames of the speech bitstream sequence in the buffer to obtain the decoded historical frame information; and
    obtaining the undecoded future frame information from the buffer.
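For illustration (not part of the claims), the buffer of claim 2 can be modeled as a jitter buffer in which frames that have passed through the decoder become the historical frame information, while packets still waiting to be decoded supply the future frame information. The class, method names, and buffer depth below are illustrative assumptions; `_decode` stands in for the real vocoder.

```python
from collections import deque

class FrameBuffer:
    """Toy jitter buffer: decoded historical frames are kept alongside
    still-encoded future packets."""

    def __init__(self, depth=5):
        self.packets = deque(maxlen=depth)  # undecoded bitstream packets (future)
        self.decoded = deque(maxlen=depth)  # decoded frame information (history)

    def push(self, packet):
        self.packets.append(packet)

    def decode_next(self):
        # Decoding moves a packet from the undecoded queue into history.
        pkt = self.packets.popleft()
        self.decoded.append(self._decode(pkt))

    @staticmethod
    def _decode(packet):
        # Placeholder for the vocoder's decoding step.
        return {"seq": packet["seq"], "params": packet["payload"]}

    def history_info(self):
        return list(self.decoded)   # decoded historical frame information

    def future_info(self):
        return list(self.packets)   # undecoded future frame information

buf = FrameBuffer()
for seq in range(3):
    buf.push({"seq": seq, "payload": [seq] * 2})
buf.decode_next()  # frame 0 becomes history; frames 1 and 2 remain future
```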
  3. The method according to claim 1 or 2, characterized in that the historical frame information comprises formant spectrum information of the at least one historical frame, and the future frame information comprises formant spectrum information of the at least one future frame; and
    the estimating the frame information of the at least one current frame according to the historical frame information and the future frame information comprises:
    determining formant spectrum information of the at least one current frame according to the formant spectrum information of the at least one historical frame and the formant spectrum information of the at least one future frame.
  4. The method according to any one of claims 1 to 3, characterized in that the historical frame information comprises a pitch value of the at least one historical frame, and the future frame information comprises a pitch value of the at least one future frame; and
    the estimating the frame information of the at least one current frame according to the historical frame information and the future frame information comprises:
    determining a pitch value of the at least one current frame according to the pitch value of the at least one historical frame and the pitch value of the at least one future frame.
  5. The method according to any one of claims 1 to 4, characterized in that the historical frame information comprises the energy of the at least one historical frame, and the future frame information comprises the energy of the at least one future frame; and
    the estimating the frame information of the at least one current frame according to the historical frame information and the future frame information comprises:
    determining the energy of the at least one current frame according to the energy of the at least one historical frame and the energy of the at least one future frame.
  6. The method according to any one of claims 1 to 5, characterized in that the estimating the frame information of the at least one current frame according to the historical frame information and the future frame information comprises:
    determining a frame type of the at least one current frame, the frame type being unvoiced or voiced; and
    determining, according to the frame type, at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame.
  7. The method according to claim 6, characterized in that the determining a frame type of the at least one current frame comprises:
    determining the magnitude of the spectral tilt of the at least one current frame; and
    determining the frame type of the at least one current frame according to the magnitude of the spectral tilt of the at least one current frame.
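For illustration (not part of the claims), one common proxy for spectral tilt is the first-order normalized autocorrelation of the time-domain signal: it is close to +1 for low-frequency-dominated, voiced-like signals and near zero or negative for noise-like, unvoiced signals. The 0.3 decision threshold below is an illustrative assumption, not a value taken from this disclosure.

```python
import math

def spectral_tilt(samples):
    """First-order normalized autocorrelation as a spectral-tilt proxy:
    near +1 for voiced-like signals, near 0 or negative for noise-like ones."""
    num = sum(a * b for a, b in zip(samples[:-1], samples[1:]))
    den = sum(s * s for s in samples)
    return num / den if den else 0.0

def classify_frame(samples, threshold=0.3):
    # The threshold is an assumed illustrative value.
    return "voiced" if spectral_tilt(samples) > threshold else "unvoiced"

# A slow sine (voiced-like) versus an alternating-sign, high-frequency
# signal (unvoiced-like).
voiced_like = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
unvoiced_like = [(-1) ** n for n in range(200)]
```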
  8. The method according to claim 6, characterized in that the determining a frame type of the at least one current frame comprises:
    obtaining the pitch change state of a plurality of subframes in the at least one current frame; and
    determining the frame type of the at least one current frame according to the pitch change state of the plurality of subframes.
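For illustration (not part of the claims), a pitch-change-state test can exploit the fact that voiced speech has a stable pitch across consecutive subframes, while unvoiced speech does not. The relative-deviation tolerance below is an illustrative assumption.

```python
def pitch_stable(subframe_pitches, rel_tol=0.15):
    """True when consecutive subframe pitch values stay within `rel_tol`
    relative deviation of each other (stable pitch suggests voiced speech;
    the tolerance is an assumed illustrative value)."""
    for prev, cur in zip(subframe_pitches, subframe_pitches[1:]):
        if prev == 0 or abs(cur - prev) / prev > rel_tol:
            return False
    return True

def classify_by_pitch(subframe_pitches):
    return "voiced" if pitch_stable(subframe_pitches) else "unvoiced"

stable = classify_by_pitch([60, 61, 62, 60])    # small pitch variation
erratic = classify_by_pitch([60, 95, 40, 110])  # large pitch jumps
```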
  9. The method according to any one of claims 6 to 8, characterized in that the determining, according to the frame type, at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame comprises:
    if the frame type is voiced, determining the adaptive codebook gain of the at least one current frame according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and using the average of the fixed codebook gains of a plurality of historical frames as the fixed codebook gain of the at least one current frame.
  10. The method according to any one of claims 6 to 9, characterized in that the determining, according to the frame type, at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame comprises:
    if the frame type is unvoiced, determining the fixed codebook gain of the at least one current frame according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and using the average of the adaptive codebook gains of a plurality of historical frames as the adaptive codebook gain of the at least one current frame.
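Claims 9 and 10 mirror each other: the gain of the dominant excitation source (adaptive codebook for voiced frames, fixed codebook for unvoiced) is rederived from one historical frame, while the other gain is simply the historical average. The sketch below illustrates that symmetry; the specific scaling formula tying the last gain to the pitch period and energy gain is an assumed illustrative form, since the claims do not give one.

```python
def estimate_gains(frame_type, hist_adaptive, hist_fixed,
                   last_pitch_period, energy_gain):
    """Sketch of claims 9-10. `hist_adaptive`/`hist_fixed` are gain
    histories; the derivation formula is an illustrative assumption."""
    mean = lambda xs: sum(xs) / len(xs)
    # Assumed rule: scale the last historical gain by the energy gain,
    # with a pitch-period-dependent damping (purely illustrative).
    derive = lambda last_gain: last_gain * energy_gain * min(1.0, 50.0 / last_pitch_period)
    if frame_type == "voiced":
        return {"adaptive": derive(hist_adaptive[-1]), "fixed": mean(hist_fixed)}
    return {"adaptive": mean(hist_adaptive), "fixed": derive(hist_fixed[-1])}

voiced_gains = estimate_gains("voiced", [0.8, 0.9], [0.2, 0.4],
                              last_pitch_period=100, energy_gain=1.0)
unvoiced_gains = estimate_gains("unvoiced", [0.8, 0.9], [0.2, 0.4],
                                last_pitch_period=100, energy_gain=1.0)
```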
  11. The method according to claim 9 or 10, characterized in that the method further comprises:
    determining the energy gain of the at least one current frame according to the magnitude of the time-domain signal in the decoded historical frame information and the length of each subframe in the historical frame.
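For illustration (not part of the claims), the energy gain of claim 11 can be derived from the decoded historical time-domain signal, normalized by the subframe length. Using the RMS ratio of the last two historical subframes is an assumed concrete form; a decaying signal then yields a gain below one, so the compensated frame fades rather than ringing on.

```python
def energy_gain(decoded_history, subframe_len):
    """Energy gain from the decoded historical time-domain signal,
    computed per subframe (the RMS-ratio rule is an illustrative
    assumption, not the claimed formula)."""
    last = decoded_history[-subframe_len:]
    prev = decoded_history[-2 * subframe_len:-subframe_len]
    rms = lambda xs: (sum(x * x for x in xs) / len(xs)) ** 0.5
    prev_rms = rms(prev)
    return rms(last) / prev_rms if prev_rms else 1.0

# A signal whose amplitude halves over the last subframe gives gain 0.5.
sig = [4.0] * 40 + [2.0] * 40
gain = energy_gain(sig, subframe_len=40)
```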
  12. A frame loss compensation apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive a speech bitstream sequence;
    an obtaining module, configured to obtain historical frame information and future frame information from the speech bitstream sequence, wherein the speech bitstream sequence comprises frame information of a plurality of speech frames, the plurality of speech frames comprising at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame preceding the at least one current frame in the time domain, the at least one future frame following the at least one current frame in the time domain, the historical frame information being the frame information of the at least one historical frame, and the future frame information being the frame information of the at least one future frame; and
    a processing module, configured to estimate the frame information of the at least one current frame according to the historical frame information and the future frame information.
  13. The apparatus according to claim 12, characterized in that the speech bitstream sequence is stored in a buffer; and
    the obtaining module is specifically configured to:
    decode the frame information of the plurality of speech frames of the speech bitstream sequence in the buffer to obtain the decoded historical frame information; and
    obtain the undecoded future frame information from the buffer.
  14. The apparatus according to claim 12 or 13, characterized in that the historical frame information comprises formant spectrum information of the at least one historical frame, and the future frame information comprises formant spectrum information of the at least one future frame; and
    the processing module is specifically configured to:
    determine formant spectrum information of the at least one current frame according to the formant spectrum information of the historical frame and the formant spectrum information of the future frame.
  15. The apparatus according to any one of claims 12 to 14, characterized in that the historical frame information comprises a pitch value of the at least one historical frame, and the future frame information comprises a pitch value of the at least one future frame; and
    the processing module is specifically configured to:
    determine a pitch value of the at least one current frame according to the pitch value of the at least one historical frame and the pitch value of the at least one future frame.
  16. The apparatus according to any one of claims 12 to 15, characterized in that the historical frame information comprises the energy of the at least one historical frame, and the future frame information comprises the energy of the at least one future frame; and
    the processing module is specifically configured to:
    determine the energy of the at least one current frame according to the energy of the at least one historical frame and the energy of the at least one future frame.
  17. The apparatus according to any one of claims 12 to 16, characterized in that the processing module is specifically configured to:
    determine a frame type of the at least one current frame, the frame type being unvoiced or voiced; and
    determine, according to the frame type, at least one of an adaptive codebook gain and a fixed codebook gain of the at least one current frame.
  18. The apparatus according to claim 17, characterized in that the processing module is further configured to:
    determine the magnitude of the spectral tilt of the at least one current frame; and
    determine the frame type of the at least one current frame according to the magnitude of the spectral tilt of the at least one current frame.
  19. The apparatus according to claim 17, characterized in that the processing module is further configured to:
    obtain the pitch change state of a plurality of subframes in the at least one current frame; and
    determine the frame type of the at least one current frame according to the pitch change state of the plurality of subframes.
  20. The apparatus according to any one of claims 17 to 19, characterized in that the processing module is specifically configured to:
    if the frame type is voiced, determine the adaptive codebook gain of the at least one current frame according to the adaptive codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and use the average of the fixed codebook gains of a plurality of historical frames as the fixed codebook gain of the at least one current frame.
  21. The apparatus according to any one of claims 17 to 19, characterized in that the processing module is specifically configured to:
    if the frame type is unvoiced, determine the fixed codebook gain of the at least one current frame according to the fixed codebook gain and pitch period of one historical frame and the energy gain of the at least one current frame, and use the average of the adaptive codebook gains of a plurality of historical frames as the adaptive codebook gain of the at least one current frame.
  22. The apparatus according to claim 20 or 21, characterized in that the processing module is further configured to determine the energy gain of the at least one current frame according to the magnitude of the time-domain signal in the decoded historical frame information and the length of each subframe in the historical frame.
  23. A frame loss compensation device, characterized by comprising: a memory, a communication bus, and a vocoder, the memory being coupled to the vocoder through the communication bus, wherein the memory is configured to store program code, and the vocoder is configured to invoke the program code to perform the following operations:
    receiving a speech bitstream sequence;
    obtaining historical frame information and future frame information from the speech bitstream sequence, wherein the speech bitstream sequence comprises frame information of a plurality of speech frames, the plurality of speech frames comprising at least one historical frame, at least one current frame, and at least one future frame, the at least one historical frame preceding the at least one current frame in the time domain, the at least one future frame following the at least one current frame in the time domain, the historical frame information being the frame information of the at least one historical frame, and the future frame information being the frame information of the at least one future frame; and
    estimating the frame information of the at least one current frame according to the historical frame information and the future frame information.
Publications (1)

Publication Number Publication Date
WO2019000178A1 (en)

Family

ID=64740767


Country Status (2)

Country Link
CN (1) CN109496333A (en)
WO (1) WO2019000178A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111836117B (en) * 2019-04-15 2022-08-09 深信服科技股份有限公司 Method and device for sending supplementary frame data and related components
CN111711992B (en) * 2020-06-23 2023-05-02 瓴盛科技有限公司 CS voice downlink jitter calibration method
CN112489665B (en) * 2020-11-11 2024-02-23 北京融讯科创技术有限公司 Voice processing method and device and electronic equipment
CN112634912B (en) * 2020-12-18 2024-04-09 北京猿力未来科技有限公司 Packet loss compensation method and device

Citations (6)

Publication number Priority date Publication date Assignee Title
JP2004239930A (en) * 2003-02-03 2004-08-26 Iwatsu Electric Co Ltd Method and system for detecting pitch in packet loss compensation
CN101147190A (en) * 2005-01-31 2008-03-19 高通股份有限公司 Frame erasure concealment in voice communications
CN101894558A (en) * 2010-08-04 2010-11-24 华为技术有限公司 Lost frame recovering method and equipment as well as speech enhancing method, equipment and system
CN102449690A (en) * 2009-06-04 2012-05-09 高通股份有限公司 Systems and methods for reconstructing an erased speech frame
CN103714820A (en) * 2013-12-27 2014-04-09 广州华多网络科技有限公司 Packet loss hiding method and device of parameter domain
CN106251875A (en) * 2016-08-12 2016-12-21 广州市百果园网络科技有限公司 The method of a kind of frame losing compensation and terminal

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CA2388439A1 (en) * 2002-05-31 2003-11-30 Voiceage Corporation A method and device for efficient frame erasure concealment in linear predictive based speech codecs
KR100542435B1 (en) * 2003-09-01 2006-01-11 한국전자통신연구원 Method and apparatus for frame loss concealment for packet network
US8255207B2 (en) * 2005-12-28 2012-08-28 Voiceage Corporation Method and device for efficient frame erasure concealment in speech codecs
CN101009098B (en) * 2007-01-26 2011-01-26 清华大学 Sound coder gain parameter division-mode anti-channel error code method
KR100998396B1 (en) * 2008-03-20 2010-12-03 광주과학기술원 Method And Apparatus for Concealing Packet Loss, And Apparatus for Transmitting and Receiving Speech Signal
CN101630242B (en) * 2009-07-28 2011-01-12 苏州国芯科技有限公司 Contribution module for rapidly computing self-adaptive code book by G723.1 coder
CN103325375B (en) * 2013-06-05 2016-05-04 上海交通大学 One extremely low code check encoding and decoding speech equipment and decoding method


Also Published As

Publication number Publication date
CN109496333A (en) 2019-03-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17915498

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17915498

Country of ref document: EP

Kind code of ref document: A1