CN118230741A - Low-rate voice encoding and decoding method based on sine harmonic model - Google Patents

Low-rate voice encoding and decoding method based on sine harmonic model

Info

Publication number: CN118230741A
Application number: CN202410397584.4A
Filing date: 2024-04-03
Publication date: 2024-06-21
Authority: CN (China)
Legal status: Pending
Prior art keywords: speech, signal, harmonic, frame, phase
Other languages: Chinese (zh)
Inventors: 郑展恒 (Zheng Zhanheng), 李华健 (Li Huajian)
Assignee: Guilin University of Electronic Technology
Application filed by Guilin University of Electronic Technology


Abstract

The invention discloses a low-rate speech encoding and decoding method based on a sinusoidal harmonic model. The method builds on speech parameter coding and improves speech parameter extraction, in particular the unvoiced/voiced decision and pitch parameter extraction, so that the synthesized speech is more accurate and more robust. At the synthesis end, the time-varying set of parameters is reduced to the fixed number of parameters required by a fixed bit rate, meeting the needs of low-rate speech codec applications. To maintain good performance under different speaker and background-noise conditions, a post-filtering algorithm and parameter-correction measures are added to suppress background noise. The method maintains good speech quality while offering high compression efficiency, low complexity, low delay, and a degree of robustness.

Description

Low-rate voice encoding and decoding method based on sine harmonic model
Technical Field
The invention relates to the technical field of digital voice signal compression, in particular to a low-rate voice encoding and decoding method based on a sine harmonic model.
Background
The purpose of low-rate speech compression is to reduce the coding bit rate of speech signals as much as possible, meeting different application requirements while guaranteeing speech quality. Low-rate speech compression techniques are widely used in mobile communication, satellite communication, IP telephony, multimedia, speech synthesis, and other fields.
However, current low-rate speech compression techniques have the following limitations: 1) Reducing the data volume of the speech signal also distorts it, lowering speech quality and intelligibility. 2) To improve compression efficiency, low-rate speech compression often relies on relatively complex algorithms, such as parametric coding based on a speech production model, hybrid coding, and code-excited linear prediction coding. The complexity and delay of these algorithms are higher than those of waveform coding, which creates difficulties for real-time processing and transmission of speech. 3) Low-rate speech compression usually requires extracting and modelling parametric features of the speech signal, but speech signals are diverse and complex (different languages, dialects, accents, genders, ages, emotions, etc.), which makes parametric feature extraction complicated and inaccurate.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a low-rate speech encoding and decoding method based on a sinusoidal harmonic model. The method maintains good speech quality at a low coding rate, achieves low complexity and low delay, and has a degree of robustness.
The technical scheme for realizing the aim of the invention is as follows:
a low-rate voice encoding and decoding method based on a sine harmonic model comprises the following steps:
1) The speech signal is first converted to the frequency domain by performing a Short-Time Fourier Transform (STFT) on the speech block signal:
where k is the frequency bin index and l is the frame index, w(n) is a window function, N is the frame length, and M is the frame shift.
2) Model parameters are extracted.
2-1) Extract the pitch (fundamental frequency) of the speech frame. Pitch parameter extraction is divided into three stages. A preliminary pitch extraction stage determines a set of candidate values from frames of the input speech. First, the speech frame is squared in the time domain to obtain the squared time-domain signal s²(n):
A notch filter is then used to remove the DC component of s²(n), followed by low-pass filtering, and finally the signal is decimated by a factor of 5. The decimated signal is windowed and the spectrum Z(k) of a 512-point DFT is calculated by padding the decimated signal with 448 zeros. The power spectrum U(k) = |Z(k)|² of the squared signal is then obtained, and its local maxima are used as pitch candidates ω_v.
A post-processing stage makes a decision by evaluating a cost function E(ω_v) at each candidate fundamental frequency ω_v, selecting the candidate with the smallest E(ω_v) as the pitch estimate for the current frame.
where E(ω_v) is the error between the original speech and the speech synthesized using the harmonic model.
A pitch estimation refinement stage further refines ω_0 and increases the estimation accuracy. This stage uses a low-complexity pitch refinement step. A cost function is defined:
where S_w is the frequency spectrum of the windowed frame.
The function simply samples the power spectrum of the windowed speech signal for that frame, with the argument of S_w rounded to the nearest integer. In this step the function is sampled in two passes. First, with a sampling step of 1, the function is sampled over a range of plus or minus 5 samples of the pitch period. Then, with a sampling step of 0.25, it is sampled over a range of plus or minus one sample of the pitch period. The candidate with the smallest sampled value is taken as the optimally estimated pitch.
2-2) Unvoiced/voiced decision. The method uses a signal-to-noise ratio (SNR) based decision to determine whether the frame is unvoiced or voiced, and checks and optimizes the decision afterwards. The frame is fitted using the model parameters estimated by the harmonic model, and the fitting result is mapped onto an SNR, whose frequency-domain expression is:
By comparison with a fixed threshold of 6 dB, i.e. 10 lg SNR > 6, the energy in the band is determined to be voiced, and unvoiced otherwise. The method then applies the ratio of low-frequency to high-frequency energy to check whether a previously made unvoiced/voiced decision is correct, and corrects it if it is not.
2-3) Linear prediction coefficient extraction. Direct quantization of the harmonic amplitudes would require a large number of bits. Using 10th-order linear prediction, the time-domain analysis produces a set of p LPC coefficients {a_k} and an LPC gain function G for generating the harmonic amplitudes at the synthesis end; these parameters are found by:
Wherein, S (n) is a time domain speech signal.
The coefficients {a_k} are solved by the Levinson-Durbin algorithm.
The energy E of the predicted signal may be determined by a linear prediction coefficient a k and an autocorrelation function R (k):
The linear prediction coefficients a_k are then converted into Line Spectrum Pair (LSP) form. The LSPs are obtained by solving for the complex-conjugate roots of the (p+1)th-order symmetric and antisymmetric polynomials.
3) The parameters are quantized and encoded. Uniform scalar quantization is used for the voiced/unvoiced state, the pitch ω_0 and the predicted signal energy E. The required vector quantizer for the LSP parameters is trained using a K-means vector quantizer design algorithm, after which the LSP parameters are quantized and encoded. Vector quantization encodes several values at a time, so only one index is needed to reference them, and correlations in the input data can be taken into account. The present invention also assigns more bits to the low-order LSPs, because the decoded speech quality is more sensitive to low-order LSP errors; coarser quantization is used for the higher LSPs.
4) And synthesizing voice. The previously quantized encoded parameters need to be decoded before synthesizing speech.
4-1) Determine the sinusoidal amplitudes {B_m} for the synthesized speech signal. A bandwidth expansion algorithm is applied to the decoded LSP parameters, the parameters are then interpolated, and the interpolated line spectrum pairs are converted back into linear prediction coefficients. The Fourier transform of the linear prediction coefficients yields spectral amplitude samples, from which {B_m} is determined using the average energy E of each band. The specific steps are: first, the power spectrum is obtained from the spectral amplitude samples:
The energy of the mth harmonic of the synthesized signal is denoted E_m, and B_m is then:
4-2) Synthesize the phase using an approach based on the unvoiced/voiced state and on rules. A pulse train in the time domain is equivalent to harmonics in the frequency domain, and the phase of each harmonic can be modelled as the phase of an LPC filter excited by the pulse. Since the phase of unvoiced sound tends to be random, the following phase synthesis is performed when the frame is voiced. Because the method does not transmit the pulse positions of this model, they need to be synthesized. The excitation pulses occur at a rate ω_0, so the phase of the first harmonic advances by ω_0 × 80 over one synthesis frame (80 samples). The excitation phase of the first harmonic is therefore set to:
arg(Ex[1])=ω0*80 (12),
Where E x [1] is the first complex excitation (frequency domain) sample and arg (z) is the phase of the complex sample z.
Then, the phase of the mth excitation harmonic is correlated with the phase of the first harmonic as:
arg(Ex[m])=m*arg(Ex[1]) 1<m≤L (13),
The final harmonic phases are determined by passing the resulting E_x[m] through the LPC synthesis filter.
To reduce errors in the presence of background noise, the phase parameters are modified before synthesizing the speech. When the average energy e is below the threshold and the frame is unvoiced, the background noise estimate is updated:
where β is the background noise and its initial value is zero.
When the frame is voiced and a harmonic amplitude A_m is less than the threshold τ, the phase of that harmonic is perturbed, so that any harmonic smaller than the background estimate receives a random phase.
4-3) Constructing a frequency domain synthesized signal by using sinusoidal model parameters of the current frame (i.e., the model parameters generated above)
where this signal represents the DFT of the synthesized speech signal and consists of pulses spaced ω_0 apart, each weighted by the complex harmonic amplitude A_m.
4-4) Further noise reduction using an OM-LSA post-filter. The noise-reduced synthesized speech spectrum estimate is obtained from the frequency-domain synthesized speech signal through a spectral gain filter G(k, l):
where the left-hand side is the noise-reduced signal, the input is the frequency-domain synthesized speech signal, and G(k, l) is the spectral gain filter. The spectral gain is calculated as follows:
where ξ(k, l) is the a priori signal-to-noise ratio, p(k, l) is the probability of speech presence, and q(k, l) is the probability of speech absence; G_H1 is the conditional gain when speech is present, and G_min is the lower gain limit of the filter when speech is absent.
4-5) Overlap-add and inverse fourier transform. In order to reconstruct a continuous synthesized speech waveform, it is necessary to smoothly connect adjacent noise-reduced synthesized speech spectra. This is performed by windowing each frame and then shifting and superimposing adjacent frames using an overlap-add algorithm. A triangular window is used for this algorithm and is defined by:
And finally, performing inverse Fourier transform to recover the time domain signal.
The invention has the advantages and beneficial effects that:
1. By improving the pitch extraction algorithm, the invention achieves stronger robustness in parameter estimation, especially in pitch estimation and in the unvoiced/voiced decision.
2. The present invention reduces the time-varying set of parameters to the fixed number of parameters required by a fixed bit rate.
3. Under different speaker and background noise conditions, the speech can be correctly encoded, and meanwhile, better speech quality is maintained.
Drawings
FIG. 1 is a schematic flow chart of a speech encoding and decoding method according to an embodiment;
FIG. 2 is a flow chart of an improved pitch estimation algorithm according to an embodiment;
Fig. 3 compares the speech spectrum before compression with the spectrum after compression by the method of the present invention; the compressed spectrum is restored close to the original spectrum, which shows that the method preserves the intelligibility of the original speech and the speech quality after compression.
Detailed Description
The present invention will now be further described with reference to the accompanying drawings and examples, which are not intended to limit the scope of the invention.
Referring to fig. 1, a low-rate speech coding and decoding method based on a sinusoidal harmonic model includes the steps of:
1) The speech signal is first converted to the frequency domain. A speech signal is picked up by a microphone at a sampling rate of 8 kHz. To enable real-time encoding and decoding, the signal is acquired and processed in speech blocks of 320 samples (40 ms). The following model parameter extraction and estimation are performed on each speech block.
An STFT of N_dft = 512 points is then applied to the speech block signal to obtain the frequency-domain speech signal:
where k is the frequency bin index and l is the frame index. The window w(n) is a Hamming window, the frame length N of the discrete Fourier transform is 80, and the frame shift M is 80.
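For illustration, a minimal NumPy sketch of this framing and zero-padded STFT is given below; the constant names and the random test block are illustrative and not taken from the patent text.

```python
# Minimal sketch of the framing and 512-point STFT of step 1: 8 kHz input,
# 320-sample (40 ms) blocks, 80-sample frames with an 80-sample shift,
# Hamming window, zero-padded to N_dft = 512.
import numpy as np

FS = 8000          # sampling rate (Hz)
BLOCK = 320        # speech block length (40 ms)
N = 80             # frame length
M = 80             # frame shift
N_DFT = 512        # DFT size

def stft_frames(block: np.ndarray) -> np.ndarray:
    """Return S_w(k, l): one 512-point spectrum per 80-sample frame."""
    assert len(block) == BLOCK
    win = np.hamming(N)
    spectra = []
    for start in range(0, BLOCK - N + 1, M):
        frame = block[start:start + N] * win
        spectra.append(np.fft.fft(frame, n=N_DFT))   # zero-padded DFT
    return np.array(spectra)                          # shape: (num_frames, N_DFT)

# Example: four frames per 40 ms block
speech_block = np.random.randn(BLOCK)
S_w = stft_frames(speech_block)
print(S_w.shape)   # (4, 512)
```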
2) Model parameter extraction and estimation. From knowledge about the processing of the speech signal, the voiced speech signal can be constructed as a sinusoidal harmonic model:
where {B_m} and {θ_m} are the amplitudes and phases of the sinusoids, the fundamental frequency is ω_0 = 2πF_0/F_s, L is the number of harmonics, and F_0, F_s denote the pitch frequency and the sampling frequency, respectively.
This model is problematic for unvoiced sounds, but with an appropriate choice of harmonic phases the sinusoidal harmonic model can also represent unvoiced sounds reasonably well.
To synthesize speech using the sinusoidal harmonic model, the decoder side needs to generate L sinusoids with estimated amplitudes {B_m}, fundamental frequency ω_0 and phases {θ_m}. To reduce the bit rate, the method discards the phase of each sine wave and regenerates it at the decoder using a rule-based zero-phase model. The parameters that need to be extracted at the encoding end are therefore the linear prediction coefficients a_k, the predicted signal energy, the pitch ω_0 and the unvoiced/voiced state bit, which together can regenerate the amplitude {B_m} information.
The pitch ω_0 is extracted first. Pitch estimation is one of the most difficult problems in speech analysis, and ω_0 is also the most important parameter in this method. The method proposes an improved pitch estimation algorithm divided into three steps: a preliminary pitch extraction stage, a post-processing stage and a pitch estimation refinement stage.
In the preliminary pitch extraction stage, a set of candidate values is determined from frames of the input speech using a process based on a square-law nonlinearity. The spectrum after the square-law nonlinearity can be obtained from the following equation:
where X(k) is the N-point DFT of x(n), k is the frequency bin index and l is the frame index.
Equation (4) expresses that the product of two time-domain signals corresponds to the (circular) convolution of the discrete Fourier transforms of the two signals. Multiplying the time-domain signal by itself (squaring) therefore produces the autocorrelation of the DFT of the signal in the frequency domain. The amplitude of Z(k) will have peaks at the fundamental frequency F_0 and its multiples, which makes it a simple and effective pitch detector. A set of candidate pitch estimates can be determined by peak-picking the Discrete Fourier Transform (DFT).
The specific estimation steps are as follows. First, the speech block is squared in the time domain to obtain the squared time-domain signal s²(n):
When m − l = 1, the second term in square brackets introduces a large number of components at the fundamental frequency. When m = l, there is also a large direct current (DC) term. A notch filter is used here to remove the DC component; this prevents the large-amplitude DC term from interfering with the slightly smaller amplitude terms at the fundamental frequency, which matters particularly for male speech signals, since they may have low-frequency components close to DC. The DC notch filter is applied to the time-domain squared signal and has the transfer function:
The signal is then low-pass filtered, because all the energy in the squared signal above 400 Hz is superfluous and reduces the resolution of the frequency-domain peak picking. The FIR low-pass filter used here has order N = 48 and a cut-off frequency of 600 Hz:
The signal is then decimated by a decimation factor of 5. The decimated signal is windowed and a 512-point DFT is calculated by padding the decimated signal with 448 zeros, yielding Z(k). The power spectrum U(k) = |Z(k)|² is then obtained.
Typically, the global maximum of the DFT power spectrum U(k) of the squared signal corresponds to the pitch frequency F_0. Occasionally, however, the global maximum corresponds to a spurious peak or a multiple of F_0. It is therefore not appropriate simply to select the global maximum of U(k) as the fundamental estimate of the frame; instead, the local maxima of U(k) are passed as candidates to the post-processing stage for further analysis.
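A hedged sketch of this preliminary candidate stage follows. The notch-filter pole radius and the SciPy FIR design call are assumptions chosen to match the description (48th order, 600 Hz cut-off); with a 320-sample frame the decimated signal is 64 samples, hence the 448 zeros of padding.

```python
# Sketch of the preliminary pitch-candidate stage: square the frame, remove DC
# with a notch filter, low-pass filter, decimate by 5, window, zero-pad to 512
# points, and peak-pick the power spectrum U(k) = |Z(k)|^2.
import numpy as np
from scipy.signal import lfilter, firwin, argrelmax

FS = 8000
N_DFT = 512
DECIM = 5

def pitch_candidates(frame: np.ndarray, num_candidates: int = 5) -> np.ndarray:
    s2 = frame.astype(float) ** 2                        # square-law nonlinearity
    # DC notch filter H(z) = (1 - z^-1) / (1 - 0.95 z^-1)  (assumed pole radius)
    s2 = lfilter([1.0, -1.0], [1.0, -0.95], s2)
    # 48th-order FIR low-pass, 600 Hz cut-off (order and cut-off from the text)
    lp = firwin(49, 600.0, fs=FS)
    s2 = lfilter(lp, [1.0], s2)
    dec = s2[::DECIM]                                    # decimate by 5
    dec = dec * np.hamming(len(dec))
    Z = np.fft.rfft(dec, n=N_DFT)                        # zero-padded 512-point DFT
    U = np.abs(Z) ** 2                                   # power spectrum
    peaks = argrelmax(U)[0]                              # local maxima
    # keep the strongest local maxima as pitch candidates (DFT bin indices)
    order = peaks[np.argsort(U[peaks])[::-1]]
    return order[:num_candidates]
```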
In the post-processing stage, the post-processing algorithm evaluates each candidate local maximum determined by the pitch extractor and selects the best one as the pitch estimate for the frame. The decision is made by evaluating the cost function E(ω_v) at each candidate fundamental frequency ω_v. The cost function E(ω_v) is an analysis-by-synthesis measure in the frequency domain: for each possible fundamental frequency, harmonic model parameters are estimated assuming the frame is fully voiced and used to synthesize frequency-domain speech, and this estimate is then compared with the original frequency-domain speech in the MSE (minimum mean square error) sense to determine the best pitch estimate. The cost function E(ω_v) is defined as:
where the first term is the frequency-domain synthesized speech of the mth band, the second term is the original frequency-domain speech, and an amplitude estimate is made for each harmonic. G(k) is a frequency weighting function whose aim is to reduce the computational complexity by excluding all harmonics above 1000 Hz. The quantities a_m, b_m in formula (13) are:
E(ω_v) is then sampled over the entire pitch range, and the candidate with the smallest E(ω_v) is selected as the pitch estimate ω_0 for the current frame; for each candidate frequency, samples are taken in steps of 2.5 Hz within plus/minus 10 Hz.
Finally, in the pitch estimation refinement stage, ω_0 is further refined to increase the estimation accuracy. A common practice is to add a pitch tracking algorithm in which dynamic programming is used to evaluate the pitch estimates of several past and future frames. However, this clearly increases the amount of computation and the delay, so the present invention improves on it, reducing the computation while increasing the estimation accuracy. This stage is computationally very simple. A cost function is defined:
where S_w is the frequency spectrum of the windowed frame.
The function simply samples the power spectrum of the windowed speech signal for that frame, with the argument of S_w rounded to the nearest integer. The function is slightly biased towards longer pitch periods because it accumulates energy in the spectrum; however, since it is only sampled over a small frequency range, this bias is not significant. Most importantly, the function exhibits a local extremum near the fundamental of the frame. The function is then sampled in two passes. First, with a sampling step of 1, it is sampled over a range of plus or minus 5 samples of the pitch period. Then, with a sampling step of 0.25, it is sampled over a range of plus or minus one sample of the pitch period. The candidate with the smallest sampled value is taken as the optimally estimated pitch of the frame.
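The two-pass search can be sketched as follows. The exact cost function of this stage is not reproduced here, so it is passed in as an argument; the illustrative cost (summed harmonic energy, negated so that the minimum is selected) is only an assumption consistent with the description.

```python
# Sketch of the two-step pitch refinement: evaluate the cost on a coarse grid
# (+/- 5 samples of the pitch period, step 1), then on a fine grid
# (+/- 1 sample, step 0.25) around the coarse optimum.
import numpy as np
from typing import Callable

def refine_pitch_period(P0: float, cost: Callable[[float], float]) -> float:
    """Two-stage grid search around an initial pitch period P0 (in samples)."""
    coarse = np.arange(P0 - 5.0, P0 + 5.0 + 1e-9, 1.0)      # step: 1 sample
    P1 = coarse[np.argmin([cost(p) for p in coarse])]
    fine = np.arange(P1 - 1.0, P1 + 1.0 + 1e-9, 0.25)        # step: 0.25 sample
    return fine[np.argmin([cost(p) for p in fine])]

# Illustrative cost (an assumption): negative summed power-spectrum energy at
# the harmonics of candidate pitch period p.
def make_cost(Sw_power: np.ndarray, n_dft: int = 512):
    def cost(p: float) -> float:
        f0_bin = n_dft / p                                   # DFT bins per harmonic
        idx = np.round(np.arange(1, int(p // 2) + 1) * f0_bin).astype(int)
        idx = idx[idx < len(Sw_power)]
        return -float(np.sum(Sw_power[idx]))
    return cost
```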
For fully voiced speech, the model parameters of the all-voiced estimate will closely match the original speech, whereas completely unvoiced frames behave more like random white noise. It is therefore necessary to make a voiced/unvoiced decision on the speech signal so that the speech is correctly synthesized at the synthesis end. The method uses a signal-to-noise ratio (SNR) based decision on the unvoiced/voiced state and then checks and optimizes it.
The specific steps are as follows: the frame is fitted using the model parameters estimated by the harmonic model, and the fitting result is mapped onto an SNR, whose frequency-domain expression is:
where the synthesized term in the formula is given by formula (8), and the weighting function G(k) required by formula (8) is set to G(k) = 1.
As described above, the harmonic model assumes the speech of the analysis frame to be fully voiced; aperiodic (unvoiced) regions of the spectrum do not contain harmonics of the pitch, so the harmonic model breaks down in those regions. For fully voiced speech the signal-to-noise ratio is therefore large, because all model parameters of the voiced estimate closely match the original speech, while a completely unvoiced frame results in a very low SNR.
By comparison with a fixed threshold, here taken as 6 dB, i.e. 10 lg SNR > 6, the energy in the band is determined to be voiced, and unvoiced otherwise.
The method then applies the ratio of low frequency to high frequency energy to examine and correct the previous decisions:
where the amplitude estimate of each harmonic of the frame is given by equation (12), l_low = L×2000/(F_s/2), l_high = L×2000/(F_s/2), and L is the number of harmonics.
Voiced speech tends to be dominated by low-frequency energy, while unvoiced speech tends to be dominated by high-frequency energy, so this step is used to check whether the previous binary unvoiced/voiced decision is correct. When 10 lg E_ratio > 10.0 and the frame was previously determined to be unvoiced, it is re-determined to be voiced; when 10 lg E_ratio < -10.0 and the frame was determined to be voiced, it is re-determined to be unvoiced.
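A sketch of this decision logic under the stated thresholds is given below; the 2 kHz band edge follows the l_low/l_high definition above, and the function and variable names are illustrative.

```python
# Sketch of the voiced/unvoiced logic: an SNR-based decision against a 6 dB
# threshold, followed by the low/high-band energy-ratio correction step.
import numpy as np

def voiced_decision(snr_linear: float, A: np.ndarray, fs: int = 8000) -> bool:
    """A: per-harmonic amplitude estimates A_m, m = 1..L. Returns True if voiced."""
    voiced = 10.0 * np.log10(max(snr_linear, 1e-12)) > 6.0   # SNR threshold

    L = len(A)
    l_cut = max(1, int(L * 2000 / (fs / 2)))                  # 2 kHz band edge
    e_low = np.sum(A[:l_cut] ** 2)
    e_high = np.sum(A[l_cut:] ** 2) + 1e-12
    ratio_db = 10.0 * np.log10(e_low / e_high)

    if ratio_db > 10.0 and not voiced:        # low-band dominated -> voiced
        voiced = True
    elif ratio_db < -10.0 and voiced:         # high-band dominated -> unvoiced
        voiced = False
    return voiced
```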
Direct quantization of the harmonic amplitudes B_m would require a large number of bits, which is incompatible with the low rate targeted by this example. The spectral amplitude parameters are therefore modelled efficiently with intermediate Linear Prediction (LPC); after the model parameters are sent to the synthesis end they can be restored to a predicted spectrum, from which the harmonic amplitudes {B_m} are generated.
The method used here is to use a time domain analysis to obtain the LPC model, which generates a set of p LPC coefficients { a k } and an LPC gain function G, as well as the predicted signal energy E. The procedure for determining these parameters is as follows:
The excitation source X(z) drives the linear predictive synthesis filter H(z) to generate synthesized speech (the z-transform of the time-domain speech s) such that:
Where { a k } is a set of p linear prediction coefficients that characterize the frequency response of the filter, and G is a scalar gain factor. The number of poles in the all-pole filter is equal to the LPC order p.
The most suitable {a_k} and G must be calculated so that the synthesized speech is as close as possible to the original speech signal S(z). This can be achieved as follows. Ideally the synthesized speech and S(z) are identical, so S(z) is substituted for the synthesized speech and formula (18) is rearranged to obtain:
Taking the inverse z-transform:
If H(z) is a good approximation of S(z), the energy in the signal x(n) will be minimized, where the total energy is given by:
Therefore {a_k} can be found by setting the partial derivatives of equation (22) with respect to a_i to zero for i = 1, 2, …, p. This process yields p equations in p unknowns:
These are expressed in terms of the autocorrelation values R(k):
where the autocorrelation R(k) is:
Then a k can be solved by the Levinson-Durbin algorithm.
The predicted signal energy E is then further obtained:
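A self-contained sketch of the autocorrelation LPC analysis and the Levinson-Durbin recursion for p = 10, returning {a_k} and the prediction energy E, is given below; the sign convention A(z) = 1 + Σ a_k z⁻ᵏ is an assumption.

```python
# Sketch of the LPC analysis: biased autocorrelation, Levinson-Durbin recursion
# for the p = 10 coefficients {a_k}, and the prediction error energy E.
import numpy as np

def lpc_levinson(frame: np.ndarray, p: int = 10):
    """Return (a, E): a[0..p-1] are a_1..a_p, E is the prediction energy."""
    n = len(frame)
    R = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = R[0]
    for i in range(1, p + 1):
        # reflection coefficient
        k = -(R[i] + np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        a = a_new
        E *= (1.0 - k * k)
    return a[1:], E
```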
If the linear prediction coefficients a_k were encoded directly, the stability of the linear prediction synthesis filter H(z) could not be guaranteed. Therefore, after deriving the LPC model, the linear prediction coefficients a_k are converted into the equivalent Line Spectrum Pair (LSP) form for efficient quantization and transmission. The LSPs can be obtained by solving for the complex-conjugate roots of the (p+1)th-order symmetric and antisymmetric polynomials. The process is as follows:
Let the linear prediction inverse filter be A(z); the (p+1)th-order symmetric and antisymmetric polynomials are then:
P(z) and Q(z) are symmetric and antisymmetric real-coefficient polynomials, respectively, which have complex-conjugate roots.
Here cos ω_k and cos θ_k (k = 1, 2, …, p/2) are the representations of the LSP coefficients in the cosine domain.
The LSP coefficients are the values of cos ω_k and cos θ_k for which equation (27) equals zero; the solution can be found using Chebyshev polynomials.
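As an illustration of this conversion, the sketch below forms P(z) and Q(z) directly and takes the angles of their unit-circle roots with numpy.roots, instead of the Chebyshev-domain search described above; under the usual stability assumption it yields the same set of LSP frequencies.

```python
# Sketch of LPC -> LSP conversion via polynomial root finding.
import numpy as np

def lpc_to_lsp(a: np.ndarray) -> np.ndarray:
    """a: coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k. Returns p LSPs in rad."""
    A = np.concatenate(([1.0], a))                        # [1, a_1, ..., a_p]
    # P(z) = A(z) + z^-(p+1) A(z^-1),  Q(z) = A(z) - z^-(p+1) A(z^-1)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    # all roots lie on the unit circle; the trivial roots are z = -1 (P) and z = +1 (Q)
    ang = np.angle(np.concatenate((np.roots(P[::-1]), np.roots(Q[::-1]))))
    # keep one root of each conjugate pair and drop the trivial roots at 0 and pi
    lsps = np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return lsps

# Usage with the Levinson-Durbin sketch above: lsps = lpc_to_lsp(a)
```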
3) Parameter quantization and coding. This example uses Vector Quantization (VQ) for the quantization and encoding of the parameters. VQ is very efficient because it can quantize multiple values simultaneously and can take correlations in the input data into account. VQ quantization noise can nevertheless be quite large, because the codebook entries may not exactly match all values in the input vector. To fully exploit the efficiency of vector quantization while keeping its quantization noise as small as possible, this example uses a K-means vector quantizer design algorithm for a simple split VQ of the LSPs. The K-means algorithm proceeds as follows (a minimal training sketch follows the list):
① Initialization: choose a suitable method to set K initial codebook centres z_i, 1 ≤ i ≤ K;
② Nearest-neighbour classification: each training data vector X_t is assigned to the nearest codebook centre z_i according to the nearest-neighbour principle; the Euclidean distance, Mahalanobis distance, etc. can be used in this step;
③ Codebook update: after all training data have been assigned to their nearest codebook centre, new centroids, i.e. a new codebook, are generated;
④ Steps ② and ③ are repeated until the error between adjacent iterations meets the threshold requirement.
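The training sketch referenced above follows; the Euclidean distance is used for the nearest-neighbour step, and the codebook size, stopping tolerance and training data are placeholders.

```python
# Minimal K-means codebook training sketch for the LSP vector quantizer (steps 1-4).
import numpy as np

def train_vq(train: np.ndarray, K: int, iters: int = 50, tol: float = 1e-6,
             seed: int = 0) -> np.ndarray:
    """train: (num_vectors, dim) LSP training vectors. Returns a (K, dim) codebook."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), K, replace=False)]   # 1) initialise
    prev_err = np.inf
    for _ in range(iters):
        # 2) nearest-neighbour classification (Euclidean distance)
        d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # 3) codebook update: new centroid of each cell
        for i in range(K):
            members = train[labels == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
        # 4) stop when the change in average distortion is below the threshold
        err = float(np.mean(d[np.arange(len(train)), labels]))
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return codebook

def quantize(lsp: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the nearest codebook entry (the value transmitted)."""
    return int(np.argmin(np.linalg.norm(codebook - lsp, axis=1)))
```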
More bits are assigned to the low-order LSPs, because the decoded speech quality is more sensitive to low-order LSP errors; the low-order LSPs tend to lie closer together, owing to the higher energy at low frequencies, which makes them more sensitive to quantization. Coarser quantization is applied to the higher LSPs. In the method proposed in this example, the LSPs represent all of the spectral information and can be coded at 37 bits/frame. Finally, uniform scalar quantization is used for the pitch frequency ω_0 and the predicted signal energy E.
4) The synthesis end synthesizes speech using the sinusoidal harmonic model parameters. The previously encoded parameters must be decoded before synthesizing the speech. Among these decoded parameters the LSPs are of paramount importance, since closely spaced LSPs represent peaks in the speech spectrum; the ear is very sensitive to these peaks, so very closely spaced LSPs must be treated carefully. This example applies bandwidth expansion to the line spectrum pairs, enforcing a minimum LSP separation so that no two LSPs are too close after the preceding quantization, since LSP quantization errors below 12.5 Hz (25 Hz steps) are inaudible.
The line spectrum pair (LSP) coefficients q_k (k = 1, 2, …, p−1) are then converted back into linear prediction coefficients a_k, k = 1, 2, …, p, as follows. P'(k) is first calculated from q_k (k = 1, 2, …, p−1) using the recurrence relation:
where q_{2k−1} = cos ω_{2k−1}.
Q'(k) is obtained by replacing q_{2k−1} in the above recurrence relation with q_{2k}.
Having obtained P'(k) and Q'(k), P'(z) and Q'(z) are formed; P'(z) is multiplied by (1 + z⁻¹) to obtain P(z), and Q'(z) is multiplied by (1 − z⁻¹) to obtain Q(z), i.e.
Finally, the prediction coefficients are obtained as:
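For illustration, this LSP-to-LPC step can also be written by rebuilding P(z) and Q(z) from second-order factors of the LSP frequencies and averaging them, which is equivalent to the recurrence above; the interleaving of P- and Q-roots assumed below follows the usual convention and is an assumption.

```python
# Sketch of the LSP -> LPC reconstruction at the decoder.
import numpy as np

def lsp_to_lpc(lsps: np.ndarray) -> np.ndarray:
    """lsps: p sorted LSP frequencies in radians. Returns a_1..a_p."""
    p = len(lsps)
    omegas = lsps[0::2]          # 1st, 3rd, 5th, ... LSPs -> factors of P(z)
    thetas = lsps[1::2]          # 2nd, 4th, 6th, ... LSPs -> factors of Q(z)
    P = np.array([1.0, 1.0])     # factor (1 + z^-1)
    Q = np.array([1.0, -1.0])    # factor (1 - z^-1)
    for w in omegas:
        P = np.convolve(P, [1.0, -2.0 * np.cos(w), 1.0])
    for t in thetas:
        Q = np.convolve(Q, [1.0, -2.0 * np.cos(t), 1.0])
    A = 0.5 * (P + Q)            # A(z) = (P(z) + Q(z)) / 2
    return A[1:p + 1]            # drop the leading 1; the last coefficient vanishes

# Round trip with the earlier sketches: a_rec = lsp_to_lpc(lpc_to_lsp(a))
```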
The linear prediction coefficients construct an all-pole synthesis filter. The transfer function of this filter is:
Fourier transforming the transfer function of this filter converts it into spectral amplitude samples.
To reconstruct the speech signal, appropriate amplitudes {B_m} must be determined for the sinusoids. {B_m} is determined from the average energy E of each band. The specific steps are as follows: first, the power spectrum is obtained from the spectrum:
Where E is the LPC predicted signal energy.
The energy of the mth harmonic of the synthesized signal can be expressed as E_m:
B_m is then obtained as:
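A hedged sketch of this amplitude recovery follows: the LPC power spectrum is sampled, summed over each harmonic band, and the square root taken as B_m. The band-edge formula and the scaling by E are assumptions, since formulas (32) and (33) are not reproduced here.

```python
# Sketch of step 4-1: harmonic amplitudes from the LPC synthesis-filter spectrum.
import numpy as np

def harmonic_amplitudes(a: np.ndarray, E: float, w0: float,
                        n_dft: int = 512) -> np.ndarray:
    """a: LPC coefficients a_1..a_p, E: predicted energy, w0: pitch in rad/sample."""
    A = np.concatenate(([1.0], a))
    H = 1.0 / np.fft.rfft(A, n=n_dft)             # all-pole spectrum H(e^{jw})
    Pw = E * np.abs(H) ** 2                        # power spectrum scaled by E (assumed)
    L = int(np.floor(np.pi / w0))                  # number of harmonics
    B = np.zeros(L + 1)
    for m in range(1, L + 1):
        lo = int(np.round((m - 0.5) * w0 * n_dft / (2 * np.pi)))   # band edges (assumed)
        hi = int(np.round((m + 0.5) * w0 * n_dft / (2 * np.pi)))
        band = Pw[max(lo, 0):min(hi, len(Pw))]
        B[m] = np.sqrt(np.sum(band))               # E_m = band energy, B_m = sqrt(E_m)
    return B[1:]
```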
Phase synthesis. The phase is synthesized based on the unvoiced/voiced state and a rule-based method, as follows. Since a pulse train in the time domain is equivalent to harmonics in the frequency domain, the phase of each harmonic can be modelled as the phase of the LPC filter excited by the pulse. Since the phase of unvoiced sound tends to be random, the following phase synthesis is performed when the frame is voiced.
Because the method does not transmit the pulse positions of this model, they need to be synthesized. The excitation pulses occur at a rate ω_0, and the phase of the first harmonic advances by ω_0 × 80 over one synthesis frame (80 samples). For example, if ω_0 is π/20, then over a 10 ms frame the phase of the first harmonic advances by (π/20) × 80 = 4π, i.e. two complete cycles. The excitation phase of the first harmonic is then generated as:
arg(Ex[1])=ω0*80 (34),
where E x [1] is the first complex excitation (frequency domain) sample and arg (z) is the phase of the complex z.
Then, the phase of the mth excitation harmonic is correlated with the phase of the first harmonic as:
arg(Ex[m])=m*arg(Ex[1]) 1<m≤L (35),
From the above, E_x[m] is obtained, and the final harmonic phases are determined by passing E_x[m] through the LPC synthesis filter.
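A sketch of this rule-based phase synthesis for a voiced frame follows; the running first-harmonic excitation phase and the LPC phase evaluation are written out explicitly, and the function and argument names are illustrative.

```python
# Sketch of the rule-based phase synthesis: arg(Ex[1]) advances by w0*80 per
# synthesis frame, arg(Ex[m]) = m*arg(Ex[1]), and each harmonic is rotated by
# the phase of the LPC synthesis filter H(z) = 1/A(z) at m*w0.
import numpy as np

def synthesize_phases(a: np.ndarray, w0: float, prev_phi1: float,
                      frame_len: int = 80):
    """Return (theta_m for m = 1..L, updated first-harmonic excitation phase)."""
    L = int(np.floor(np.pi / w0))
    phi1 = prev_phi1 + w0 * frame_len                 # arg(Ex[1]) advances by w0*80
    ex_phase = np.array([m * phi1 for m in range(1, L + 1)])   # arg(Ex[m]) = m*arg(Ex[1])
    # phase response of H(e^{jw}) = 1/A(e^{jw}) evaluated at each harmonic m*w0
    A = np.concatenate(([1.0], a))
    n = np.arange(len(A))
    H_phase = np.array([-np.angle(np.sum(A * np.exp(-1j * m * w0 * n)))
                        for m in range(1, L + 1)])
    theta = np.angle(np.exp(1j * (ex_phase + H_phase)))        # wrap to (-pi, pi]
    return theta, phi1
```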
Phase parameter correction is then carried out. The average energy e of the whole spectrum is first calculated:
The average energy e is compared with a background noise threshold (set to 40); if e is below the threshold and the frame is unvoiced, the background noise estimate is updated:
where β is the background noise and its initial value is zero.
When the frame is voiced and a harmonic amplitude A_m is less than the threshold τ, the phase of that harmonic is perturbed, so that any harmonic smaller than the background estimate receives a random phase. The perturbed phase is given by the following formula:
θm=(2π/32767)*ξ (38),
where ξ is a randomly generated random number; for θ_m to cover the range 0 to 2π, ξ ranges from 0 to 32767.
Then a frequency-domain synthesized signal is constructed using the sinusoidal model parameters of the current frame (i.e. the model parameters generated above), and the OM-LSA post-filter is used to reduce the noise further. The synthesized signals of adjacent frames are then combined using an overlap-add algorithm, and finally the frequency-domain information is converted to the time domain using an inverse FFT (IFFT). To generate a continuous time-domain waveform, the IDFTs of adjacent frames are smoothly interpolated using a weighted overlap-add algorithm. The synthesis steps are as follows:
The synthesized speech is constructed using the above sinusoidal model parameters:
where this signal represents the frequency-domain synthesized speech signal and consists of pulses spaced ω_0 apart, each weighted by the complex harmonic amplitude A_m.
The synthesized speech is denoised by OM-LSA post-filtering. The noise-reduced synthesized speech spectrum estimate is obtained from the frequency-domain synthesized speech signal through a spectral gain filter G(k, l):
The spectral gain is calculated as follows:
where ξ(k, l) is the a priori signal-to-noise ratio, γ(k, l) is the a posteriori signal-to-noise ratio, p(k, l) is the probability of speech presence, and q(k, l) is the probability of speech absence. β is a weight factor that controls the balance between noise reduction and speech distortion, and is 0.92 in this example. G_H1 is the conditional gain when speech is present, G_min is the lower gain limit of the filter when speech is absent, λ_d(k, l) is the noise power spectrum estimate, and α_d is the time-varying smoothing parameter, 0.85 in this example.
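A sketch of the OM-LSA gain of the form G = G_H1^p · G_min^(1−p) is given below. The log-spectral-amplitude expression for G_H1 and the −25 dB default for G_min are standard OM-LSA choices assumed here, and the a priori SNR, a posteriori SNR and speech-presence probability are taken as given inputs (the decision-directed/IMCRA estimators that produce them are not reproduced).

```python
# Sketch of the OM-LSA spectral gain and its application to one frame spectrum.
import numpy as np
from scipy.special import exp1

def omlsa_gain(xi: np.ndarray, gamma: np.ndarray, p_speech: np.ndarray,
               g_min: float = 10 ** (-25 / 20)) -> np.ndarray:
    """xi: a priori SNR, gamma: a posteriori SNR, p_speech: speech-presence prob."""
    v = gamma * xi / (1.0 + xi)
    # log-spectral-amplitude conditional gain when speech is present
    G_H1 = (xi / (1.0 + xi)) * np.exp(0.5 * exp1(np.maximum(v, 1e-10)))
    G = (G_H1 ** p_speech) * (g_min ** (1.0 - p_speech))
    return np.minimum(G, 1.0)

# Applying it to one synthesized frame spectrum:
# S_hat = omlsa_gain(xi, gamma, p_speech) * S_synth
```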
In order to reconstruct a continuous synthesized speech waveform, it is necessary to smoothly connect adjacent noise-reduced synthesized speech spectra. This is performed by windowing each frame and then shifting and superimposing adjacent frames using an overlap-add algorithm. A triangular window is used for this algorithm and is defined by:
And finally, performing inverse Fourier transform to recover the time domain signal.
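A sketch of the windowed overlap-add reconstruction follows; the 160-sample triangular window with an 80-sample hop is an assumption made for illustration, since the exact window length of formula (42) is not reproduced here.

```python
# Sketch of step 4-5: inverse-transform each noise-reduced frame spectrum, apply
# a triangular synthesis window, and overlap-add adjacent frames.
import numpy as np

def overlap_add(frame_spectra: np.ndarray, win_len: int = 160,
                hop: int = 80) -> np.ndarray:
    """frame_spectra: (num_frames, N_dft) noise-reduced spectra. Returns a waveform."""
    num_frames, n_dft = frame_spectra.shape
    win = np.bartlett(win_len)                       # triangular window
    out = np.zeros(hop * (num_frames - 1) + win_len)
    for l in range(num_frames):
        frame = np.real(np.fft.ifft(frame_spectra[l]))[:win_len]   # back to time domain
        out[l * hop:l * hop + win_len] += frame * win               # shift and add
    return out
```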

Claims (1)

1. A low-rate voice encoding and decoding method based on a sine harmonic model is characterized by comprising the following steps:
1) The speech signal is first converted to the frequency domain by performing a Short-Time Fourier Transform (STFT) on the speech block signal:
where k is the frequency bin index and l is the frame index, w(n) is a window function, N is the frame length, and M is the frame shift.
2) Model parameters are extracted.
2-1) Extract the pitch (fundamental frequency) of the speech frame. Pitch parameter extraction is divided into three stages. A preliminary pitch extraction stage determines a set of candidate values from frames of the input speech. First, the speech frame is squared in the time domain to obtain the squared time-domain signal s²(n):
A notch filter is then used to remove the DC component of s²(n), followed by low-pass filtering, and finally the signal is decimated by a factor of 5. The decimated signal is windowed and the spectrum Z(k) of a 512-point DFT is calculated by padding the decimated signal with 448 zeros. The power spectrum U(k) = |Z(k)|² of the squared signal is then obtained, and its local maxima are used as pitch candidates ω_v.
A post-processing stage makes a decision by evaluating a cost function E(ω_v) at each candidate fundamental frequency ω_v, selecting the candidate with the smallest E(ω_v) as the pitch estimate for the current frame.
where E(ω_v) is the error between the original speech and the speech synthesized using the harmonic model.
A pitch estimation refinement stage further refines ω_0 and increases the estimation accuracy. This stage uses a low-complexity pitch refinement step. A cost function is defined:
where S_w is the frequency spectrum of the windowed frame.
The function simply samples the power spectrum of the windowed speech signal for that frame, with the argument of S_w rounded to the nearest integer. In this step the function is sampled in two passes. First, with a sampling step of 1, the function is sampled over a range of plus or minus 5 samples of the pitch period. Then, with a sampling step of 0.25, it is sampled over a range of plus or minus one sample of the pitch period. The candidate with the smallest sampled value is taken as the optimally estimated pitch.
2-2) Unvoiced/voiced decision. The method uses a signal-to-noise ratio (SNR) based decision to determine whether the frame is unvoiced or voiced, and checks and optimizes the decision afterwards. The frame is fitted using the model parameters estimated by the harmonic model, and the fitting result is mapped onto an SNR, whose frequency-domain expression is:
By comparison with a fixed threshold of 6 dB, i.e. 10 lg SNR > 6, the energy in the band is determined to be voiced, and unvoiced otherwise. The method then applies the ratio of low-frequency to high-frequency energy to check whether a previously made unvoiced/voiced decision is correct, and corrects it if it is not.
2-3) Linear prediction coefficient extraction. Direct quantization of the harmonic amplitudes would require a large number of bits. The method uses 10th-order linear prediction; the time-domain analysis produces a set of p LPC coefficients {a_k} and an LPC gain function G, which are used to generate the harmonic amplitudes at the synthesis end, and these parameters are obtained as follows:
Wherein, S (n) is a time domain speech signal.
The coefficients {a_k} are solved by the Levinson-Durbin algorithm.
The energy E of the predicted signal may be determined by a linear prediction coefficient a k and an autocorrelation function R (k):
The linear prediction coefficients a_k are then converted into Line Spectrum Pair (LSP) form. The LSPs are obtained by solving for the complex-conjugate roots of the (p+1)th-order symmetric and antisymmetric polynomials.
3) The parameters are quantized and encoded. Uniform scalar quantization is used for the unvoiced/voiced state, the pitch ω_0 and the predicted signal energy E. The required vector quantizer for the LSP parameters is trained using a K-means vector quantizer design algorithm, after which the LSP parameters are quantized and encoded. Vector quantization encodes several values at a time, so only one index is needed to reference them, and correlations in the input data can be taken into account. The method also assigns more bits to the low-order LSPs, because the decoded speech quality is more sensitive to low-order LSP errors; coarser quantization is used for the higher LSPs.
4) And synthesizing voice. The previously quantized encoded parameters need to be decoded before synthesizing speech.
4-1) Determine the sinusoidal amplitudes {B_m} for the synthesized speech signal. A bandwidth expansion algorithm is applied to the decoded LSP parameters, the parameters are then interpolated, and the interpolated line spectrum pairs are converted back into linear prediction coefficients. The Fourier transform of the linear prediction coefficients yields spectral amplitude samples, from which {B_m} is determined using the average energy E of each band. The specific steps are: first, the power spectrum is obtained from the spectral amplitude samples:
The energy of the mth harmonic of the synthesized signal can be expressed as E_m:
B_m is then obtained as:
4-2) Synthesize the phase using an approach based on the unvoiced/voiced state and on rules. A pulse train in the time domain is equivalent to harmonics in the frequency domain, and the phase of each harmonic can be modelled as the phase of an LPC filter excited by the pulse. Since the phase of unvoiced sound tends to be random, the following phase synthesis is performed when the frame is voiced. Because the method does not transmit the pulse positions of this model, they need to be synthesized. The excitation pulses occur at a rate ω_0, so the phase of the first harmonic advances by ω_0 × 80 over one synthesis frame (80 samples). The excitation phase of the first harmonic is therefore set to:
arg(Ex[1])=ω0*80 (12)
Where E x [1] is the first complex excitation (frequency domain) sample and arg (z) is the phase of the complex sample z.
Then, the phase of the mth excitation harmonic is correlated with the phase of the first harmonic as:
arg(Ex[m])=m*arg(Ex[1])1<m≤L (13)
The resulting E_x[m] is passed through the LPC synthesis filter to determine the final harmonic phases.
To reduce errors in the presence of background noise, the phase parameters are modified before synthesizing the speech. When the average energy e is below the threshold and the frame is unvoiced, the background noise estimate is updated:
where β is the background noise and its initial value is zero.
When the frame is voiced and a harmonic amplitude A_m is less than the threshold τ, the phase of that harmonic is perturbed, so that any harmonic smaller than the background estimate receives a random phase.
4-3) A frequency-domain synthesized signal is constructed using the sinusoidal harmonic model parameters of the current frame (i.e. the model parameters generated above):
where this signal represents the DFT of the synthesized speech signal and consists of pulses spaced ω_0 apart, each weighted by the complex harmonic amplitude A_m.
4-4) Further noise reduction using an OM-LSA post-filter. The noise-reduced synthesized speech spectrum estimate is obtained from the frequency-domain synthesized speech signal through a spectral gain filter G(k, l):
where the left-hand side is the noise-reduced signal, the input is the frequency-domain synthesized speech signal, and G(k, l) is the spectral gain filter. The spectral gain is calculated as follows:
where ξ(k, l) is the a priori signal-to-noise ratio, p(k, l) is the probability of speech presence, and q(k, l) is the probability of speech absence; G_H1 is the conditional gain when speech is present, and G_min is the lower gain limit of the filter when speech is absent.
4-5) Overlap-add and inverse fourier transform. In order to reconstruct a continuous synthesized speech waveform, it is necessary to smoothly connect adjacent noise-reduced synthesized speech spectra. This is performed by windowing each frame and then shifting and superimposing adjacent frames using an overlap-add algorithm. A triangular window is used for this algorithm and is defined by:
And finally, performing inverse Fourier transform to recover the time domain signal.

