CN104658547A - Method for expanding artificial voice bandwidth - Google Patents

Method for expanding artificial voice bandwidth

Info

Publication number
CN104658547A
CN104658547A (application CN201310590362.6A)
Authority
CN
China
Prior art keywords
frequency
speech
high frequency
voice
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310590362.6A
Other languages
Chinese (zh)
Inventor
盖丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian You Jia Software Science And Technology Ltd
Original Assignee
Dalian You Jia Software Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian You Jia Software Science And Technology Ltd filed Critical Dalian You Jia Software Science And Technology Ltd
Priority to CN201310590362.6A priority Critical patent/CN104658547A/en
Publication of CN104658547A publication Critical patent/CN104658547A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method for expanding artificial voice bandwidth. The method works as follows: the narrowband speech signal passes through a curve-fitting module and then a high-frequency envelope extrapolation module, whose output enters a spectrum-shaping module; in parallel, the narrowband speech signal passes through a feature-extraction module, where each frame yields a set of linear prediction coefficients; an auto-regression (AR) model and filter module is constructed from these coefficients, and white noise is processed by the AR model to generate a high-frequency noise sequence correlated with the low frequencies, which also enters the spectrum-shaping module; the spectrum-shaping module outputs the high-frequency speech; finally, the high-frequency speech and the original narrowband speech signal pass through a speech-synthesis module to obtain the wideband speech.

Description

A method for artificial speech bandwidth expansion
Technical field
The present invention relates to a method for artificial speech bandwidth expansion and belongs to the field of digital signal processing.
Background technology
At present, the effective frequency range of the public switched telephone network (PSTN) is only 0.3–3.4 kHz, and the effective bandwidth of GSM digital cellular telephony does not exceed 4 kHz. Although most of the energy of a speech signal is concentrated in the 0.3–3.4 kHz band, the frequency range it actually occupies is much wider. Because 4 kHz narrowband speech lacks the high-frequency components, its naturalness and intelligibility are noticeably degraded, and it sounds "muffled".
Summary of the invention
To overcome the above deficiency, the object of the present invention is to provide a method for artificial speech bandwidth expansion.
A method for artificial speech bandwidth expansion, whose working process is as follows:
The narrowband speech signal passes through the curve-fitting module and then the high-frequency envelope extrapolation module, whose output enters the spectrum-shaping module. In parallel, the narrowband speech signal passes through the feature-extraction module, where each frame yields a set of linear prediction coefficients; the autoregressive model and filter module is constructed from these coefficients, white noise is processed by this autoregressive model to generate a high-frequency noise sequence correlated with the low frequencies, and this sequence also enters the spectrum-shaping module. The spectrum-shaping module outputs the high-frequency speech. The high-frequency speech and the narrowband speech signal then pass through the speech-synthesis module to obtain the wideband speech.
Principle and beneficial effects of the invention: the method keeps the advantage of low algorithmic complexity while producing an artificial excitation that is highly correlated with the true excitation. The invention first fits a curve to the known low-frequency log-domain spectrum to obtain a curve equation, and then extrapolates the high-frequency log-domain spectral envelope curve. Starting from the low-frequency parameters of the narrowband speech, the linear prediction coefficients are used to form an autoregressive model, and a uniform white-noise sequence is passed through this model to obtain a high-frequency noise sequence. This sequence is white noise with a certain correlation to the narrowband speech; it is converted into a log-domain spectrum and modulated by the high-frequency log-spectral envelope to recover the high-frequency speech, and the wideband speech is synthesized in the cepstral domain. The invention is a fully blind speech bandwidth extension technique that can be applied directly at the narrowband receiving end. It requires no prior knowledge or high-frequency side information, has low algorithmic complexity, can recover a high-frequency part that is highly correlated with the low band, and the synthesized wideband speech has a good auditory quality.
Brief description of the drawings
Fig. 1 is the flow diagram of the present invention.
Fig. 2 shows the wideband speech synthesis process of the present invention.
Fig. 3(a) Spectrogram of the original wideband speech.
Fig. 3(b) Spectrogram of the narrowband speech.
Fig. 3(c) Spectrogram of the speech after bandwidth extension.
Fig. 4(a) Distribution of the comparison results between the output of the proposed algorithm and the output of the adaptive variable-rate speech codec at a bit rate of 12.2 kbps.
Fig. 4(b) Distribution of the comparison results between the output of the proposed algorithm and the output of the wideband adaptive-rate speech codec at a bit rate of 8.85 kbps.
Fig. 5 Spectral distortion measure of the narrowband speech and of the wideband speech synthesized by the present invention.
Fig. 6 shows the subjective test grading criteria.
Embodiments
The present invention is further described below with reference to the accompanying drawings.
Fig. 1 is the flow diagram of the present invention. As shown in Fig. 1:
The narrowband speech signal passes through the curve-fitting module and then the high-frequency envelope extrapolation module, whose output enters the spectrum-shaping module. In parallel, the narrowband speech signal passes through the feature-extraction module, where each frame yields a set of linear prediction coefficients used to construct the autoregressive (AR) model and filter module; white noise is processed by this AR model to generate a high-frequency noise sequence correlated with the low frequencies, and this sequence also enters the spectrum-shaping module. The spectrum-shaping module outputs the high-frequency speech. The high-frequency speech and the narrowband speech signal then pass through the speech-synthesis module to obtain the wideband speech.
Curve fitting module
This module uses curve fitting to obtain the curve equation of the low-frequency log-spectral envelope of the narrowband speech and extrapolates the high-frequency log-spectral envelope from that equation. The formants of the low-frequency part are chosen as the input of the fit. The narrowband speech sampled at 8 kHz is first input, the pitch period is estimated, and the time-domain signal is transformed into the log-spectral domain; the peaks of the log spectrum are located using the estimated pitch period, the variation of the formant peaks is then described by curve fitting, and the high-frequency log-spectral envelope curve is extrapolated.
First, the narrowband speech is divided into frames of length 128 with an overlap of 64 samples between frames. A frequency-domain method, i.e., computing the correlation of the signal, is used to obtain the pitch period T of this frame of speech. If the input narrowband speech is x(n), the autocorrelation function R(k) is
R(k) = Σ_{n=0}^{N−1} x(n)·x(n−k)
where N is the frame length, N = 128. The position k′ of the maximum of R(k) is searched over the lag range k = 20–143, and k′ is the estimate T of the pitch period. The narrowband speech x(n) is Fourier-transformed and then converted to the log-spectral domain, where the first formant peak is located and denoted p₀. Since the pitch period is roughly equal to the spacing between the spectral peaks, the other low-frequency peaks can be found from the determined first peak p₀ and the pitch period T: when searching for each further low-frequency peak, only the points at a distance of about T from the previous peak need to be examined, which gives the exact positions of the other peaks. Their amplitudes, denoted lo_env(ω) at the corresponding frequency points ω, form the low-frequency log-spectral envelope. lo_env(ω) and ω serve as the input of the curve fit.
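For illustration, a minimal sketch of this per-frame pitch estimate, assuming one 128-sample frame of 8 kHz narrowband speech is available as a NumPy array; the correlation is computed directly in the time domain here rather than via a frequency-domain method, and the function and parameter names are illustrative, not part of the patent.

import numpy as np

def estimate_pitch_period(x, k_min=20, k_max=143):
    """Return the lag k' that maximises R(k) = sum_n x(n) * x(n - k)."""
    N = len(x)                                  # frame length, N = 128 in the text
    k_max = min(k_max, N - 1)                   # lags >= N would need samples from the previous frame
    best_k, best_r = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        r = np.sum(x[k:] * x[:N - k])           # autocorrelation at lag k
        if r > best_r:
            best_k, best_r = k, r
    return best_k                               # estimated pitch period T (in samples)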
A mapping is established between the low-frequency log-spectral envelope lo_env(ω) and the low-band frequency ω:
lo_env(ω) = a·e^(bω) + c·e^(dω), ω = 0 ~ 2π×4000
The parameters a, b, c, d of the fitting function are obtained, which determines the mapping equation.
High-frequency envelope extrapolation module
Using the determined mapping equation, the higher frequency points are substituted into the formula to extrapolate the unknown high-frequency spectral envelope data hi_env(ω), i.e., the extrapolated high-frequency log-spectral envelope hi_env(ω)
hi_env(ω) = a·e^(bω) + c·e^(dω), ω = 2π×4000 ~ 2π×8000.
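For illustration, a minimal sketch of the envelope fit and extrapolation, assuming `omega_lo` and `lo_env` are NumPy arrays holding the formant frequencies and their log-spectral amplitudes found in the 0–4 kHz band; SciPy's general-purpose curve_fit is used here as one possible fitting routine, and the names, starting values, and frequency normalisation are illustrative assumptions rather than part of the patent.

import numpy as np
from scipy.optimize import curve_fit

def double_exponential(omega, a, b, c, d):
    # lo_env(omega) = a*exp(b*omega) + c*exp(d*omega)
    return a * np.exp(b * omega) + c * np.exp(d * omega)

def extrapolate_envelope(omega_lo, lo_env, omega_hi):
    """Fit a, b, c, d on the low-band points and evaluate the curve on the high band."""
    # The raw omega values are large (up to 2*pi*8000), so the exponents overflow
    # easily; normalising omega before fitting keeps the optimisation stable.
    scale = 2.0 * np.pi * 8000.0
    params, _ = curve_fit(double_exponential, omega_lo / scale, lo_env,
                          p0=(1.0, -1.0, 1.0, -1.0), maxfev=10000)
    return double_exponential(omega_hi / scale, *params)   # hi_env on the 4-8 kHz grid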
Feature extraction module
Linear prediction analysis is performed on the narrowband speech: each frame yields a set of linear prediction coefficients, from which the autoregressive model is constructed. The narrowband speech is used first to construct the autoregressive model. Linear prediction analysis is applied to each speech frame x(n) of length N (N = 128), i.e., the autocorrelation function of each windowed speech frame is computed and converted into linear prediction coefficients with the Levinson-Durbin algorithm. The specific steps are as follows.
A Hamming window window(n) = 0.5 − 0.5·cos(2πn/N), n = 0, 1, ..., N−1, is used here to window the input speech signal x(n); the windowed speech x′(n) is
x'(n)=x(n)·window(n),
The autocorrelation function is computed:
R(k) = Σ_{n=k}^{N−1} x′(n)·x′(n−k), k = 0, 1, ..., N−1, where N is a positive integer.
The L-th-order linear prediction coefficients a_i, i = 1, 2, ..., L (L a positive integer), can be obtained by solving the following system of equations:
Σ_{i=1}^{L} a_i·R(|i−k|) = −R(k), k = 1, ..., L, where L is a positive integer.
The Levinson-Durbin algorithm is used to solve the above system of equations and obtain the linear prediction coefficients a_i, i = 1, 2, ..., L.
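For illustration, a minimal sketch of this feature-extraction step, assuming `x` is one N = 128 sample frame and L is the model order (8–20 per the text); the recursion below is the standard Levinson-Durbin form, with the sign convention chosen to match the synthesis filter H(z) = G / (1 − Σ a_i·z^(−i)). Names are illustrative.

import numpy as np

def lpc_coefficients(x, L=10):
    """Windowed autocorrelation followed by the Levinson-Durbin recursion."""
    N = len(x)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)          # window(n) from the text
    xw = x * window
    R = np.array([np.sum(xw[k:] * xw[:N - k]) for k in range(L + 1)])  # R(0)..R(L)
    a = np.zeros(L + 1)          # a[0] is unused; a[1..L] are the predictor coefficients
    E = R[0]                     # prediction-error energy
    for i in range(1, L + 1):
        acc = R[i] - np.sum(a[1:i] * R[i - 1:0:-1])
        k = acc / E              # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        E = (1.0 - k * k) * E
    return a[1:]                 # a_1 ... a_L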
Autoregressive model construction and filter module
The synthesis filter is constructed from the low-frequency speech linear prediction coefficients a_i, i = 1, ..., L, namely
H(z) = G / (1 − Σ_{i=1}^{L} a_i·z^(−i)),
where L is the order of the autoregressive model, a positive integer between 8 and 20, and G is a value between 0.1 and 1. In the embodiments of the invention, L = 10 and G = 1 is the preferred setting.
White noise is processed by this synthesis filter to produce a random sequence correlated with the low-frequency speech. The white-noise sequence is generated by
w(n)=[w(n-1)·31821+13849],
where w(0) = 0.
After passing through the above synthesis filter, the white-noise sequence w(n) yields the high-frequency noise sequence y(n), namely
y(n) = w(n) + Σ_{i=1}^{L} a_i·y(n−i),
where a_i are the synthesis-filter coefficients. To limit the energy of the high-frequency part, the high-frequency noise sequence y(n) is normalized, namely
y(n) = y(n) / √( Σ_{n=0}^{N−1} y(n)·y(n) ),
where N is the frame length; the invention suggests N = 128.
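For illustration, a minimal sketch of this excitation-generation step, assuming `a` holds the L low-band predictor coefficients from the previous module. The congruential recursion w(n) = w(n−1)·31821 + 13849 is quoted from the text; the 16-bit wrap-around and the scaling of the noise to roughly [−1, 1) are assumptions made only to keep the sketch bounded and runnable, and the names are illustrative.

import numpy as np
from scipy.signal import lfilter

def highband_excitation(a, N=128, G=1.0, seed=0):
    """Generate pseudo-random noise, shape it with the AR synthesis filter, normalise."""
    w = np.zeros(N)
    state = seed                                        # w(0) = 0 per the text
    for n in range(N):
        state = (state * 31821 + 13849) % (1 << 16)     # assumed 16-bit wrap-around
        w[n] = state / float(1 << 15) - 1.0             # assumed scaling to roughly [-1, 1)
    # AR synthesis filter H(z) = G / (1 - sum_i a_i z^-i)
    y = lfilter([G], np.concatenate(([1.0], -np.asarray(a))), w)
    # energy normalisation of the high-frequency noise sequence
    return y / np.sqrt(np.sum(y * y))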
Spectral shaping module
The high-frequency log-spectral envelope hi_env(ω) estimated above is used to modulate the high-frequency noise sequence [7]. First, the high-frequency noise sequence y(n) is Fourier-transformed and then converted to the log domain, giving the log-spectral values C_y(ω) of the high-frequency noise sequence. The high-frequency log-spectral envelope is then used to modulate the spectrum of the high-frequency noise sequence, giving the log-spectral values C_wide(ω) of the high-frequency speech:
C_wide(ω) = C_y(ω)·hi_env(ω),
If the frequency-domain values of the high-frequency speech and its time-domain values are denoted S_wide(ω) and s_wide(n) respectively, then
S_wide(ω) = exp(C_wide(ω)), (1)
s_wide(n) = IFFT(S_wide(ω)), (2)
where exp(·) is the exponential operation and IFFT(·) is the inverse Fourier transform. Through the inverse-transform steps of formulas (1) and (2), the high-frequency speech is obtained.
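For illustration, a minimal sketch of this spectrum-shaping step, assuming `y` is one frame of the high-frequency noise sequence and `hi_env` is the extrapolated log-spectral envelope sampled on the same FFT grid. The text leaves the phase handling of the inverse transform implicit, so taking the real part below is an assumption; names are illustrative.

import numpy as np

def shape_highband(y, hi_env, eps=1e-12):
    """Modulate the log spectrum of the excitation by the extrapolated envelope."""
    Y = np.fft.fft(y)
    C_y = np.log(np.abs(Y) + eps)          # log-magnitude spectrum of the noise sequence
    C_wide = C_y * hi_env                  # modulation by the high-frequency envelope
    S_wide = np.exp(C_wide)                # formula (1): back to the linear domain
    s_wide = np.real(np.fft.ifft(S_wide))  # formula (2): inverse FFT (real part taken)
    return s_wide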
Voice synthetic module
The invention exploits the properties of the cepstrum to combine the high-frequency part and the low-frequency part of the speech [8] and thereby obtain the synthesized wideband speech. The synthesis process is shown in Fig. 2.
The narrowband signal with a sampling frequency of 8 kHz is upsampled to 16 kHz by interpolation, and its cepstrum is obtained through the cepstrum computation; the cepstrum of the high-frequency speech is obtained in the same way. The cepstra of the narrowband speech and of the high-frequency speech are each transformed to the frequency domain, and the frequency-domain amplitudes are processed as follows:
C_wide(ω) = C_narrow(ω) + C_high(ω)
where C_narrow(ω) and C_high(ω) are the frequency-domain values of the cepstra of the narrowband speech and of the high-frequency speech respectively, and C_wide(ω) is the frequency-domain value of the synthesized wideband cepstrum. An inverse Fourier transform then yields the cepstrum of the wideband speech, and finally the inverse of the cepstrum computation yields the synthesized wideband speech, as shown in Fig. 2.
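For illustration, a minimal sketch of this cepstral-domain synthesis, assuming `x_nb` is a narrowband frame already interpolated to 16 kHz and `s_hi` is the shaped high-frequency frame of the same length. The real cepstrum is used, and the narrowband phase is reused when inverting the cepstrum, which is one plausible reading of a step the text does not spell out; names are illustrative.

import numpy as np

def real_cepstrum(x, eps=1e-12):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + eps)))

def synthesize_wideband(x_nb, s_hi):
    c_nb = real_cepstrum(x_nb)                 # cepstrum of the (upsampled) narrowband speech
    c_hi = real_cepstrum(s_hi)                 # cepstrum of the high-frequency speech
    # transform both cepstra to the frequency domain and add them
    C_wide = np.fft.fft(c_nb) + np.fft.fft(c_hi)
    c_wide = np.real(np.fft.ifft(C_wide))      # cepstrum of the wideband frame
    # inverse cepstrum: back to a log-magnitude spectrum, then to magnitude;
    # the phase of the narrowband frame is borrowed here (an assumption)
    magnitude = np.exp(np.real(np.fft.fft(c_wide)))
    phase = np.angle(np.fft.fft(x_nb))
    return np.real(np.fft.ifft(magnitude * np.exp(1j * phase)))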
The invention is a fully blind speech bandwidth extension technique that can be applied directly at the narrowband receiving end. It requires no prior knowledge or high-frequency side information, has low algorithmic complexity, can recover a high-frequency part that is highly correlated with the low band, and the synthesized wideband speech has a good auditory quality.
To verify the effectiveness of the invention, objective and subjective tests were carried out.
Objective test results
The spectral distortion measure and the spectrogram are effective means of objectively assessing speech quality. Without loss of generality, computing the spectral distortion measure and drawing spectrograms were chosen for the objective test.
The spectral distortion measure is defined as
D_HC² = (1/K)·Σ_{k=1}^{K} ∫_{0.25ω_s}^{0.5ω_s} ( 20·log10( A_k(ω) / A′_k(ω) ) + G_C )² dω,
G_C = (1/(0.25ω_s))·∫_{0.25ω_s}^{0.5ω_s} 20·log10( A′_k(ω) / A_k(ω) ) dω,
where ω_s is 2π, G_C is the gain-compensation factor, which effectively removes the overall gain difference between the two envelopes, K is the total number of speech frames, and A_k(ω) and A′_k(ω) are the spectral envelopes of the k-th frame of the original reference speech and of the speech under test respectively, computed as
A_k(ω) = | Σ_{n=0}^{N−1} x(n)·e^(−jωn) |,
A′_k(ω) = | Σ_{n=0}^{N−1} x′(n)·e^(−jωn) |,
The invention suggests N = 128. x(n) and x′(n) denote the original reference speech and the speech under test respectively; here the reference speech is the original wideband speech, and the speech under test is either the original narrowband speech or the synthesized wideband speech.
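For illustration, a minimal sketch of this distortion measure, assuming `ref_frames` and `test_frames` are lists of aligned N = 128 sample frames of the reference and test speech sampled at 16 kHz. The integral over 0.25ω_s–0.5ω_s (the 4–8 kHz band) is approximated by an average over the corresponding FFT bins, which is an assumption about the discretisation; names are illustrative.

import numpy as np

def spectral_distortion(ref_frames, test_frames, n_fft=128, eps=1e-12):
    """Frame-averaged high-band spectral distortion with gain compensation."""
    K = len(ref_frames)
    lo, hi = n_fft // 4, n_fft // 2              # bins covering 0.25..0.5 of the sampling rate
    total = 0.0
    for x, x_test in zip(ref_frames, test_frames):
        A  = np.abs(np.fft.fft(x, n_fft))[lo:hi] + eps
        At = np.abs(np.fft.fft(x_test, n_fft))[lo:hi] + eps
        diff = 20.0 * np.log10(A / At)           # 20*log10(A_k / A'_k)
        G_c = np.mean(20.0 * np.log10(At / A))   # gain-compensation factor G_C
        total += np.mean((diff + G_c) ** 2)      # band average approximating the integral
    return np.sqrt(total / K)                    # D_HC in dB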
The spectral distortion measure was computed in this way for the original narrowband speech and for the wideband speech synthesized by the proposed algorithm. The results are shown in Fig. 5. As can be seen from Fig. 5, the spectral distortion of the wideband speech synthesized by the proposed algorithm is clearly lower than that of the narrowband speech, showing that the algorithm estimates the high-frequency speech and synthesizes the wideband speech well.
A spectrogram represents the spectral energy of a segment of speech as a gray-scale image: brighter parts of the image indicate more energy, and darker parts indicate less energy in that part of the spectrum. Spectrograms show the frequency content of speech intuitively, so to contrast the spectral differences more directly, the spectrograms of the original wideband speech, of the narrowband speech, and of the wideband speech synthesized by the proposed blind bandwidth extension algorithm, for a male test utterance, are given in Fig. 3(a), (b) and (c). From Fig. 3(a), the spectrogram of the original speech signal, it can be seen that the spectrogram is bright over the whole 0–8 kHz range. Fig. 3(b) is the spectrogram of the narrowband speech signal; it is very dark in the 4–8 kHz range, showing that the high-frequency energy is very small, which is why narrowband speech does not sound natural enough. Fig. 3(c) is the spectrogram of the speech output by the blind bandwidth extension algorithm proposed by the invention; the spectrogram is clearly brighter in the 4–8 kHz range, showing that the high-frequency components of the speech are significantly increased.
Subjective test results
The subjective test uses an internationally common subjective grading standard, namely the comparison mean opinion score. Fig. 6 gives the subjective grading criteria; the scoring range is −3 to +3.
The test speech chosen by the invention is as follows: (1) the narrowband telephone speech output by the adaptive variable-rate speech codec at a bit rate of 12.2 kbps; (2) the wideband telephone speech output by the wideband adaptive-rate speech codec at a bit rate of 8.85 kbps; (3) the wideband telephone speech obtained by passing the narrowband telephone speech output by the adaptive variable-rate speech codec at 12.2 kbps through the new blind bandwidth extension algorithm proposed by the invention.
The first group of test speech consists of the wideband telephone speech produced by the proposed blind bandwidth extension algorithm and the narrowband telephone speech output by the adaptive variable-rate speech codec at 12.2 kbps; the second group consists of the wideband telephone speech produced by the proposed algorithm and the wideband telephone speech output by the wideband adaptive-rate speech codec at 8.85 kbps. Every speech segment is adjusted to a level of −26 dB.
For the subjective test, 20 listeners (10 male, 10 female), aged between 20 and 40 and having taken part in no speech-related subjective test within the previous half year, were invited to listen under the same conditions. Before the test, the effect of bandwidth extension was demonstrated to the listeners, and they were informed that two main aspects of the speech were to be evaluated: the speech quality and the perceived extended high-frequency components. Once the test subjects understood the instructions, they first listened to a preliminary trial and gave their opinions. During the test, each group of test utterances was presented to the subjects in random order, and they were allowed to repeat the listening without restriction. Finally, each test subject gave an opinion according to the subjective grading criteria. Fig. 4(a) and 4(b) show the distributions of the comparison results for the two groups of test speech.
In the distribution plots, the horizontal axis is the subjective grading score and the vertical axis is the proportion of listeners giving a particular score. According to the grading criteria, a positive score means that the output of the proposed algorithm is better than the narrowband telephone speech output by the adaptive variable-rate speech codec at 12.2 kbps or than the wideband telephone speech output by the wideband adaptive-rate speech codec at 8.85 kbps. A difference analysis with a 95% confidence interval was used to analyse the bandwidth-extension test results. Fig. 4(a) is the distribution of the comparison between the output of the invention and the narrowband telephone speech output by the adaptive variable-rate speech codec at 12.2 kbps; Fig. 4(b) is the comparison with the wideband telephone speech output by the wideband adaptive-rate speech codec at 8.85 kbps. As can be seen from Fig. 4(a) and 4(b), the results of the proposed algorithm are slightly better than the wideband speech output by the wideband adaptive-rate speech codec at 8.85 kbps, and are a considerable improvement over the narrowband speech output by the adaptive variable-rate speech codec at 12.2 kbps; the auditory quality is significantly improved.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made, within the technical scope disclosed by the present invention and according to the technical solution of the present invention and its inventive concept, by a person familiar with the art shall be covered by the scope of protection of the present invention.

Claims (7)

1. A method for artificial speech bandwidth expansion, characterized in that:
the narrowband speech signal passes through the curve-fitting module and then the high-frequency envelope extrapolation module, whose output enters the spectrum-shaping module; in parallel, the narrowband speech signal passes through the feature-extraction module, where each frame yields a set of linear prediction coefficients used to construct the autoregressive model and filter module; white noise is processed by this autoregressive model to generate a high-frequency noise sequence correlated with the low frequencies, and this sequence enters the spectrum-shaping module; the spectrum-shaping module outputs the high-frequency speech; the high-frequency speech and the narrowband speech signal pass through the speech-synthesis module to obtain the wideband speech.
2. The method for artificial speech bandwidth expansion according to claim 1, characterized in that: the curve-fitting module uses curve fitting to obtain the curve equation of the low-frequency log-spectral envelope of the narrowband speech and extrapolates the high-frequency log-spectral envelope from that equation, choosing the formants of the low-frequency part as the input of the fit; the narrowband speech sampled at 8 kHz is first input, the pitch period is estimated, and the time-domain signal is transformed into the log-spectral domain; the peaks of the log spectrum are located using the estimated pitch period, the variation of the formant peaks is then described by curve fitting, and the high-frequency log-spectral envelope curve is extrapolated;
the narrowband speech is divided into frames of length 128 with an overlap of 64 samples between frames; a frequency-domain method, i.e., computing the correlation of the signal, is used to obtain the pitch period T of this frame of speech; the input narrowband speech is x(n), and the autocorrelation function R(k) is R(k) = Σ_{n=0}^{N−1} x(n)·x(n−k),
where N is the frame length, N = 128; the position k′ of the maximum of R(k) is searched over the lag range k = 20–143, and k′ is the estimate T of the pitch period; the narrowband speech is Fourier-transformed and then converted to the log-spectral domain, where the first formant peak is located and denoted p₀; since the pitch period is roughly equal to the spacing between the spectral peaks, the other low-frequency peaks can be found from the determined first peak p₀ and the pitch period T; when searching for each further low-frequency peak, only the points at a distance of about T from the previous peak need to be examined, which gives the exact positions of the other peaks; their amplitudes, denoted lo_env(ω) at the corresponding frequency points ω, form the low-frequency log-spectral envelope; lo_env(ω) and ω serve as the input of the curve fit, and a mapping is established between the low-frequency log-spectral envelope lo_env(ω) and the low-band frequency ω:
lo_env(ω) = a·e^(bω) + c·e^(dω), ω = 0 ~ 2π×4000; the parameters a, b, c, d of the fitting function are obtained, which determines the mapping equation.
3. The method for artificial speech bandwidth expansion according to claim 1, characterized in that: the high-frequency envelope extrapolation module substitutes the higher frequency points into the determined mapping equation to extrapolate the unknown high-frequency log-spectral envelope data hi_env(ω), i.e., the extrapolated high-frequency log-spectral envelope hi_env(ω)
hi_env(ω) = a·e^(bω) + c·e^(dω), ω = 2π×4000 ~ 2π×8000.
4. The method for artificial speech bandwidth expansion according to claim 1, characterized in that: the feature-extraction module performs linear prediction analysis on the narrowband speech, each frame yields a set of linear prediction coefficients, and the autoregressive model is constructed; the narrowband speech is used first to construct the autoregressive model; linear prediction analysis is applied to each speech frame x(n) of length N, N = 128, i.e., the autocorrelation function of each windowed speech frame is computed and converted into linear prediction coefficients with the Levinson-Durbin algorithm; the specific steps are as follows:
a Hamming window window(n) = 0.5 − 0.5·cos(2πn/N), n = 0, 1, ..., N−1 (N a positive integer), is used to window the input speech signal x(n); the windowed speech x′(n) is
x'(n)=x(n)·window(n),
the autocorrelation function R(k) = Σ_{n=k}^{N−1} x′(n)·x′(n−k), k = 0, 1, ..., N−1 (N a positive integer), is computed;
the Levinson-Durbin algorithm is then used to obtain the L-th-order autoregressive model coefficients a_i, i = 1, 2, ..., L (L a positive integer), by solving the system of equations Σ_{i=1}^{L} a_i·R(|i−k|) = −R(k), k = 1, ..., L.
5. The method for artificial speech bandwidth expansion according to claim 1, characterized in that the autoregressive model construction and filter module operates as follows:
the synthesis filter model is constructed from the low-frequency speech autoregressive model coefficients a_i, i = 1, ..., L (L a positive integer), namely H(z) = G / (1 − Σ_{i=1}^{L} a_i·z^(−i)),
where G is the gain and L is the order of the autoregressive model, a positive integer between 8 and 20; G is a value between 0.1 and 1;
white noise is processed by this synthesis filter to produce a random sequence correlated with the low-frequency speech; the white-noise sequence is generated by
w(n)=[w(n-1)·31821+13849],
where w(0) = 0;
after passing through the above synthesis filter, the white-noise sequence w(n) yields the high-frequency noise sequence y(n), namely y(n) = w(n) + Σ_{i=1}^{L} a_i·y(n−i),
where a_i are the synthesis-filter coefficients; to limit the energy of the high-frequency part, the high-frequency noise sequence y(n) is normalized, namely y(n) = y(n) / √( Σ_{n=0}^{N−1} y(n)·y(n) ),
where N is the frame length, N = 128.
6. The method for artificial speech bandwidth expansion according to claim 1, characterized in that: the spectrum-shaping module uses the high-frequency log-spectral envelope hi_env(ω) estimated above to modulate the high-frequency noise sequence;
first, the high-frequency noise sequence y(n) is Fourier-transformed and then converted to the log domain, giving the log-spectral values C_y(ω) of the high-frequency noise sequence; the high-frequency log-spectral envelope is then used to modulate the spectrum of the high-frequency noise sequence, giving the log-spectral values C_wide(ω) of the high-frequency speech:
C_wide(ω) = C_y(ω)·hi_env(ω),
if the frequency-domain values of the high-frequency speech and its time-domain values are denoted S_wide(ω) and s_wide(n) respectively, then
S_wide(ω) = exp(C_wide(ω)), (1)
s_wide(n) = IFFT(S_wide(ω)), (2)
where exp(·) is the exponential operation and IFFT(·) is the inverse Fourier transform; through the inverse-transform steps of formulas (1) and (2), the high-frequency speech is obtained.
7. The method for artificial speech bandwidth expansion according to claim 1, characterized in that: in the speech-synthesis module, the narrowband signal with a sampling frequency of 8 kHz is upsampled to 16 kHz by interpolation, and its cepstrum is obtained through the cepstrum computation; the cepstrum of the high-frequency speech is obtained in the same way; the cepstra of the narrowband speech and of the high-frequency speech are each transformed to the frequency domain, and the frequency-domain amplitudes are processed as follows:
C_wide(ω) = C_narrow(ω) + C_high(ω),
where C_narrow(ω) and C_high(ω) are the frequency-domain values of the cepstra of the narrowband speech and of the high-frequency speech respectively, and C_wide(ω) is the frequency-domain value of the synthesized wideband cepstrum; an inverse Fourier transform then yields the cepstrum of the wideband speech, and finally the inverse of the cepstrum computation yields the synthesized wideband speech.
CN201310590362.6A 2013-11-20 2013-11-20 Method for expanding artificial voice bandwidth Pending CN104658547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310590362.6A CN104658547A (en) 2013-11-20 2013-11-20 Method for expanding artificial voice bandwidth

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310590362.6A CN104658547A (en) 2013-11-20 2013-11-20 Method for expanding artificial voice bandwidth

Publications (1)

Publication Number Publication Date
CN104658547A true CN104658547A (en) 2015-05-27

Family

ID=53249586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310590362.6A Pending CN104658547A (en) 2013-11-20 2013-11-20 Method for expanding artificial voice bandwidth

Country Status (1)

Country Link
CN (1) CN104658547A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206842A1 (en) * 2016-05-31 2017-12-07 华为技术有限公司 Voice signal processing method, and related device and system
US10218856B2 (en) 2016-05-31 2019-02-26 Huawei Technologies Co., Ltd. Voice signal processing method, related apparatus, and system
CN109688531A (en) * 2017-10-18 2019-04-26 宏达国际电子股份有限公司 Obtain method, electronic device and the recording medium of high-sound quality audio information converting
CN110322891A (en) * 2019-07-03 2019-10-11 南方科技大学 Voice signal processing method and device, terminal and storage medium
CN110322891B (en) * 2019-07-03 2021-12-10 南方科技大学 Voice signal processing method and device, terminal and storage medium
CN112562704A (en) * 2020-11-17 2021-03-26 中国人民解放军陆军工程大学 BLSTM-based frequency division spectrum expansion anti-noise voice conversion method
CN112562704B (en) * 2020-11-17 2023-08-18 中国人民解放军陆军工程大学 Frequency division topological anti-noise voice conversion method based on BLSTM

Similar Documents

Publication Publication Date Title
CN103258543B (en) Method for expanding artificial voice bandwidth
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
CN101976566B (en) Voice enhancement method and device using same
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN103021420B (en) Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation
CN102054480B (en) Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN108831499A (en) Utilize the sound enhancement method of voice existing probability
CN102664017B (en) Three-dimensional (3D) audio quality objective evaluation method
CN103440869A (en) Audio-reverberation inhibiting device and inhibiting method thereof
EP4191583A1 (en) Transient speech or audio signal encoding method and device, decoding method and device, processing system and computer-readable storage medium
CN101527141B (en) Method of converting whispered voice into normal voice based on radial group neutral network
JPS63259696A (en) Voice pre-processing method and apparatus
CN103474074B (en) Pitch estimation method and apparatus
CN102664003A (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN103413547A (en) Method for eliminating indoor reverberations
CN106997765B (en) Quantitative characterization method for human voice timbre
CN103440872A (en) Transient state noise removing method
CN104658547A (en) Method for expanding artificial voice bandwidth
CN107221334B (en) Audio bandwidth extension method and extension device
CN103345920B (en) Self-adaptation interpolation weighted spectrum model voice conversion and reconstructing method based on Mel-KSVD sparse representation
CN103093757B (en) Conversion method for conversion from narrow-band code stream to wide-band code stream
CN103559893B (en) One is target gammachirp cepstrum coefficient aural signature extracting method under water
CN103971697B (en) Sound enhancement method based on non-local mean filtering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150527