CN105989834A

CN105989834A - Voice recognition apparatus and voice recognition method

Info

Publication number: CN105989834A
Application number: CN201510059977.5A
Authority: CN
Inventors: 杜博仁; 张嘉仁; 曾凯盟
Original assignee: Acer Inc
Current assignee: Acer Inc
Priority date: 2015-02-05
Filing date: 2015-02-05
Publication date: 2016-10-05
Anticipated expiration: 2035-02-05
Also published as: CN105989834B

Abstract

The invention provides a voice recognition apparatus and a voice recognition method. The voice recognition apparatus and the voice recognition method can determine whether an original voice sampling signal corresponding to a target voice frame is noise according to at least one of a ratio between the first consonant frequency range signal energy and the second consonant frequency range signal energy, a ratio between the first consonant frequency range signal energy and the original voice sampling signal energy, and a ratio between the second consonant frequency range signal energy and the original voice sampling signal energy. The voice recognition apparatus and the voice recognition method can effectively recognize whether a voice signal is a consonant signal.

Description

Voice recognition apparatus and voice recognition method

Technical field

The invention relates to a recognition apparatus, and in particular to a voice recognition apparatus and a voice recognition method.

Background art

Hearing-impaired listeners often cannot clearly perceive higher-frequency voice signals, such as consonant signals, although they can hear low-frequency voice signals clearly. Existing consonant detection performs signal processing in the frequency domain and falls into two main types: non-real-time consonant detection and real-time consonant detection. Non-real-time consonant detection relies mainly on energy and zero-crossing rate. Real-time consonant detection mainly determines whether a voice signal is a consonant signal according to whether the ratio of the high-frequency signal to the total energy is greater than a fixed value and whether the ratio of the low-frequency signal to the total energy is less than a fixed value. Although the existing approaches can distinguish consonant signals from noise, their accuracy still falls short of practical requirements.

Summary of the invention

The present invention provides a voice recognition apparatus and a voice recognition method that can effectively recognize whether a voice signal is a consonant signal.

The voice recognition apparatus of the present invention includes a bandpass filtering unit and a processing unit. The bandpass filtering unit performs bandpass filtering of a first consonant frequency range and of a second consonant frequency range on a voice signal, to produce a first bandpass filtered signal and a second bandpass filtered signal respectively. The processing unit is coupled to the bandpass filtering unit and divides the voice signal, the first bandpass filtered signal and the second bandpass filtered signal into a plurality of voice frames, wherein each voice frame includes N sampling signals and N is a positive integer. The processing unit also calculates the energy of the sampling signals in a target voice frame to obtain an original voice sampling signal energy, a first consonant frequency range signal energy and a second consonant frequency range signal energy. It then determines whether the original voice sampling signal corresponding to the target voice frame is noise according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy.

In one embodiment of the invention, the processing unit determines whether the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy each fall within their corresponding preset ratio ranges. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

In one embodiment of the invention, the processing unit also calculates a weighted average of the energies of a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals, to obtain a noise signal energy weighted average, and determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

In one embodiment of the invention, the weight assigned to each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval between that voice frame and the target voice frame.

In one embodiment of the invention, the processing unit also determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value.

In one embodiment of the invention, the processing unit also calculates, over a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals, a weighted average of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, to obtain a first consonant energy ratio weighted average. It determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of the target voice frame is less than the first consonant energy ratio weighted average.

In one embodiment of the invention, the weight assigned to the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval between that voice frame and the target voice frame.

In one embodiment of the invention, the processing unit also determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio.

In one embodiment of the invention, the processing unit also determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy is greater than or equal to a lower limit value.

In one embodiment of the invention, the processing unit also calculates a first zero-crossing rate, a second zero-crossing rate and a third zero-crossing rate of the original voice sampling signal, and calculates the average zero-crossing rates of the original voice sampling signals of the target voice frame and a plurality of voice frames preceding the target voice frame, to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate. It determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first, second and third average zero-crossing rates are each greater than or equal to their corresponding preset average zero-crossing rates. The first, second and third zero-crossing rates are respectively the number of times the original voice sampling signal in the target voice frame passes through a first preset value, a second preset value and a third preset value, where the second preset value is less than the first preset value and greater than the third preset value.

In one embodiment of the invention, the processing unit also determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate.

The voice recognition method of the present invention includes the following steps. Bandpass filtering of a first consonant frequency range and of a second consonant frequency range is performed on a voice signal, to produce a first bandpass filtered signal and a second bandpass filtered signal respectively. The voice signal, the first bandpass filtered signal and the second bandpass filtered signal are divided into a plurality of voice frames, wherein each voice frame includes N sampling signals and N is a positive integer. The energy of the sampling signals in a target voice frame is calculated to obtain an original voice sampling signal energy, a first consonant frequency range signal energy and a second consonant frequency range signal energy. Whether the original voice sampling signal corresponding to the target voice frame is noise is determined according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy.

In one embodiment of the invention, the voice recognition method also includes the following steps. It is determined whether the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy each fall within their corresponding preset ratio ranges. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

In one embodiment of the invention, the voice recognition method also includes the following steps. A weighted average of the energies of a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals is calculated, to obtain a noise signal energy weighted average. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is determined according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

In one embodiment of the invention, the weight assigned to each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval between that voice frame and the target voice frame.

In one embodiment of the invention, the voice recognition method also includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value.

In one embodiment of the invention, the voice recognition method also includes the following steps. Over a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals, a weighted average of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is calculated, to obtain a first consonant energy ratio weighted average. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is determined according to whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of the target voice frame is less than the first consonant energy ratio weighted average.

In one embodiment of the invention, the weight assigned to the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval between that voice frame and the target voice frame.

In one embodiment of the invention, the voice recognition method also includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio.

In one embodiment of the invention, the voice recognition method also includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy is greater than or equal to a lower limit value.

In one embodiment of the invention, the voice recognition method also includes the following steps. A first zero-crossing rate, a second zero-crossing rate and a third zero-crossing rate of the original voice sampling signal are calculated, and the average zero-crossing rates of the original voice sampling signals of the target voice frame and the N voice frames preceding the target voice frame are calculated, to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate, wherein N is a positive integer. The first, second and third zero-crossing rates are respectively the number of times the original voice sampling signal in the target voice frame passes through a first preset value, a second preset value and a third preset value, where the second preset value is less than the first preset value and greater than the third preset value. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is determined according to whether the first, second and third average zero-crossing rates are each greater than or equal to their corresponding preset average zero-crossing rates.

In one embodiment of the invention, the voice recognition method also includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate.

In summary, embodiments of the invention determine whether the original voice sampling signal corresponding to the target voice frame is noise according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy. This reduces the cases in which an original voice sampling signal is misjudged as a consonant signal and thereby improves the accuracy of consonant signal recognition.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 is a schematic diagram of a voice recognition apparatus according to an embodiment of the invention;

Figs. 2A to 2B are schematic flowcharts of a voice recognition method according to an embodiment of the invention;

Figs. 3A to 3B are schematic flowcharts of a voice recognition method according to another embodiment of the invention.

Description of reference numerals:

102: bandpass filtering unit;

104: processing unit;

S1: voice signal;

S2: first bandpass filtered signal;

S3: second bandpass filtered signal;

S202 to S230, S302: steps of the voice recognition method.

Detailed description of the invention

Fig. 1 is a schematic diagram of a voice recognition apparatus according to an embodiment of the invention. Referring to Fig. 1, the voice recognition apparatus includes a bandpass filtering unit 102 and a processing unit 104, and the bandpass filtering unit 102 is coupled to the processing unit 104. The bandpass filtering unit 102 may be implemented, for example, with a bandpass filter, and the processing unit 104 may be implemented, for example, with a CPU, but they are not limited thereto. The bandpass filtering unit 102 performs bandpass filtering of a first consonant frequency range and of a second consonant frequency range on the voice signal S1, to produce a first bandpass filtered signal S2 and a second bandpass filtered signal S3 respectively. In this embodiment the first and second consonant frequency ranges are 2 kHz to 4 kHz and 4 kHz to 10 kHz respectively, but they are not limited thereto.

The processing unit 104 samples the voice signal S1, the first bandpass filtered signal S2 and the second bandpass filtered signal S3, and divides them into a plurality of voice frames, wherein each voice frame may include N sampling signals of the voice signal S1, N sampling signals of the first bandpass filtered signal S2 and N sampling signals of the second bandpass filtered signal S3. The processing unit 104 also calculates the energy of the sampling signals in each voice frame to obtain the original voice sampling signal energy, the first consonant frequency range signal energy and the second consonant frequency range signal energy, which correspond respectively to the energy of the sampling signals of the voice signal S1, of the first bandpass filtered signal S2 and of the second bandpass filtered signal S3 in that voice frame. After obtaining these three energies, the processing unit 104 can determine whether the original voice sampling signal corresponding to each voice frame is noise according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy.
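The framing and energy step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not define how frame energy is computed or whether frames overlap, so the sum of squared samples and non-overlapping frames are assumptions, and the function names are hypothetical. The inputs `s1`, `s2` and `s3` stand for already-sampled versions of S1, S2 and S3.

```python
from typing import List, Sequence, Tuple


def split_into_frames(samples: Sequence[float], n: int) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames of n samples.

    Assumption: a trailing partial frame is dropped; the source does not say
    how a remainder is handled.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [list(samples[i:i + n]) for i in range(0, len(samples) - n + 1, n)]


def frame_energy(frame: Sequence[float]) -> float:
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(x * x for x in frame)


def frame_energies(s1: Sequence[float], s2: Sequence[float],
                   s3: Sequence[float], n: int) -> List[Tuple[float, float, float]]:
    """Return one (E, EB1, EB2) tuple per frame for S1, S2 and S3."""
    return [
        (frame_energy(a), frame_energy(b), frame_energy(c))
        for a, b, c in zip(split_into_frames(s1, n),
                           split_into_frames(s2, n),
                           split_into_frames(s3, n))
    ]
```

Each returned tuple holds the three energies that the ratio tests operate on for one voice frame.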

Specifically, the processing unit 104 determines whether the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy each fall within their corresponding preset ratio ranges. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

For example, the processing unit 104 may determine whether the original voice sampling signal corresponding to a target voice frame (for example the m-th voice frame, where m is a positive integer) is noise by using the following formulas:

0.7 < \frac{{EB1}_{m}}{{EB2}_{m}} < 1.3 \qquad (1)

0.25 < \frac{{EB2}_{m}}{E_{m}} < 0.5 \qquad (2)

0.25 < \frac{{EB1}_{m}}{E_{m}} < 0.5 \qquad (3)

where EB1_m is the first consonant frequency range signal energy, EB2_m is the second consonant frequency range signal energy, and E_m is the original voice sampling signal energy. When formulas (1), (2) and (3) are all satisfied, the processing unit 104 determines that the original voice sampling signal of the m-th voice frame is a noise signal.
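Formulas (1) to (3) translate directly into a three-way range check. The sketch below uses the ranges given for this embodiment; the function name is hypothetical, and returning `False` when a ratio is undefined (zero energy) is an assumption, since the text does not cover that case.

```python
def is_noise_frame(e: float, eb1: float, eb2: float) -> bool:
    """Return True when formulas (1), (2) and (3) all hold for one frame.

    e   -- original voice sampling signal energy E_m
    eb1 -- first consonant frequency range signal energy EB1_m
    eb2 -- second consonant frequency range signal energy EB2_m
    """
    if e <= 0 or eb2 <= 0:
        # Assumption: a frame with an undefined ratio is not labeled noise.
        return False
    return (0.7 < eb1 / eb2 < 1.3        # formula (1)
            and 0.25 < eb2 / e < 0.5     # formula (2)
            and 0.25 < eb1 / e < 0.5)    # formula (3)
```

A frame whose two band energies are roughly equal and each carry between a quarter and a half of the total energy is therefore treated as noise.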

After determining whether the original voice sampling signal of the target voice frame is a noise signal, the processing unit 104 also calculates a weighted average of the energies of a plurality of voice frames, preceding the target voice frame, whose original voice sampling signals were determined to be noise signals, to obtain a noise signal energy weighted average. It then determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

For example, the noise signal energy weighted average may be obtained by calculating the weighted average of the energies of the 3 voice frames, preceding the target voice frame, whose original voice sampling signals were determined to be noise signals. Assuming that the three voice frames most recently determined to be noise before the m-th voice frame are the (m-10)-th, (m-12)-th and (m-20)-th voice frames, the noise signal energy weighted average AK_m corresponding to the m-th voice frame may be expressed as:

{AK}_{m} = \frac{a0 \times E_{m-10} + a1 \times E_{m-12} + a2 \times E_{m-20}}{a0 + a1 + a2} \qquad (4)

where E_{m-10}, E_{m-12} and E_{m-20} are the original voice sampling signal energies of the (m-10)-th, (m-12)-th and (m-20)-th voice frames respectively, and a0, a1 and a2 are the weights corresponding to the (m-10)-th, (m-12)-th and (m-20)-th voice frames respectively. The weights a0, a1 and a2 may be fixed or variable. For example, the weight of each voice frame whose original voice sampling signal was determined to be a noise signal may vary with the interval between that voice frame and the target voice frame; in this embodiment, a0, a1 and a2 may vary with the interval between the corresponding voice frame and the m-th voice frame. When the following formula is satisfied for the noise signal energy weighted average AK_m, the original voice sampling signal corresponding to the m-th voice frame can be determined to be a consonant signal:

E_{m} > {AK}_{m} \qquad (5)
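Formulas (4) and (5) amount to a weighted average followed by a comparison. The following is a minimal sketch with hypothetical function names; the weights are supplied by the caller because the text leaves them as either fixed or variable values.

```python
from typing import Sequence


def noise_energy_weighted_average(energies: Sequence[float],
                                  weights: Sequence[float]) -> float:
    """AK_m per formula (4): weighted average of earlier noise-frame energies."""
    if not energies or len(energies) != len(weights):
        raise ValueError("energies and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * e for w, e in zip(weights, energies)) / total


def passes_formula_5(e_m: float, ak_m: float) -> bool:
    """Formula (5): the target frame energy exceeds the noise average."""
    return e_m > ak_m
```

For instance, with energies 10, 20 and 30 for the three most recent noise frames and weights 3, 2 and 1, AK_m is 100/6, so a target frame with energy 20 satisfies formula (5) while one with energy 16 does not.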

In addition, the processing unit 104 may calculate, over a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals, a weighted average of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, to obtain a first consonant energy ratio weighted average. It then determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of the target voice frame is less than the first consonant energy ratio weighted average. For example, the first consonant energy ratio weighted average may be obtained by calculating the weighted average of this ratio over the 3 voice frames, preceding the target voice frame, whose original voice sampling signals were determined to be noise signals. Assuming that the three voice frames most recently determined to be noise before the m-th voice frame are the (m-10)-th, (m-12)-th and (m-20)-th voice frames, the first consonant energy ratio weighted average AF_m corresponding to the m-th voice frame may be expressed as:

{AF}_{m} = \frac{c0 \times \frac{{EB1}_{m-10}}{E_{m-10}} + c1 \times \frac{{EB1}_{m-12}}{E_{m-12}} + c2 \times \frac{{EB1}_{m-20}}{E_{m-20}}}{c0 + c1 + c2} \qquad (6)

where EB1_{m-10}, EB1_{m-12} and EB1_{m-20} are the first consonant frequency range signal energies of the (m-10)-th, (m-12)-th and (m-20)-th voice frames respectively, E_{m-10}, E_{m-12} and E_{m-20} are the original voice sampling signal energies of those voice frames, and c0, c1 and c2 are the weights corresponding to the (m-10)-th, (m-12)-th and (m-20)-th voice frames respectively. The weights c0, c1 and c2 may be fixed or variable. For example, the weight of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy of each voice frame whose original voice sampling signal was determined to be a noise signal may vary with the interval between that voice frame and the target voice frame; in this embodiment, c0, c1 and c2 may vary with the interval between the corresponding voice frame and the m-th voice frame. When the following formula is satisfied for the first consonant energy ratio weighted average AF_m, the original voice sampling signal corresponding to the m-th voice frame can be determined to be a consonant signal:

\frac{{EB1}_{m}}{E_{m}} < {AF}_{m} \qquad (7)
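As an illustration of formulas (6) and (7) with interval-dependent weights, the sketch below weights each earlier noise frame by the reciprocal of its distance from the target frame. That weighting rule is purely hypothetical: the text only says the weights may vary with the interval, without giving a rule. The function names are likewise assumptions.

```python
from typing import List, Sequence, Tuple


def distance_weights(offsets: Sequence[int]) -> List[float]:
    """Hypothetical rule: a noise frame k frames back gets weight 1/k."""
    return [1.0 / k for k in offsets]


def first_consonant_ratio_average(history: Sequence[Tuple[float, float]],
                                  weights: Sequence[float]) -> float:
    """AF_m per formula (6).

    history holds (EB1, E) pairs of earlier noise frames; E is assumed to be
    positive, which holds for frames that passed formulas (2) and (3).
    """
    if not history or len(history) != len(weights):
        raise ValueError("history and weights must be non-empty and equal length")
    total = sum(weights)
    return sum(w * eb1 / e for w, (eb1, e) in zip(weights, history)) / total


def passes_formula_7(eb1_m: float, e_m: float, af_m: float) -> bool:
    """Formula (7): the target frame's EB1/E ratio is below AF_m."""
    return e_m > 0 and eb1_m / e_m < af_m
```

With noise frames at offsets 10, 12 and 20 whose EB1/E ratios are 0.3, 0.4 and 0.5, this rule gives AF_m = 53/140, roughly 0.379, so nearer noise frames pull the average toward their own ratio.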

Additionally, the processing unit 104 may determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value. For the m-th voice frame, for example, this determination can be expressed by the following formula:

\frac{{EB1}_{m}}{E_{m}} + \frac{{EB2}_{m}}{E_{m}} \geq 1 \qquad (8)

In this embodiment the preset sum value is 1, but it is not limited thereto; the preset sum value may be adjusted to other values according to the actual situation.

The processing unit 104 may also determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio. For the m-th voice frame, for example, this determination can be expressed by the following formula:

\frac{{EB2}_{m}}{E_{m}} \geq 0.8 \qquad (9)

In this embodiment the preset ratio is 0.8, but it is not limited thereto. In some embodiments the preset ratio may take other values, as shown below:

\frac{{EB2}_{m}}{E_{m}} \geq 0.35 \qquad (10)

In formula (10), the preset ratio is 0.35.

In addition, the processing unit 104 may determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy is greater than or equal to a lower limit value. For the m-th voice frame, for example, this determination can be expressed by the following formula:

E_{m} \geq 50 \qquad (11)

In this embodiment the lower limit value is 50, but it is not limited thereto; in some embodiments the lower limit value may be adjusted according to the actual situation.
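The three energy-based checks of formulas (8) to (11) can be evaluated together, as in the minimal sketch below. The function and key names are hypothetical; the default thresholds are the values of this embodiment, and passing `preset_ratio=0.35` selects formula (10) in place of formula (9). Reporting every check as failed for a frame with zero energy is an assumption.

```python
from typing import Dict


def energy_conditions(e: float, eb1: float, eb2: float,
                      preset_sum: float = 1.0,
                      preset_ratio: float = 0.8,
                      lower_limit: float = 50.0) -> Dict[str, bool]:
    """Evaluate the sum, ratio and lower-limit checks for one frame."""
    if e <= 0:
        # Assumption: ratios are undefined, so no check is reported as met.
        return {"sum": False, "ratio": False, "floor": False}
    return {
        "sum": eb1 / e + eb2 / e >= preset_sum,   # formula (8)
        "ratio": eb2 / e >= preset_ratio,          # formula (9) or (10)
        "floor": e >= lower_limit,                 # formula (11)
    }
```

Returning the individual results rather than a single flag leaves the caller free to combine them with the other conditions of the description.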

Because the energy of a consonant signal can vary, its lower-energy portions may be mistaken for noise. To avoid this, in addition to determining whether the original voice sampling signal is a consonant signal according to energy as described above, the processing unit 104 may also make the determination according to zero-crossing rates. The processing unit 104 may calculate a first zero-crossing rate, a second zero-crossing rate and a third zero-crossing rate of the original voice sampling signal, and calculate the average zero-crossing rates of the original voice sampling signals of the target voice frame and a plurality of voice frames preceding the target voice frame, to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate. It then determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first, second and third average zero-crossing rates are each greater than or equal to their corresponding preset average zero-crossing rates. The first, second and third zero-crossing rates are respectively the number of times the original voice sampling signal in the target voice frame passes through a first preset value, a second preset value and a third preset value, where the second preset value is less than the first preset value and greater than the third preset value.

For the m-th voice frame, the original zero-crossing rate Z_m^0 may be expressed as:

Z_{m}^{0} = \sum_{j=1}^{N-1} 0.5 \left\{ \mathrm{sgn}[\hat{x}_{m}(mL + j)] - \mathrm{sgn}[\hat{x}_{m}(mL + j - 1)] \right\} \qquad (12)

where N is a positive integer representing the number of sampling signals in the m-th voice frame, mL is the amplitude threshold value, and \hat{x}_{m} is the original voice sampling signal in the m-th voice frame. The processing unit 104 may determine whether the original voice sampling signal is a consonant signal according to whether Z_m^0 is greater than or equal to a preset zero-crossing rate, for example according to the following formula:

Z_{m}^{0} \geq 22 \qquad (13)

The preset zero-crossing rate is not limited to 22; in some embodiments its value may be adjusted according to the actual situation. Additionally, the processing unit 104 may determine whether the original voice sampling signal is a consonant signal according to zero-crossing rates Z_m^+ and Z_m^- of the original voice sampling signal that additionally include an energy condition. The zero-crossing rates Z_m^+ and Z_m^- may be expressed as:

Z_{m}^{+} = \sum_{j=1}^{N-1} 0.5 \left\{ \mathrm{sgn}[x_{m}^{+}(mL + j)] - \mathrm{sgn}[x_{m}^{+}(mL + j - 1)] \right\} \qquad (14)

Z_{m}^{-} = \sum_{j=1}^{N-1} 0.5 \left\{ \mathrm{sgn}[x_{m}^{-}(mL + j)] - \mathrm{sgn}[x_{m}^{-}(mL + j - 1)] \right\} \qquad (15)

where x_{m}^{+} and x_{m}^{-} can be expressed as follows:

x_{m}^{+}(j) = \hat{x}_{m}(j+mL) - \alpha_{x} F_{m} \qquad (16)

x_{m}^{-}(j) = \hat{x}_{m}(j+mL) + \alpha_{x} F_{m} \qquad (17)
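Formulas (14) to (17) amount to counting crossings after shifting the frame down or up by α_x F_m. The sketch below illustrates this under assumptions: the function names are hypothetical, F_m is passed in as a plain number (`f_m`) whose computation is not shown here, the absolute value of the sign difference is used, and sgn(0) is treated as +1.

```python
def _sign_changes(samples):
    """Count sign changes between consecutive samples (zero counts as positive)."""
    signs = [1 if s >= 0 else -1 for s in samples]
    return sum(0.5 * abs(signs[j] - signs[j - 1]) for j in range(1, len(signs)))


def offset_zero_crossing_counts(frame, f_m, alpha_x=0.5):
    """Return (Z_plus, Z_minus) for one voice frame.

    frame   -- samples of the original voice sampling signal in the frame
    f_m     -- the frame quantity F_m from formulas (16) and (17), supplied
               by the caller (assumed to be computed elsewhere)
    alpha_x -- scaling factor; 0.5 is the value given for the embodiment
    """
    shift = alpha_x * f_m
    x_plus = [s - shift for s in frame]   # formula (16)
    x_minus = [s + shift for s in frame]  # formula (17)
    return _sign_changes(x_plus), _sign_changes(x_minus)
```

With a small `f_m`, both counts stay close to the unshifted count; with a shift larger than the signal amplitude, both counts drop to zero because the shifted signal no longer changes sign.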

In the present embodiment, the value of α_x is 0.5, but it is not limited thereto; in some embodiments its value may be adjusted according to the actual situation. By adjusting the reference level against which zero crossings are counted in this way, whether the original voice sampling signal is a consonant signal can be determined more accurately. Processing unit 104 may also make this determination according to the average zero-crossing rate over a plurality of voice frames. For example, for the m-th voice frame, the determination may be based on the average of its zero-crossing rate and those of the two most recent voice frames (namely the (m-1)-th and (m-2)-th voice frames), using the following formulas:

\frac{Z_{m}^{0} + Z_{m-1}^{0} + Z_{m-2}^{0}}{3} \geq 34 \qquad (18)

\frac{Z_{m}^{+} + Z_{m-1}^{+} + Z_{m-2}^{+}}{3} \geq 30 \qquad (19)

\frac{Z_{m}^{-} + Z_{m-1}^{-} + Z_{m-2}^{-}}{3} \geq 30 \qquad (20)
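The three averaged conditions (18) to (20) can be checked together as sketched below. The function name and the history-list layout (oldest count first, the m-th frame last) are assumptions made for illustration; the thresholds 34, 30, and 30 are the example values from the formulas.

```python
def averaged_conditions_hold(z0_history, zplus_history, zminus_history,
                             thresholds=(34, 30, 30)):
    """Check formulas (18), (19), and (20) for the most recent frame.

    Each history list holds per-frame zero-crossing counts, oldest first,
    so the last three entries belong to frames m-2, m-1, and m.
    """
    def mean_of_last_three(history):
        if len(history) < 3:
            raise ValueError("need counts for frames m-2, m-1, and m")
        return sum(history[-3:]) / 3

    means = (
        mean_of_last_three(z0_history),      # formula (18)
        mean_of_last_three(zplus_history),   # formula (19)
        mean_of_last_three(zminus_history),  # formula (20)
    )
    return all(mean >= limit for mean, limit in zip(means, thresholds))
```

Averaging over three frames lets a single frame with a slightly low count still pass when its neighbours are well above the threshold.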

As described in the above embodiments, processing unit 104 may determine whether the original voice sampling signal is a consonant signal according to at least one of energy and zero-crossing rate; that is, processing unit 104 may combine at least one of the conditions of the above formulas to determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal. For example, processing unit 104 may determine whether formulas (5), (7), (9), (11), (13), (18), (19), and (20) are all satisfied simultaneously, and if so, determine that the original voice sampling signal corresponding to the target voice frame is a consonant signal. As another example, processing unit 104 may determine whether formulas (5), (8), (10), (11), (13), (18), (19), and (20) are all satisfied simultaneously, and if so, determine that the original voice sampling signal corresponding to the target voice frame is a consonant signal.

FIG. 2A to FIG. 2B are schematic flowcharts of a voice recognition method according to an embodiment of the invention. Referring to FIG. 2A to FIG. 2B, and as can be seen from the above embodiments, the voice recognition method of the voice recognition apparatus may include the following steps. First, band-pass filtering of a first consonant frequency range and a second consonant frequency range is performed on a voice signal to produce a first band-pass filtered signal and a second band-pass filtered signal, respectively (step S202). Next, the voice signal, the first band-pass filtered signal, and the second band-pass filtered signal are divided into a plurality of voice frames (step S204), each voice frame including N sampling signals, where N is a positive integer. Next, the energy of the sampling signals in the target voice frame is calculated to obtain an original voice sampling signal energy, a first consonant frequency range signal energy, and a second consonant frequency range signal energy (step S206). Then, whether the original voice sampling signal corresponding to the target voice frame is noise is determined according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy (step S208). For example, it may be determined whether these three ratios each fall within a corresponding preset ratio range; if they do, the original voice sampling signal of the target voice frame is a noise signal.
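The ratio-range test of steps S206 to S208 can be sketched as follows. This is a hedged illustration, not the patented implementation: the function names are hypothetical, frame energy is assumed to be the sum of squared samples, the preset ratio ranges are caller-supplied placeholders, and treating a frame with zero energy as "not noise" is an arbitrary choice made only to avoid dividing by zero.

```python
def frame_energy(samples):
    """Energy of one frame; sum of squares is an assumption of this sketch."""
    return sum(s * s for s in samples)


def is_noise_frame(e_original, e_first, e_second, ranges):
    """Step S208: all three energy ratios must fall inside their preset ranges.

    ranges -- three (low, high) pairs, in the order
              first/second, first/original, second/original
              (the numeric limits are hypothetical inputs)
    """
    if e_original <= 0 or e_second <= 0:
        return False  # assumption: skip the test when a ratio is undefined
    ratios = (
        e_first / e_second,
        e_first / e_original,
        e_second / e_original,
    )
    return all(low <= r <= high for r, (low, high) in zip(ratios, ranges))
```

In use, `frame_energy` would be applied to the same target voice frame of the voice signal and of the two band-pass filtered signals, and the three results passed to `is_noise_frame`.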

Afterwards, a weighted average of the energies of a plurality of previous voice frames whose original voice sampling signals were determined to be noise signals is calculated to obtain a noise signal energy weighted average (step S210). It is then determined whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average (step S212), where the weight value of each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval length between that voice frame and the target voice frame. If the original voice sampling signal energy corresponding to the target voice frame is not greater than the noise signal energy weighted average, the original voice sampling signal corresponding to the target voice frame is determined not to be a consonant signal (step S214). Conversely, if the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average, a weighted average of the ratios of the first consonant frequency range signal energy to the original voice sampling signal energy for the plurality of previous voice frames determined to be noise signals is calculated to obtain a first consonant energy ratio weighted average (step S216). It is then determined whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to the target voice frame is less than the first consonant energy ratio weighted average (step S218), where the weight value of each such ratio varies with the interval length between the corresponding noise voice frame and the target voice frame.
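The weighted averages of steps S210 and S216 can be sketched with one helper. The text only states that the weight varies with the interval between a noise frame and the target frame; the default weighting used here (1 divided by the gap in frames, so nearer noise frames count more) and all names are assumptions chosen purely for illustration.

```python
def weighted_noise_average(noise_history, target_index, weight_for_gap=None):
    """Weighted average of values taken from earlier noise frames.

    noise_history  -- list of (frame_index, value) pairs for frames that were
                      determined to be noise; value is an energy (S210) or a
                      first-band energy ratio (S216)
    target_index   -- index of the target voice frame
    weight_for_gap -- maps the gap in frames to a weight; the 1/gap default
                      is a hypothetical choice, not taken from the source
    """
    if weight_for_gap is None:
        weight_for_gap = lambda gap: 1.0 / gap
    weighted_sum = 0.0
    total_weight = 0.0
    for frame_index, value in noise_history:
        gap = target_index - frame_index
        if gap <= 0:
            raise ValueError("noise frames must precede the target frame")
        weight = weight_for_gap(gap)
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no usable noise frames")
    return weighted_sum / total_weight
```

Step S212 then compares the target frame energy with the value returned for the energy history, and step S218 compares the target frame's first-band ratio with the value returned for the ratio history.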

If the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to the target voice frame is not less than the first consonant energy ratio weighted average, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214). Conversely, if that ratio is less than the first consonant energy ratio weighted average, it is then determined whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio (step S220). If the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is not greater than or equal to the preset ratio, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214). Conversely, if that ratio is greater than or equal to the preset ratio, it is then determined whether the original voice sampling signal energy is greater than or equal to a lower limit (step S222). If the original voice sampling signal energy is not greater than or equal to the lower limit, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214).

Conversely, if the original voice sampling signal energy is greater than or equal to the lower limit, the first zero-crossing rate, the second zero-crossing rate, and the third zero-crossing rate of the original voice sampling signal are calculated, and the average zero-crossing rates of the original voice sampling signal over the target voice frame and a plurality of voice frames preceding it are calculated, to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate (step S224). The first, second, and third zero-crossing rates are the numbers of times the original voice sampling signal in the target voice frame crosses a first preset value, a second preset value, and a third preset value, respectively, where the second preset value is less than the first preset value and greater than the third preset value. It is then determined whether the first, second, and third average zero-crossing rates are each greater than or equal to their corresponding preset average zero-crossing rates (step S226). If the first, second, and third average zero-crossing rates are not all greater than or equal to their corresponding preset average zero-crossing rates, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214). Conversely, if they are all greater than or equal to their corresponding preset average zero-crossing rates, it is then determined whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate (step S228). If the second zero-crossing rate is not greater than or equal to the preset zero-crossing rate, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214). Conversely, if the second zero-crossing rate is greater than or equal to the preset zero-crossing rate, the original voice sampling signal corresponding to the target voice frame is a consonant signal (step S230).
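The chain of checks from step S212 through step S230 can be summarized as a sequence of early exits. The sketch below assumes the per-frame quantities have already been computed; the class and field names are hypothetical, `preset_ratio` and `lower_limit` have no defaults because no values are given for them here, and the defaults 30, 34, 30, and 22 assume that the first, second, and third rates correspond to Z_m^+, Z_m^0, and Z_m^- in formulas (13) and (18) to (20).

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    energy: float        # original voice sampling signal energy
    first_ratio: float   # first-band energy / original energy
    second_ratio: float  # second-band energy / original energy
    avg_zcr: tuple       # (first, second, third) average zero-crossing rates
    second_zcr: float    # second zero-crossing rate of the target frame


@dataclass
class Limits:
    noise_energy_mean: float       # result of step S210
    first_ratio_noise_mean: float  # result of step S216
    preset_ratio: float            # hypothetical value supplied by the caller
    lower_limit: float             # hypothetical value supplied by the caller
    preset_avg_zcr: tuple = (30, 34, 30)
    preset_zcr: float = 22


def is_consonant(frame, limits):
    """Return True only when every check on the way to step S230 passes."""
    if not frame.energy > limits.noise_energy_mean:            # S212
        return False
    if not frame.first_ratio < limits.first_ratio_noise_mean:  # S218
        return False
    if not frame.second_ratio >= limits.preset_ratio:          # S220
        return False
    if not frame.energy >= limits.lower_limit:                 # S222
        return False
    if not all(a >= p for a, p in zip(frame.avg_zcr, limits.preset_avg_zcr)):
        return False                                           # S226
    return frame.second_zcr >= limits.preset_zcr               # S228
```

Every `return False` corresponds to step S214, so the order of the checks affects only how early a non-consonant frame is rejected, not the final result.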

FIG. 3A to FIG. 3B are schematic flowcharts of a voice recognition method according to an embodiment of the invention; please refer to FIG. 3A to FIG. 3B. The present embodiment differs from the embodiment of FIG. 2A to FIG. 2B in that, after it is determined in step S212 that the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average, it is then determined whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value (step S302). If the sum is not greater than or equal to the preset sum value, the original voice sampling signal corresponding to the target voice frame is not a consonant signal (step S214). Conversely, if the sum is greater than or equal to the preset sum value, the method proceeds directly to step S220 to determine whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to the preset ratio, and the subsequent steps of the voice recognition method continue as in the embodiment of FIG. 2A to FIG. 2B.
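The front end of the FIG. 3A to FIG. 3B variant can be sketched as a two-step gate that decides whether the flow reaches step S220. The function and parameter names are hypothetical, and the preset sum value is a caller-supplied placeholder.

```python
def fig3_reaches_s220(energy, noise_energy_mean,
                      first_ratio, second_ratio, preset_sum):
    """Return True when the flow of FIG. 3A to FIG. 3B proceeds to step S220.

    energy            -- original voice sampling signal energy of the target frame
    noise_energy_mean -- noise signal energy weighted average (step S210)
    first_ratio       -- first-band energy / original energy
    second_ratio      -- second-band energy / original energy
    preset_sum        -- preset sum value (hypothetical input)
    """
    if not energy > noise_energy_mean:                 # S212
        return False                                   # S214
    return first_ratio + second_ratio >= preset_sum    # S302
```

Compared with the FIG. 2A to FIG. 2B flow, this gate needs no history of first-band ratios from earlier noise frames, only the two ratios of the target frame itself.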

In summary, embodiments of the invention may combine at least one of the conditions of the above formulas to determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal, thereby improving the recognition accuracy of consonant signals. For example, whether the original voice sampling signal corresponding to the target voice frame is noise may be determined according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy, so as to reduce the cases in which an original voice sampling signal is misjudged as a consonant signal and thereby improve the recognition accuracy of consonant signals.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention.

Claims

1. A voice recognition apparatus, characterized by comprising:

a band-pass filtering unit, which performs band-pass filtering of a first consonant frequency range and a second consonant frequency range on a voice signal to produce a first band-pass filtered signal and a second band-pass filtered signal, respectively; and

a processing unit, coupled to the band-pass filtering unit, which divides the voice signal, the first band-pass filtered signal, and the second band-pass filtered signal into a plurality of voice frames, each voice frame including N sampling signals, N being a positive integer; calculates the energy of the sampling signals in a target voice frame to obtain an original voice sampling signal energy, a first consonant frequency range signal energy, and a second consonant frequency range signal energy; and determines whether the original voice sampling signal corresponding to the target voice frame is noise according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy.

2. The voice recognition apparatus according to claim 1, characterized in that the processing unit further determines whether the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy each fall within a corresponding preset ratio range, and if these ratios each fall within the corresponding preset ratio range, the original voice sampling signal of the target voice frame is a noise signal.

3. The voice recognition apparatus according to claim 1, characterized in that the processing unit further calculates a weighted average of the energies of a plurality of previous voice frames whose original voice sampling signals were determined to be noise signals, to obtain a noise signal energy weighted average, and determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

4. The voice recognition apparatus according to claim 3, characterized in that the weight value corresponding to each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval length between that voice frame and the target voice frame.

5. The voice recognition apparatus according to claim 3, characterized in that the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value.

6. The voice recognition apparatus according to claim 5, characterized in that the processing unit further calculates a weighted average of the ratios of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to a plurality of previous voice frames whose original voice sampling signals were determined to be noise signals, to obtain a first consonant energy ratio weighted average, and determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to the target voice frame is less than the first consonant energy ratio weighted average.

7. The voice recognition apparatus according to claim 6, characterized in that the weight value of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval length between that voice frame and the target voice frame.

8. The voice recognition apparatus according to claim 6, characterized in that the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio.

9. The voice recognition apparatus according to claim 8, characterized in that the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy is greater than or equal to a lower limit.

10. The voice recognition apparatus according to claim 9, characterized in that the processing unit further calculates a first zero-crossing rate, a second zero-crossing rate, and a third zero-crossing rate of the original voice sampling signal, calculates the average zero-crossing rates of the original voice sampling signal over the target voice frame and a plurality of voice frames preceding the target voice frame to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate, and determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first average zero-crossing rate, the second average zero-crossing rate, and the third average zero-crossing rate are each greater than or equal to their corresponding preset average zero-crossing rates, wherein the first zero-crossing rate, the second zero-crossing rate, and the third zero-crossing rate are the numbers of times the original voice sampling signal in the target voice frame crosses a first preset value, a second preset value, and a third preset value, respectively, and the second preset value is less than the first preset value and greater than the third preset value.

11. The voice recognition apparatus according to claim 10, characterized in that the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate.

12. A voice recognition method, characterized by comprising:

performing band-pass filtering of a first consonant frequency range and a second consonant frequency range on a voice signal to produce a first band-pass filtered signal and a second band-pass filtered signal, respectively;

dividing the voice signal, the first band-pass filtered signal, and the second band-pass filtered signal into a plurality of voice frames, each voice frame including N sampling signals, N being a positive integer;

calculating the energy of the sampling signals in a target voice frame to obtain an original voice sampling signal energy, a first consonant frequency range signal energy, and a second consonant frequency range signal energy; and

determining whether the original voice sampling signal corresponding to the target voice frame is noise according to the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy.

13. The voice recognition method according to claim 12, characterized by further comprising:

determining whether the ratio of the first consonant frequency range signal energy to the second consonant frequency range signal energy, the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy each fall within a corresponding preset ratio range; and

if these ratios each fall within the corresponding preset ratio range, determining that the original voice sampling signal of the target voice frame is a noise signal.

14. The voice recognition method according to claim 12, characterized by further comprising:

calculating a weighted average of the energies of a plurality of previous voice frames whose original voice sampling signals were determined to be noise signals, to obtain a noise signal energy weighted average; and

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

15. The voice recognition method according to claim 14, characterized in that the weight value corresponding to each voice frame whose original voice sampling signal was determined to be a noise signal varies with the interval length between that voice frame and the target voice frame.

16. The voice recognition method according to claim 14, characterized by further comprising:

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value.

17. The voice recognition method according to claim 16, characterized by further comprising:

calculating a weighted average of the ratios of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to a plurality of previous voice frames whose original voice sampling signals were determined to be noise signals, to obtain a first consonant energy ratio weighted average; and

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to the target voice frame is less than the first consonant energy ratio weighted average.

18. The voice recognition method according to claim 17, characterized in that the weight value of the ratio of the first consonant frequency range signal energy to the original voice sampling signal energy corresponding to each original voice sampling signal determined to be a noise signal varies with the interval length between the voice frame of that original voice sampling signal and the target voice frame.

19. The voice recognition method according to claim 17, characterized by further comprising:

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the second consonant frequency range signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio.

20. The voice recognition method according to claim 19, characterized by further comprising:

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the original voice sampling signal energy is greater than or equal to a lower limit.

21. The voice recognition method according to claim 20, characterized by further comprising:

calculating a first zero-crossing rate, a second zero-crossing rate, and a third zero-crossing rate of the original voice sampling signal, and calculating the average zero-crossing rates of the original voice sampling signal over the target voice frame and the N voice frames preceding the target voice frame, to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate, wherein N is a positive integer, the first zero-crossing rate, the second zero-crossing rate, and the third zero-crossing rate are the numbers of times the original voice sampling signal in the target voice frame crosses a first preset value, a second preset value, and a third preset value, respectively, and the second preset value is less than the first preset value and greater than the third preset value; and

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first average zero-crossing rate, the second average zero-crossing rate, and the third average zero-crossing rate are each greater than or equal to their corresponding preset average zero-crossing rates.

22. The voice recognition method according to claim 21, characterized by further comprising:

determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate.