CN105989834B

CN105989834B - Voice recognition device and voice recognition method

Info

Publication number: CN105989834B
Application number: CN201510059977.5A
Authority: CN
Inventors: 杜博仁; 张嘉仁; 曾凯盟
Original assignee: Acer Inc
Current assignee: Acer Inc
Priority date: 2015-02-05
Filing date: 2015-02-05
Publication date: 2019-12-24
Anticipated expiration: 2035-02-05
Also published as: CN105989834A

Abstract

The invention provides a voice recognition device and a voice recognition method. Whether the original voice sampling signal corresponding to a target voice frame is noise is judged according to three ratios: the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy. The invention can effectively identify whether a voice signal is a consonant signal.

Description

Voice recognition device and voice recognition method

Technical Field

The present invention relates to a recognition device, and more particularly, to a voice recognition device and a voice recognition method.

Background

Generally, hearing-impaired people often cannot clearly hear higher-frequency voice signals, such as consonant signals, although they can clearly hear low-frequency voice signals. Existing consonant signal judging methods process signals in the frequency domain and fall into two main types: non-real-time consonant signal judgment and real-time consonant signal judgment. Non-real-time consonant signal judgment relies mainly on energy and zero-crossing rate. Real-time consonant signal judgment mainly determines whether the speech signal is a consonant signal according to whether the ratio of the high-frequency signal energy to the total energy is greater than a fixed value and whether the ratio of the low-frequency signal energy to the total energy is less than a fixed value. Although the existing consonant signal judging methods can distinguish consonant signals from noise, their accuracy cannot meet actual requirements.

Disclosure of Invention

The invention provides a voice recognition device and a voice recognition method, which can effectively recognize whether a voice signal is a consonant signal.

The voice recognition device comprises a band-pass filtering unit and a processing unit. The band-pass filtering unit performs band-pass filtering of a first consonant frequency band and a second consonant frequency band on the voice signal to generate a first band-pass filtered signal and a second band-pass filtered signal, respectively. The processing unit is coupled to the band-pass filtering unit and divides the voice signal, the first band-pass filtered signal, and the second band-pass filtered signal into a plurality of voice frames, wherein each voice frame comprises N sampling signals and N is a positive integer. The processing unit also calculates the energy of the sampling signals in a target voice frame to obtain an original voice sampling signal energy, a first consonant frequency band signal energy, and a second consonant frequency band signal energy. The processing unit then judges whether the original voice sampling signal corresponding to the target voice frame is noise according to the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy.

In an embodiment of the invention, the processing unit determines whether the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy each fall within a corresponding preset ratio range. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

In an embodiment of the invention, the processing unit further calculates an energy weighted average of original voice sampling signals of a plurality of voice frames of the original voice sampling signals previously determined as noise signals to obtain a noise energy weighted average, and determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sampling signal corresponding to the target voice frame is greater than the noise energy weighted average.

In an embodiment of the present invention, the weighted value of the speech frame corresponding to each original speech sample signal determined as a noise signal is changed according to the length of the interval between the speech frame corresponding to each original speech sample signal determined as a noise signal and the target speech frame.

In an embodiment of the invention, the processing unit further determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether a sum of a ratio of the energy of the second consonant frequency band signal to the energy of the original speech sample signal and a ratio of the energy of the first consonant frequency band signal to the energy of the original speech sample signal is greater than or equal to a preset sum.

In an embodiment of the invention, the processing unit further calculates a weighted average of ratios of energy of a first consonant band signal corresponding to a plurality of speech frames of the original speech sample signal previously determined as the noise signal to energy of the original speech sample signal, so as to obtain a first consonant energy ratio weighted average, and determines whether the original speech sample signal corresponding to the target speech frame is the consonant signal according to whether the ratio of the energy of the first consonant band signal corresponding to the target speech frame to the energy of the original speech sample signal is smaller than the first consonant energy ratio weighted average.

In an embodiment of the present invention, a weighted value of a ratio of energy of the first consonant band signal to energy of the original voice sample signal corresponding to each voice frame of the original voice sample signal determined as the noise signal changes with a difference in length of an interval between the voice frame of the original voice sample signal determined as the noise signal and the target voice frame.

In an embodiment of the invention, the processing unit further determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether a ratio of the energy of the second consonant frequency band signal to the energy of the original speech sample signal is greater than or equal to a predetermined ratio.

In an embodiment of the invention, the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sampling signal is greater than or equal to a lower limit value.

In an embodiment of the invention, the processing unit further calculates a first zero-crossing rate, a second zero-crossing rate, and a third zero-crossing rate of the original voice sampling signal, and calculates the average zero-crossing rates of the original voice sampling signals of the target voice frame and a plurality of voice frames before the target voice frame to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate. The processing unit determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first, second, and third average zero-crossing rates are each greater than or equal to the corresponding preset average zero-crossing rate. The first, second, and third zero-crossing rates are the numbers of times the original voice sampling signal crosses a first preset value, a second preset value, and a third preset value, respectively, within the target voice frame, and the second preset value is smaller than the first preset value and larger than the third preset value.

In an embodiment of the invention, the processing unit further determines whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a predetermined zero-crossing rate.

The voice recognition method of the invention comprises the following steps. Band-pass filtering of a first consonant frequency band and a second consonant frequency band is performed on a voice signal to generate a first band-pass filtered signal and a second band-pass filtered signal, respectively. The voice signal, the first band-pass filtered signal, and the second band-pass filtered signal are divided into a plurality of voice frames, wherein each voice frame comprises N sampling signals and N is a positive integer. The energy of the sampling signals in a target voice frame is calculated to obtain an original voice sampling signal energy, a first consonant frequency band signal energy, and a second consonant frequency band signal energy. Whether the original voice sampling signal corresponding to the target voice frame is noise is judged according to the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy.

In an embodiment of the invention, the voice recognition method further includes the following steps. It is judged whether the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy, and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy each fall within a corresponding preset ratio range. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

In an embodiment of the invention, the voice recognition method further includes the following steps. A weighted average of the original voice sampling signal energies of a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals is calculated to obtain a noise signal energy weighted average. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is judged according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average.

In an embodiment of the invention, the voice recognition method further includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether a sum of a ratio of the energy of the second consonant frequency band signal to the energy of the original voice sampling signal and a ratio of the energy of the first consonant frequency band signal to the energy of the original voice sampling signal is greater than or equal to a preset sum.

In an embodiment of the invention, the voice recognition method further includes the following steps. A weighted average of the ratios of the first consonant frequency band signal energy to the original voice sampling signal energy is calculated over a plurality of voice frames whose original voice sampling signals were previously determined to be noise signals, so as to obtain a first consonant energy ratio weighted average. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is judged according to whether the ratio of the first consonant frequency band signal energy corresponding to the target voice frame to the original voice sampling signal energy is smaller than the first consonant energy ratio weighted average.

In an embodiment of the present invention, the weighted value of the ratio of the energy of the first consonant band signal corresponding to each original voice sample signal determined as the noise signal to the energy of the original voice sample signal changes with the difference of the length of the interval between the voice frame corresponding to each original voice sample signal determined as the noise signal and the target voice frame.

In an embodiment of the invention, the voice recognition method further includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether a ratio of the energy of the second consonant frequency band signal to the energy of the original voice sampling signal is greater than or equal to a predetermined ratio.

In an embodiment of the invention, the voice recognition method further includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sampling signal is greater than or equal to a lower limit value.

In an embodiment of the invention, the voice recognition method further includes the following steps. A first zero-crossing rate, a second zero-crossing rate, and a third zero-crossing rate of the original voice sampling signal are calculated, and the average zero-crossing rates of the original voice sampling signals of the target voice frame and N voice frames before the target voice frame are calculated to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate, wherein N is a positive integer. The first, second, and third zero-crossing rates are the numbers of times the original voice sampling signal crosses a first preset value, a second preset value, and a third preset value, respectively, within the target voice frame, and the second preset value is smaller than the first preset value and larger than the third preset value. Whether the original voice sampling signal corresponding to the target voice frame is a consonant signal is judged according to whether the first, second, and third average zero-crossing rates are each greater than or equal to the corresponding preset average zero-crossing rate.

In an embodiment of the invention, the voice recognition method further includes determining whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate.

Based on the above, the embodiment of the present invention determines whether the original speech sample signal corresponding to the target speech frame is noise according to the ratio of the energy of the first consonant frequency band signal to the energy of the second consonant frequency band signal, the ratio of the energy of the first consonant frequency band signal to the energy of the original speech sample signal, and the ratio of the energy of the second consonant frequency band signal to the energy of the original speech sample signal, so as to reduce the occurrence of misjudgment of the original speech sample signal as the consonant signal, and further improve the recognition accuracy of the consonant signal.

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

FIG. 1 is a diagram illustrating a voice recognition apparatus according to an embodiment of the present invention;

FIGS. 2A-2B are schematic diagrams illustrating a speech recognition method according to an embodiment of the invention;

FIGS. 3A-3B are schematic flow charts illustrating a voice recognition method according to another embodiment of the invention.

Description of reference numerals:

102: a band-pass filtering unit;

104: a processing unit;

S1: a speech signal;

S2: a first band-pass filtered signal;

S3: a second band-pass filtered signal;

S202 to S230, S302: steps.

Detailed Description

FIG. 1 is a schematic diagram of a voice recognition device according to an embodiment of the invention. Referring to FIG. 1, the voice recognition device includes a band-pass filtering unit 102 and a processing unit 104, and the band-pass filtering unit 102 is coupled to the processing unit 104. The band-pass filtering unit 102 can be implemented by a band-pass filter, and the processing unit 104 can be implemented by a central processing unit, but they are not limited thereto. The band-pass filtering unit 102 performs band-pass filtering of a first consonant frequency band and a second consonant frequency band on the voice signal S1 to generate a first band-pass filtered signal S2 and a second band-pass filtered signal S3, respectively. In this embodiment, the first and second consonant frequency bands are 2 kHz to 4 kHz and 4 kHz to 10 kHz, respectively, but are not limited thereto.

The processing unit 104 may sample the voice signal S1, the first band-pass filtered signal S2, and the second band-pass filtered signal S3, and divide the voice signal S1, the first band-pass filtered signal S2, and the second band-pass filtered signal S3 into a plurality of voice frames, wherein each voice frame may include N samples of the voice signal S1, N samples of the first band-pass filtered signal S2, and N samples of the second band-pass filtered signal S3. The processing unit 104 may further calculate the energy of the sampled signal in each speech frame to obtain the energy of the original speech sampled signal, the energy of the first consonant frequency band signal, and the energy of the second consonant frequency band signal, wherein the energy of the original speech sampled signal, the energy of the first consonant frequency band signal, and the energy of the second consonant frequency band signal respectively correspond to the energy of the sampled signal of the speech signal S1, the sampled signal of the first band-pass filtered signal S2, and the sampled signal of the second band-pass filtered signal S3 in the speech frame. After obtaining the original speech sampling signal energy, the first consonant frequency band signal energy and the second consonant frequency band signal energy, the processing unit 104 can determine whether the original speech sampling signal corresponding to each speech frame is noise according to the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original speech sampling signal energy and the ratio of the second consonant frequency band signal energy to the original speech sampling signal energy.
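The framing and energy steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not define the energy formula, so the sum of squared samples is assumed here; frames are assumed to be consecutive and non-overlapping; and the function names `split_into_frames`, `frame_energy`, and `frame_energies` are hypothetical. The band-pass filtering itself is taken as already done, so `s1`, `s2`, and `s3` are plain sample lists.

```python
from typing import List, Sequence, Tuple


def split_into_frames(samples: Sequence[float], n: int) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames of n samples.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    full_frames = len(samples) // n
    return [list(samples[i * n:(i + 1) * n]) for i in range(full_frames)]


def frame_energy(frame: Sequence[float]) -> float:
    """Frame energy as the sum of squared samples (an assumed definition)."""
    return sum(x * x for x in frame)


def frame_energies(
    s1: Sequence[float], s2: Sequence[float], s3: Sequence[float], n: int
) -> List[Tuple[float, float, float]]:
    """Return one (E, EB1, EB2) tuple per frame.

    s1 is the speech signal, s2 and s3 are the first and second
    band-pass filtered signals, all sampled at the same rate.
    """
    return [
        (frame_energy(a), frame_energy(b), frame_energy(c))
        for a, b, c in zip(
            split_into_frames(s1, n),
            split_into_frames(s2, n),
            split_into_frames(s3, n),
        )
    ]
```

Each returned tuple corresponds to one speech frame, so the tuple at index m supplies the three energies used by the later ratio tests for the mth frame.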

In detail, the processing unit 104 may determine whether a ratio of a first consonant band signal energy to a second consonant band signal energy, a ratio of the first consonant band signal energy to an original voice sampling signal energy, and a ratio of the second consonant band signal energy to the original voice sampling signal energy respectively fall within corresponding preset ratio ranges, and if the ratio of the first consonant band signal energy to the second consonant band signal energy, the ratio of the first consonant band signal energy to the original voice sampling signal energy, and the ratio of the second consonant band signal energy to the original voice sampling signal energy respectively fall within corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

For example, the processing unit 104 determines whether the original speech sample signal corresponding to a target speech frame (e.g. the mth speech frame, m is a positive integer) is noise or not by the following equation:

where EB1_m is the first consonant frequency band signal energy, EB2_m is the second consonant frequency band signal energy, and E_m is the original speech sample signal energy of the mth speech frame. When equations (1), (2), and (3) are all satisfied, the processing unit 104 determines that the original speech sample signal of the mth speech frame is a noise signal.
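The three-ratio noise test can be illustrated with the following Python sketch. The preset ratio ranges of equations (1) to (3) are not reproduced in this text, so the ranges are passed in as parameters and any values used with this function are hypothetical. The function name `is_noise_frame` and the handling of zero energies are also assumptions.

```python
from typing import Tuple

Range = Tuple[float, float]  # (lower bound, upper bound), both inclusive


def is_noise_frame(
    e: float,
    eb1: float,
    eb2: float,
    r12_range: Range,
    r1_range: Range,
    r2_range: Range,
) -> bool:
    """Return True when all three energy ratios fall inside their ranges.

    e   : original speech sample signal energy of the frame (E_m)
    eb1 : first consonant band signal energy (EB1_m)
    eb2 : second consonant band signal energy (EB2_m)
    The ranges are placeholders for the preset ratio ranges.
    """
    if e <= 0 or eb2 <= 0:
        # Assumption: undefined ratios are treated as "not noise".
        return False
    ratios = (eb1 / eb2, eb1 / e, eb2 / e)
    ranges = (r12_range, r1_range, r2_range)
    return all(lo <= r <= hi for r, (lo, hi) in zip(ratios, ranges))
```

Passing the ranges as arguments keeps the sketch independent of the specific thresholds, which the embodiment allows to differ.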

After determining that the original speech sampling signal of the target speech frame is a noise signal, the processing unit 104 further calculates a weighted average of the energy of the original speech sampling signals of a plurality of speech frames of the original speech sampling signal determined as the noise signal before the target speech frame to obtain a weighted average of the energy of the noise signal, and determines whether the original speech sampling signal corresponding to the target speech frame determined as the noise signal is a consonant signal according to whether the energy of the original speech sampling signal corresponding to the target speech frame is greater than the weighted average of the energy of the noise signal.

For example, the noise signal energy weighted average can be obtained by calculating the weighted average of the original speech sample signal energies of the 3 speech frames most recently determined to be noise signals before the target speech frame. Assuming that the target speech frame is the mth speech frame and that the three speech frames most recently judged as noise before it are the (m-10)th, (m-12)th, and (m-20)th speech frames, the noise signal energy weighted average AK_m corresponding to the mth speech frame can be represented by the following formula:

where E_{m-10}, E_{m-12}, and E_{m-20} are the original speech sample signal energies of the (m-10)th, (m-12)th, and (m-20)th speech frames, respectively, and a0, a1, and a2 are the weighting values corresponding to the (m-10)th, (m-12)th, and (m-20)th speech frames, respectively. The weighting values a0, a1, and a2 may be fixed values or variable values. For example, the weighting value of each speech frame whose original speech sample signal was determined to be a noise signal may vary with the length of the interval between that speech frame and the target speech frame. In the present embodiment, the weighting values a0, a1, and a2 may vary with the length of the interval between the respective speech frame and the mth speech frame. When the noise signal energy weighted average AK_m satisfies the following formula, the original speech sample signal corresponding to the mth speech frame can be judged to be a consonant signal:

E_m＞AK_m (5)
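A minimal Python sketch of the comparison in formula (5) follows. Formula (4) is not reproduced in this text, so dividing by the sum of the weights is an assumption, as is the inverse-distance rule in `interval_weights`, which is only one hypothetical way to make the weights depend on the interval to the target frame. All function names are illustrative.

```python
from typing import List, Sequence


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average normalised by the sum of the weights (assumed form)."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def interval_weights(noise_frame_indices: Sequence[int], target_index: int) -> List[float]:
    """Hypothetical rule: a noise frame closer to the target frame weighs more."""
    if any(i >= target_index for i in noise_frame_indices):
        raise ValueError("noise frames must precede the target frame")
    return [1.0 / (target_index - i) for i in noise_frame_indices]


def exceeds_noise_energy(
    e_m: float, noise_energies: Sequence[float], weights: Sequence[float]
) -> bool:
    """Formula (5): the frame energy E_m is strictly greater than AK_m."""
    return e_m > weighted_average(noise_energies, weights)
```

For the example in the text, `interval_weights([m - 10, m - 12, m - 20], m)` would give the (m-10)th frame the largest weight because it is nearest to the mth frame.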

In addition, the processing unit 104 can calculate a weighted average of the ratios of the first consonant frequency band signal energy to the original speech sample signal energy over a plurality of speech frames whose original speech sample signals were previously determined to be noise signals, so as to obtain a first consonant energy ratio weighted average, and can judge whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether the ratio of the first consonant frequency band signal energy corresponding to the target speech frame to the original speech sample signal energy is smaller than the first consonant energy ratio weighted average. For example, the first consonant energy ratio weighted average can be obtained as the weighted average of the ratios of the first consonant frequency band signal energy to the original speech sample signal energy of the 3 speech frames most recently determined to be noise signals before the target speech frame. If the three speech frames most recently judged as noise before the mth speech frame are the (m-10)th, (m-12)th, and (m-20)th speech frames, the first consonant energy ratio weighted average AF_m corresponding to the mth speech frame can be represented by the following formula:

where EB1_{m-10}, EB1_{m-12}, and EB1_{m-20} are the first consonant frequency band signal energies of the (m-10)th, (m-12)th, and (m-20)th speech frames, respectively, E_{m-10}, E_{m-12}, and E_{m-20} are the original speech sample signal energies of the (m-10)th, (m-12)th, and (m-20)th speech frames, respectively, and c0, c1, and c2 are the weighting values corresponding to the (m-10)th, (m-12)th, and (m-20)th speech frames, respectively. The weighting values c0, c1, and c2 may be fixed values or variable values. For example, the weighting value of the ratio of the first consonant frequency band signal energy to the original speech sample signal energy for each speech frame determined to be a noise signal may vary with the length of the interval between that speech frame and the target speech frame. In the present embodiment, the weighting values c0, c1, and c2 may vary with the length of the interval between the respective speech frame and the mth speech frame. When the first consonant energy ratio weighted average AF_m satisfies the following formula, the original speech sample signal corresponding to the mth speech frame can be judged to be a consonant signal:
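As a rough illustration of the ratio comparison just described (this is a code sketch, not the formula itself, which is not reproduced in this text), the Python below computes a first consonant energy ratio weighted average and tests whether the target frame's ratio is smaller. Normalising by the sum of the weights is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def first_consonant_ratio_average(
    eb1_values: Sequence[float],
    e_values: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted average of EB1/E over earlier frames judged as noise.

    Assumption: the average is normalised by the sum of the weights.
    """
    if not eb1_values or not (len(eb1_values) == len(e_values) == len(weights)):
        raise ValueError("inputs must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    ratios = [eb1 / e for eb1, e in zip(eb1_values, e_values)]
    return sum(r * w for r, w in zip(ratios, weights)) / total_weight


def below_noise_ratio_average(eb1_m: float, e_m: float, af_m: float) -> bool:
    """True when the target frame's EB1_m / E_m is smaller than AF_m."""
    return eb1_m / e_m < af_m
```

The weights play the same role as c0, c1, and c2 in the text and may be fixed or derived from the frame interval.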

In addition, the processing unit 104 can determine whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether the sum of the ratio of the second consonant frequency band signal energy to the original speech sample signal energy and the ratio of the first consonant frequency band signal energy to the original speech sample signal energy is greater than or equal to a preset sum. For example, for the mth speech frame, this determination can be expressed by the following formula:

In this embodiment, the preset sum is 1, but it is not limited thereto and may be adjusted to other values according to actual situations.

In addition, the processing unit 104 can also determine whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether the ratio of the energy of the second consonant frequency band signal to the energy of the original speech sample signal is greater than or equal to a predetermined ratio. For example, for the mth speech frame, the above-mentioned determination method can be expressed by the following formula:

In this embodiment, the predetermined ratio is 0.8, but it is not limited thereto. In some embodiments, the predetermined ratio may be another value, as shown in the following formula:

In the formula (7), the predetermined ratio is 0.35.

In addition, the processing unit 104 can also determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sampling signal is greater than or equal to the lower limit value. For example, for the mth speech frame, the above-mentioned determination method can be expressed by the following formula:

E_m≥50 (11)

In this embodiment, the lower limit value is 50, but it is not limited thereto, and in some embodiments the lower limit value may be adjusted according to actual situations.
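The three fixed-threshold tests described above (preset sum, predetermined ratio, and energy lower limit) can be sketched in Python as follows. The default values 1, 0.8, and 50 are the example values given in the text, and 0.35 can be passed as the alternative predetermined ratio. The function names and the treatment of non-positive energy are assumptions.

```python
def passes_sum_test(e: float, eb1: float, eb2: float, preset_sum: float = 1.0) -> bool:
    """EB2/E + EB1/E is greater than or equal to the preset sum."""
    if e <= 0:
        return False  # assumption: undefined ratios fail the test
    return (eb2 / e + eb1 / e) >= preset_sum


def passes_band2_ratio_test(e: float, eb2: float, preset_ratio: float = 0.8) -> bool:
    """EB2/E is greater than or equal to the predetermined ratio."""
    if e <= 0:
        return False  # assumption: undefined ratios fail the test
    return eb2 / e >= preset_ratio


def passes_energy_floor(e: float, lower_limit: float = 50.0) -> bool:
    """The frame energy E is greater than or equal to the lower limit."""
    return e >= lower_limit
```

Keeping each test in its own function makes it straightforward to combine only the subset of conditions that a given embodiment requires.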

In order to avoid the situation that the consonant signals may have different energy levels and the portion with smaller energy may be regarded as noise, in addition to the above-mentioned determining whether the original voice sample signal is a consonant signal according to the energy, the processing unit 104 may also determine whether the original voice sample signal is a consonant signal according to the zero crossing rate. The processing unit 104 may calculate a first zero-crossing rate, a second zero-crossing rate, and a third zero-crossing rate of the original voice sampling signal, calculate an average zero-crossing rate of the original voice sampling signal of the target voice frame and a plurality of voice frames before the target voice frame to obtain a first average zero-crossing rate, a second average zero-crossing rate, and a third average zero-crossing rate, and determine whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the first average zero-crossing rate, the second average zero-crossing rate, and the third average zero-crossing rate are respectively greater than or equal to their corresponding preset average zero-crossing rates. The first zero-crossing rate, the second zero-crossing rate and the third zero-crossing rate are respectively times of the original voice sampling signal passing through a first preset value, a second preset value and a third preset value in the target voice frame, wherein the second preset value is smaller than the first preset value and larger than the third preset value.
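Counting how often a frame crosses each of three preset values can be sketched in Python as shown below. The text does not state how a sample exactly equal to a preset value is handled, so this sketch assumes such a sample does not count as a crossing. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def level_crossings(frame: Sequence[float], level: float) -> int:
    """Count how often consecutive samples lie on opposite sides of level.

    Assumption: a sample exactly equal to level does not produce a crossing.
    """
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev - level) * (cur - level) < 0:
            count += 1
    return count


def three_level_crossings(
    frame: Sequence[float], first: float, second: float, third: float
) -> Tuple[int, int, int]:
    """Return crossing counts for the first, second and third preset values.

    The text requires third < second < first.
    """
    if not (third < second < first):
        raise ValueError("preset values must satisfy third < second < first")
    return (
        level_crossings(frame, first),
        level_crossings(frame, second),
        level_crossings(frame, third),
    )
```

With the second preset value set to 0, the middle count reduces to the ordinary zero-crossing count, while the first and third preset values measure crossings of a positive and a negative offset.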

For the mth speech frame, the original zero-crossing rate can be represented by the following formula:

where N is a positive integer representing the number of samples in the mth speech frame, mL is an amplitude threshold, and the remaining symbol denotes the original speech sample signal within the mth speech frame. The processing unit 104 may determine whether the original speech sample signal is a consonant signal according to whether the original zero-crossing rate is greater than or equal to a predetermined zero-crossing rate, for example according to the following formula:

The predetermined zero-crossing rate is not limited to 22, and in some embodiments its value may be adjusted according to the actual situation. In addition, the processing unit 104 may determine whether the original speech sample signal is a consonant signal according to a zero-crossing rate that additionally incorporates an energy condition of the original speech sample signal. This zero-crossing rate can be represented by the following formula:

wherein the reference quantity used in the above formula can be represented by the following formula:

In the present embodiment, the value of α_x is 0.5, but it is not limited thereto, and in some embodiments the value may be adjusted according to the actual situation. Adjusting the reference used to calculate the zero-crossing rate in this way allows a more accurate judgment of whether the original voice sampling signal is a consonant signal. The processing unit 104 can also determine whether the original speech sample signal is a consonant signal according to the average zero-crossing rate of a plurality of speech frames. For example, for the mth speech frame, whether the original speech sampling signal is a consonant signal can be determined according to the average value of the zero-crossing rates of the mth speech frame and the two most recent speech frames (i.e., the (m-1)th and (m-2)th speech frames), and the determination formula can be as follows:

As described in the above embodiments, the processing unit 104 can determine whether the original speech sample signal is a consonant signal according to at least one of the energy and the zero-crossing rate. That is, the processing unit 104 can combine at least one of the conditions of the above expressions to determine whether the original speech sample signal corresponding to the target speech frame is a consonant signal. For example, the processing unit 104 can determine whether expressions (5), (7), (9), (11), (13), (18), (19) and (20) are satisfied at the same time, and determine that the original speech sample signal corresponding to the target speech frame is a consonant signal if they are. For another example, the processing unit 104 may instead determine whether expressions (5), (8), (10), (11), (13), (18), (19) and (20) are satisfied at the same time, and determine that the original speech sample signal corresponding to the target speech frame is a consonant signal if they are.

Figs. 2A to 2B are schematic flow charts illustrating a voice recognition method according to an embodiment of the invention; please refer to Figs. 2A to 2B. As can be seen from the above embodiments, the voice recognition method of the voice recognition apparatus may include the following steps. First, band-pass filtering of a first consonant frequency band and a second consonant frequency band is performed on a speech signal to generate a first band-pass filtered signal and a second band-pass filtered signal, respectively (step S202). Next, the speech signal, the first band-pass filtered signal and the second band-pass filtered signal are divided into a plurality of speech frames (step S204), wherein each speech frame includes N sampling signals, and N is a positive integer. Then, the energy of the sampling signals in the target speech frame is calculated to obtain an original speech sample signal energy, a first consonant band signal energy and a second consonant band signal energy (step S206). Then, whether the original speech sample signal corresponding to the target speech frame is noise is determined according to the ratio of the first consonant band signal energy to the second consonant band signal energy, the ratio of the first consonant band signal energy to the original speech sample signal energy, and the ratio of the second consonant band signal energy to the original speech sample signal energy (step S208).
For example, it can be determined whether the ratio of the first consonant band signal energy to the second consonant band signal energy, the ratio of the first consonant band signal energy to the original voice sampling signal energy, and the ratio of the second consonant band signal energy to the original voice sampling signal energy respectively fall within their corresponding preset ratio ranges. If all three ratios fall within their corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.
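Steps S204 to S208 can be sketched in Python as follows. This is only an illustration under stated assumptions: the band-pass filtered signals of step S202 are taken as given inputs, energy is assumed to be the sum of squared samples, a trailing partial frame is dropped, frames with zero energy are treated as not matching, and all function names and ratio ranges are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_frames(signal: Sequence[float], n: int) -> List[List[float]]:
    """Step S204: divide a signal into frames of N samples (partial tail dropped)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [list(signal[i:i + n]) for i in range(0, len(signal) - n + 1, n)]


def frame_energy(frame: Sequence[float]) -> float:
    """Step S206: frame energy, assumed here to be the sum of squared samples."""
    return sum(s * s for s in frame)


def in_range(value: float, bounds: Tuple[float, float]) -> bool:
    low, high = bounds
    return low <= value <= high


def is_noise_frame(e_orig: float, e_first: float, e_second: float,
                   range_first_second: Tuple[float, float],
                   range_first_orig: Tuple[float, float],
                   range_second_orig: Tuple[float, float]) -> bool:
    """Step S208: noise when all three energy ratios fall within their preset ranges."""
    if e_orig <= 0 or e_second <= 0:
        # Ratios are undefined; this sketch simply reports "not noise".
        return False
    return (in_range(e_first / e_second, range_first_second)
            and in_range(e_first / e_orig, range_first_orig)
            and in_range(e_second / e_orig, range_second_orig))


# Hypothetical energies and ranges, for illustration only.
noise = is_noise_frame(10.0, 2.0, 4.0, (0.4, 0.6), (0.1, 0.3), (0.3, 0.5))
```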

Then, a weighted average of the original speech sample signal energies of a plurality of speech frames whose original speech sample signals were previously determined to be noise signals is calculated to obtain a noise signal energy weighted average (step S210), wherein the weighting value of each such speech frame may vary with the length of the interval between that speech frame and the target speech frame. Then, it is determined whether the energy of the original speech sample signal corresponding to the target speech frame is greater than the noise signal energy weighted average (step S212). If the energy of the original speech sample signal corresponding to the target speech frame is not greater than the noise signal energy weighted average, it is determined that the original speech sample signal corresponding to the target speech frame is a non-consonant signal (step S214). On the contrary, if the energy of the original speech sample signal corresponding to the target speech frame is greater than the noise signal energy weighted average, a weighted average of the ratios of the first consonant band signal energy to the original speech sample signal energy corresponding to a plurality of speech frames previously determined to be noise signals is calculated to obtain a first consonant energy ratio weighted average (step S216).
Then, it is determined whether the ratio of the first consonant band signal energy to the original speech sample signal energy corresponding to the target speech frame is smaller than the first consonant energy ratio weighted average (step S218), wherein the weighting value of the ratio corresponding to each speech frame determined to be a noise signal varies with the length of the interval between that speech frame and the target speech frame.
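The interval-dependent weighted averages of steps S210 and S216 can be sketched as follows. The text does not state the weighting function, so the reciprocal-of-distance weight used here is purely an assumption, as are the function name and the sample values.

```python
from typing import Sequence, Tuple


def weighted_noise_average(noise_frames: Sequence[Tuple[int, float]],
                           target_index: int) -> float:
    """Weighted average over frames previously judged to be noise.

    noise_frames holds (frame index, value) pairs, where the value is either the
    frame energy (step S210) or the first-band energy ratio (step S216).
    Assumed weighting: 1 / (distance to the target frame), so that frames
    closer to the target frame count more.
    """
    if not noise_frames:
        raise ValueError("no noise frames available")
    weights = []
    for index, _ in noise_frames:
        distance = target_index - index
        if distance <= 0:
            raise ValueError("noise frames must precede the target frame")
        weights.append(1.0 / distance)
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, noise_frames))
    return weighted_sum / sum(weights)


# Hypothetical history: frames 6 and 9 were judged noise, target frame is 10.
noise_energy_avg = weighted_noise_average([(6, 1.0), (9, 3.0)], target_index=10)
exceeds_noise = 5.0 > noise_energy_avg  # comparison of step S212
```

The same helper serves step S218 when the stored values are first consonant band energy ratios instead of energies.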

If the ratio of the first consonant band signal energy to the original speech sample signal energy corresponding to the target speech frame is not smaller than the first consonant energy ratio weighted average, the original speech sample signal corresponding to the target speech frame is a non-consonant signal (step S214). On the contrary, if this ratio is smaller than the first consonant energy ratio weighted average, it is determined whether the ratio of the second consonant band signal energy to the original speech sample signal energy is greater than or equal to a preset ratio (step S220). If the ratio of the second consonant band signal energy to the original voice sample signal energy is smaller than the preset ratio, the original voice sample signal corresponding to the target voice frame is a non-consonant signal (step S214). On the contrary, if the ratio of the second consonant band signal energy to the original voice sampling signal energy is greater than or equal to the preset ratio, it is determined whether the energy of the original voice sampling signal is greater than or equal to a lower limit value (step S222). If the energy of the original voice sampling signal is smaller than the lower limit value, the original voice sampling signal corresponding to the target voice frame is a non-consonant signal (step S214).

On the contrary, if the energy of the original voice sampling signal is greater than or equal to the lower limit value, the first zero-crossing rate, the second zero-crossing rate and the third zero-crossing rate of the original voice sampling signal are calculated, and the average zero-crossing rates of the original voice sampling signals of the target voice frame and the plurality of voice frames before the target voice frame are calculated to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate (step S224). The first zero-crossing rate, the second zero-crossing rate and the third zero-crossing rate are respectively the numbers of times the original voice sampling signal crosses a first preset value, a second preset value and a third preset value in the target voice frame, wherein the second preset value is smaller than the first preset value and larger than the third preset value. Then, whether the first average zero-crossing rate, the second average zero-crossing rate and the third average zero-crossing rate are respectively greater than or equal to the corresponding preset average zero-crossing rates is judged (step S226). If the first average zero-crossing rate, the second average zero-crossing rate and the third average zero-crossing rate are not all greater than or equal to the corresponding preset average zero-crossing rates, the original speech sample signal corresponding to the target speech frame is a non-consonant signal (step S214). On the contrary, if the first average zero-crossing rate, the second average zero-crossing rate and the third average zero-crossing rate are all greater than or equal to the corresponding preset average zero-crossing rates, it is determined whether the second zero-crossing rate is greater than or equal to the preset zero-crossing rate (step S228).
If the second zero-crossing rate is not greater than or equal to the predetermined zero-crossing rate, the original speech sample signal corresponding to the target speech frame is a non-consonant signal (step S214). On the contrary, if the second zero-crossing rate is greater than or equal to the predetermined zero-crossing rate, the original speech sample signal corresponding to the target speech frame is a consonant signal (step S230).
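The decision cascade of steps S212 to S230 can be summarized by the following Python sketch. The data classes, field names and all numeric values are hypothetical, except that 22 is the example preset zero-crossing rate mentioned in the description. Each failed check corresponds to the non-consonant result of step S214.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass
class FrameFeatures:
    energy: float                         # original speech sample signal energy
    first_ratio: float                    # first consonant band energy / original energy
    second_ratio: float                   # second consonant band energy / original energy
    zcr: Tuple[int, int, int]             # first, second, third zero-crossing rates
    avg_zcr: Tuple[float, float, float]   # averages over target and preceding frames


@dataclass
class Thresholds:
    noise_energy_avg: float               # from step S210
    first_ratio_avg: float                # from step S216
    preset_ratio: float
    energy_lower_limit: float
    preset_avg_zcr: Tuple[float, float, float]
    preset_zcr: float


def is_consonant(f: FrameFeatures, t: Thresholds) -> bool:
    """Return True for a consonant signal (S230), False for non-consonant (S214)."""
    if not f.energy > t.noise_energy_avg:        # step S212
        return False
    if not f.first_ratio < t.first_ratio_avg:    # step S218
        return False
    if not f.second_ratio >= t.preset_ratio:     # step S220
        return False
    if not f.energy >= t.energy_lower_limit:     # step S222
        return False
    if not all(a >= p for a, p in zip(f.avg_zcr, t.preset_avg_zcr)):  # step S226
        return False
    return f.zcr[1] >= t.preset_zcr              # step S228


# Hypothetical values chosen only to exercise every branch.
limits = Thresholds(4.0, 0.3, 0.4, 5.0, (15.0, 20.0, 15.0), 22)
frame = FrameFeatures(10.0, 0.2, 0.5, (20, 25, 20), (18.0, 24.0, 18.0))
result = is_consonant(frame, limits)
quiet = is_consonant(replace(frame, energy=3.0), limits)
```

Ordering the checks from the energy tests to the zero-crossing tests mirrors the flow of Figs. 2A to 2B, so zero-crossing rates only matter for frames that pass the energy checks.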

Figs. 3A to 3B are schematic flow charts illustrating a voice recognition method according to an embodiment of the invention; please refer to Figs. 3A to 3B. The difference between this embodiment and the embodiment of Figs. 2A to 2B is that, after determining in step S212 that the energy of the original speech sample signal corresponding to the target speech frame is greater than the noise signal energy weighted average, this embodiment then determines whether the sum of the ratio of the second consonant band signal energy to the original speech sample signal energy and the ratio of the first consonant band signal energy to the original speech sample signal energy is greater than or equal to a preset sum value (step S302). If this sum is smaller than the preset sum value, the original speech sample signal corresponding to the target speech frame is a non-consonant signal. On the contrary, if this sum is greater than or equal to the preset sum value, the process proceeds directly to step S220 to determine whether the ratio of the second consonant band signal energy to the original voice sampling signal energy is greater than or equal to the preset ratio, and then continues with the subsequent steps of the voice recognition method as in the embodiment of Figs. 2A to 2B.
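The energy-related part of the Figs. 3A to 3B variant can be sketched as follows: step S302 takes the place of steps S216 and S218, after which the flow rejoins step S220. The function name and all numeric values are assumptions made for illustration.

```python
def passes_fig3_energy_stage(energy: float, first_ratio: float, second_ratio: float,
                             noise_energy_avg: float, preset_sum: float,
                             preset_ratio: float) -> bool:
    """True when the frame should continue to step S222, False for non-consonant."""
    if not energy > noise_energy_avg:                 # step S212
        return False
    if not first_ratio + second_ratio >= preset_sum:  # step S302
        return False
    return second_ratio >= preset_ratio               # step S220


# Hypothetical values: the band ratios sum to 0.7 against a preset sum of 0.6.
keeps_going = passes_fig3_energy_stage(10.0, 0.2, 0.5, 4.0, 0.6, 0.4)
```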

In summary, the embodiments of the present invention can determine whether the original speech sample signal corresponding to the target speech frame is a consonant signal by combining at least one of the conditions of the above equations, so as to improve the recognition accuracy of the consonant signal. For example, whether the original voice sampling signal corresponding to the target voice frame is noise can be judged according to the ratio of the energy of the first consonant frequency band signal to the energy of the second consonant frequency band signal, the ratio of the energy of the first consonant frequency band signal to the energy of the original voice sampling signal and the ratio of the energy of the second consonant frequency band signal to the energy of the original voice sampling signal, so that the situation that the original voice sampling signal is judged to be the consonant signal by mistake is reduced, and the identification accuracy of the consonant signal is improved.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech recognition apparatus, comprising:

a band-pass filtering unit for performing band-pass filtering of a first consonant frequency band and a second consonant frequency band on a voice signal to generate a first band-pass filtering signal and a second band-pass filtering signal respectively; and

a processing unit coupled to the band-pass filtering unit for dividing the speech signal, the first band-pass filtered signal and the second band-pass filtered signal into a plurality of speech frames, wherein each speech frame includes N sampling signals, N is a positive integer, calculating the energy of the sampling signal in the target speech frame to obtain an original speech sampling signal energy, a first consonant frequency band signal energy and a second consonant frequency band signal energy, and determining whether the original speech sampling signal corresponding to the target speech frame is noise according to the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original speech sampling signal energy and the ratio of the second consonant frequency band signal energy to the original speech sampling signal energy,

the processing unit further determines whether a ratio of the first consonant band signal energy to the second consonant band signal energy, a ratio of the first consonant band signal energy to the original voice sampling signal energy, and a ratio of the second consonant band signal energy to the original voice sampling signal energy are within a corresponding preset ratio range, respectively, and if the ratio of the first consonant band signal energy to the second consonant band signal energy, the ratio of the first consonant band signal energy to the original voice sampling signal energy, and the ratio of the second consonant band signal energy to the original voice sampling signal energy are within a corresponding preset ratio range, the original voice sampling signal of the target voice frame is a noise signal.

2. The speech recognition device of claim 1, wherein the processing unit further calculates a weighted average of energy of original speech samples of speech frames previously determined as noise signals to obtain a weighted average of energy of noise signals, and determines whether the original speech sample corresponding to the target speech frame is a consonant signal according to whether the energy of the original speech sample corresponding to the target speech frame is greater than the weighted average of energy of noise signals.

3. The speech recognition device of claim 2, wherein the weighting value corresponding to each speech frame of the original speech sample signal determined as a noise signal varies according to the length of the interval between the speech frame of the original speech sample signal determined as a noise signal and the target speech frame.

4. The apparatus of claim 2, wherein the processing unit further determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether a sum of a ratio of the energy of the second consonant band signal to the energy of the original speech sample signal and a ratio of the energy of the first consonant band signal to the energy of the original speech sample signal is greater than or equal to a predetermined sum.

5. The apparatus of claim 4, wherein the processing unit further calculates a weighted average of ratios of energy of the first consonant band signal to energy of the original voice sample signal corresponding to a plurality of voice frames previously determined as noise signals to obtain a weighted average of a first consonant energy ratio, and determines whether the original voice sample signal corresponding to the target voice frame is a consonant signal according to whether the ratio of the energy of the first consonant band signal to the energy of the original voice sample signal corresponding to the target voice frame is smaller than the weighted average of the first consonant energy ratio.

6. The speech recognition device of claim 5, wherein the weighted value of the ratio of the energy of the first consonant band signal to the energy of the original speech sample signal corresponding to each speech frame of the original speech sample signal determined as a noise signal varies with the length of the interval between the speech frame of the original speech sample signal determined as a noise signal and the target speech frame.

7. The apparatus of claim 5, wherein the processing unit further determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether a ratio of the energy of the second consonant band signal to the energy of the original speech sample signal is greater than or equal to a predetermined ratio.

8. The apparatus of claim 7, wherein the processing unit further determines whether the original voice sample signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sample signal is greater than or equal to a lower threshold.

9. The speech recognition device of claim 8, wherein the processing unit further calculates a first zero-crossing rate, a second zero-crossing rate and a third zero-crossing rate of the original speech sample signal, calculates average zero-crossing rates of the original speech sample signals of the target speech frame and the speech frames before the target speech frame to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate, and determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether the first average zero-crossing rate, the second average zero-crossing rate and the third average zero-crossing rate are respectively greater than or equal to their corresponding preset average zero-crossing rates, wherein the first zero-crossing rate, the second zero-crossing rate and the third zero-crossing rate are respectively the numbers of times the original speech sample signal crosses a first preset value, a second preset value and a third preset value in the target speech frame, and wherein the second preset value is smaller than the first preset value and larger than the third preset value.

10. The apparatus of claim 9, wherein the processing unit further determines whether the original speech sample signal corresponding to the target speech frame is a consonant signal according to whether the second zero-crossing rate is greater than or equal to a predetermined zero-crossing rate.

11. A speech recognition method, comprising:

performing band-pass filtering on a first consonant frequency band and a second consonant frequency band on a voice signal to respectively generate a first band-pass filtering signal and a second band-pass filtering signal;

dividing the voice signal, the first band-pass filtered signal and the second band-pass filtered signal into a plurality of voice frames, wherein each voice frame comprises N sampling signals, and N is a positive integer;

calculating the energy of the sampling signal in the target voice frame to obtain the energy of an original voice sampling signal, the energy of a first consonant frequency band signal and the energy of a second consonant frequency band signal; and

judging whether the original voice sampling signal corresponding to the target voice frame is noise or not according to the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy,

the step of judging whether the original voice sampling signal corresponding to the target voice frame is noise or not comprises the following steps:

judging whether the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy respectively fall within corresponding preset ratio ranges; and

if the ratio of the first consonant frequency band signal energy to the second consonant frequency band signal energy, the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy and the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy respectively fall within the corresponding preset ratio ranges, the original voice sampling signal of the target voice frame is a noise signal.

12. The speech recognition method of claim 11, further comprising:

calculating an original voice sampling signal energy weighted average value of a plurality of voice frames of the original voice sampling signal which is judged as the noise signal before to obtain a noise signal energy weighted average value; and

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal or not according to whether the original voice sampling signal energy corresponding to the target voice frame is greater than the noise signal energy weighted average value or not.

13. The speech recognition method of claim 12, wherein the weighting value corresponding to each of the speech frames of the original speech sample signal determined as a noise signal varies according to the length of the interval between the speech frame of the original speech sample signal determined as a noise signal and the target speech frame.

14. The speech recognition method of claim 12, further comprising:

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal or not according to whether the sum of the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy and the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy is greater than or equal to a preset sum value or not.

15. The speech recognition method of claim 14, further comprising:

calculating a weighted average value of the ratio of the first consonant frequency band signal energy to the original voice sampling signal energy corresponding to a plurality of voice frames of the original voice sampling signal which is judged as a noise signal before to obtain a first consonant energy proportion weighted average value; and

judging whether the original speech sampling signal corresponding to the target speech frame is a consonant signal according to whether the ratio of the energy of the first consonant frequency band signal corresponding to the target speech frame to the energy of the original speech sampling signal is smaller than the first consonant energy proportional weighted average value.

16. The speech recognition method of claim 15, wherein the weighted value of the ratio of the energy of the first consonant band signal to the energy of the original speech sample signal corresponding to each original speech sample signal determined as a noise signal varies with the length of the interval between the speech frame corresponding to each original speech sample signal determined as a noise signal and the target speech frame.

17. The speech recognition method of claim 15, further comprising:

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal or not according to whether the ratio of the second consonant frequency band signal energy to the original voice sampling signal energy is greater than or equal to a preset ratio or not.

18. The speech recognition method of claim 17, further comprising:

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal according to whether the energy of the original voice sampling signal is greater than or equal to a lower limit value.

19. The speech recognition method of claim 18, further comprising:

calculating a first zero-crossing rate, a second zero-crossing rate and a third zero-crossing rate of the original voice sampling signal, and calculating average zero-crossing rates of the original voice sampling signals of the target voice frame and a plurality of N voice frames before the target voice frame to obtain a first average zero-crossing rate, a second average zero-crossing rate and a third average zero-crossing rate, wherein N is a positive integer, the first zero-crossing rate, the second zero-crossing rate and the third zero-crossing rate are respectively the numbers of times the original voice sampling signal crosses a first preset value, a second preset value and a third preset value in the target voice frame, and the second preset value is smaller than the first preset value and larger than the third preset value; and

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal or not according to whether the first average zero-crossing rate, the second average zero-crossing rate and the third average zero-crossing rate are respectively greater than or equal to the corresponding preset average zero-crossing rates or not.

20. The speech recognition method of claim 19, further comprising:

judging whether the original voice sampling signal corresponding to the target voice frame is a consonant signal or not according to whether the second zero-crossing rate is greater than or equal to a preset zero-crossing rate or not.