CN108986844B - Speech endpoint detection method based on speaker speech characteristics - Google Patents

Speech endpoint detection method based on speaker speech characteristics

Info

Publication number
CN108986844B
CN108986844B (application CN201810887035.XA)
Authority
CN
China
Prior art keywords
voice
frame
judgment area
sound
background noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810887035.XA
Other languages
Chinese (zh)
Other versions
CN108986844A (en)
Inventor
孝大宇
张淑蕾
王超
康雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201810887035.XA priority Critical patent/CN108986844B/en
Publication of CN108986844A publication Critical patent/CN108986844A/en
Application granted granted Critical
Publication of CN108986844B publication Critical patent/CN108986844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice endpoint detection method based on the voice characteristics of a speaker. The method comprises the following steps: 100, pre-acquiring the voice characteristics of at least two persons; 101, collecting and preprocessing a voice signal of at least two speakers to obtain a background noise signal; 102, windowing the voice signal and the background noise signal respectively to obtain sound frames and background noise frames; 103, acquiring the short-time energy-zero product values of the sound frames and background noise frames, and a threshold; 104, applying the threshold to all sound frames to obtain the voiced segment of the voice signal; 105, updating the threshold according to the voice characteristics of the voiced segment and acquiring the endpoints of the voice signal. The method builds speaker recognition on top of traditional voice endpoint detection: while taking noise into account, it extracts and compares the speakers' voice characteristics, making voice endpoint detection and multi-speaker recognition more accurate.

Description

Speech endpoint detection method based on speaker speech characteristics
Technical Field
The invention relates to the technical field of voice information processing and mode recognition, in particular to a voice endpoint detection method based on voice characteristics of a speaker.
Background
Voice endpoint detection is an important link in voice analysis, voice synthesis, voice coding, and speaker recognition. In speech recognition and speaker recognition, the voiced and unvoiced segments of a speech signal are usually separated by an endpoint detection algorithm, and recognition is then performed on the voiced segments according to certain speech characteristics. Correct and effective voice endpoint detection reduces the amount of computation, shortens processing time, eliminates the noise interference of silent segments, and improves the accuracy of speech recognition and speaker recognition. Common voice endpoint detection methods include short-time average energy, short-time average zero-crossing rate, and short-time energy-zero product.
Under low signal-to-noise ratio conditions, traditional threshold-based voice endpoint detection is affected by noise and loses accuracy. In multi-speaker recognition scenes in particular, the utterances of different speakers are sometimes closely connected, so a voiced segment detected by general Voice Activity Detection (VAD) may contain several speakers, and the voice segments of the individual speakers are then difficult to detect.
In a multi-speaker recognition scene, the voiced segments detected by the traditional threshold-based voice endpoint detection method may thus contain different speakers, which degrades the accuracy of later speaker recognition; correct voice endpoint detection is a key factor in improving that accuracy. A method that detects voice endpoints more accurately in complex multi-speaker scenes is therefore needed, so that the accuracy of later multi-speaker recognition improves.
Disclosure of Invention
Technical problem to be solved
In order to solve the above problems in the prior art, the present invention provides a method for detecting a speech endpoint based on speech characteristics of a speaker.
(II) technical scheme
In order to achieve this purpose, the invention mainly adopts the following technical scheme; the method comprises the following steps:
100. obtaining the voice characteristics of at least two persons in advance from voice information samples;
101. collecting a voice signal containing the speech of at least two persons, preprocessing the voice signal, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
102. windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
103. acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (1) and formula (2), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{1}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{2}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C;
104. following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
105. acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
otherwise, increasing the threshold by a preset value to update it, and executing step 104 to obtain an updated voiced segment according to the updated threshold;
and executing step 105 for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and compare their voice characteristics, repeating the update up to a preset number of times until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, and taking the endpoints of the updated voiced segment as the endpoints of the voice signal.
Optionally, the voice information samples include:
at least two pieces of voice information, wherein each piece lasts more than one minute and each piece is the speech of a different person;
and acquiring a Gaussian mixture model of each piece of voice information to obtain the voice characteristics corresponding to it.
Optionally, the pre-processing comprises:
filtering the voice signal, with an upper cut-off frequency of 3400 Hz and a lower cut-off frequency of 60-100 Hz.
Optionally, the windowing process comprises:
in step 102, dividing the speech signal into at least two sound frames using a Hamming window.
Alternatively,
the frame length of each sound frame corresponding to the voice signal is 10 ms-30 ms, and the frame shift between adjacent sound frames is half of the frame length;
the frame length of each background noise frame corresponding to the background noise signal is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
Alternatively,
in step 105, the duration of the first judgment area and the duration of the second judgment area are both 1 s-3 s.
Alternatively,
in step 105, a gaussian mixture model is obtained for the first judgment area and the second judgment area of the voiced sound segment; the Gaussian mixture model of the first judgment area is the voice characteristic of the first judgment area;
the Gaussian mixture model of the second judgment area is the voice characteristic of the second judgment area.
Alternatively,
in step 105, the preset number of times of repeated updating is 10 times.
Alternatively,
in step 105, the preset value by which the threshold is increased is 5% of the threshold before updating.
(III) advantageous effects
The invention has the beneficial effects that:
the method combines speaker recognition based on the traditional voice endpoint detection, extracts and compares the characteristics of the speakers while considering the noise influence, so that the voice endpoint detection is more accurate, and the recognition of multiple speakers is more accurate.
Drawings
FIG. 1 is a flowchart illustrating a method for detecting a voice endpoint based on characteristics of a speaker according to an embodiment of the present invention;
FIG. 2(a) is a time domain diagram of speaker A pronunciation "0" according to an embodiment of the present invention;
FIG. 2(b) is a diagram of the frequency spectrum of speaker A pronunciation "0" according to an embodiment of the present invention;
FIG. 2(c) is a time domain diagram of the utterance "0" of speaker B according to an embodiment of the present invention;
FIG. 2(d) is a diagram of the frequency spectrum of the speaker B's pronunciation of "0" according to an embodiment of the present invention;
FIG. 3(a) is a diagram of a speaker's voice signal according to an embodiment of the present invention;
FIG. 3(b) is a short-time energy zero-product diagram of a speaker voice signal according to an embodiment of the present invention;
FIG. 3(c) is the voice endpoint detection result of the short-time energy-zero product method for the voice signal of the speaker according to an embodiment of the present invention;
FIG. 4 is a functional block diagram of speaker recognition according to an embodiment of the present invention;
fig. 5 is a flowchart of voice endpoint detection according to an embodiment of the present invention.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
As shown in fig. 1, the method of the present invention comprises the following steps:
100. Obtaining the voice characteristics of at least two persons in advance from voice information samples;
the voice information samples include at least two pieces of voice information, where each piece lasts more than one minute and each piece is the speech of a different person;
a Gaussian mixture model of each piece of voice information is acquired to obtain the corresponding voice characteristics. For example, this embodiment takes speaker A and speaker B: their speech information is collected in advance, and Gaussian mixture models of the two speakers' voice information are obtained and used as their voice characteristics;
as shown in fig. 2(a) and 2(b), a time domain diagram and a spectrogram of speaker a uttering "0" respectively;
as shown in fig. 2(c) and 2(d), a time domain diagram and a spectrogram of speaker B utterance "0", respectively;
Note that the present invention does not limit the content of the registered voice information; this embodiment is only illustrative.
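As an illustration of step 100, the sketch below fits one Gaussian mixture model per enrolled speaker. It assumes MFCC frame features and the librosa and scikit-learn libraries, none of which the invention prescribes; the function name train_speaker_gmm, the 8 kHz sampling rate, and the 16 mixture components are illustrative choices only.

```python
# Sketch of step 100 (enrollment), under the assumptions named above.
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(wav_path, sr=8000, n_components=16):
    """Fit a GMM to the MFCC frames of one enrollment recording (> 1 minute)."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T   # (frames, 13)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc)
    return gmm          # the GMM serves as this speaker's voice characteristics

# e.g. speaker_models = {"A": train_speaker_gmm("speaker_a.wav"),
#                        "B": train_speaker_gmm("speaker_b.wav")}
```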
101. Collecting a voice signal containing the speech of at least two persons, preprocessing it, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
for example, since there is usually a silent region at the beginning of a recording, the first 100 ms of the signal are usually taken for background-noise analysis;
further, filtering is performed on the voice signal;
for example, the upper cut-off frequency of the filtering is 3400 Hz, and the lower cut-off frequency is 60-100 Hz.
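A minimal sketch of this preprocessing, assuming an 8 kHz sampling rate, a lower cut-off of 80 Hz (inside the stated 60-100 Hz range), and a 4th-order Butterworth band-pass filter from scipy; the patent fixes only the cut-off band, not the filter type or order.

```python
# Sketch of step 101: band-pass filtering plus extraction of the 0-100 ms noise segment.
from scipy.signal import butter, filtfilt

def preprocess(signal, fs=8000, low_hz=80.0, high_hz=3400.0):
    nyq = fs / 2.0
    b, a = butter(4, [low_hz / nyq, high_hz / nyq], btype="band")
    filtered = filtfilt(b, a, signal)       # zero-phase band-pass filtering
    noise = filtered[: int(0.1 * fs)]       # first 100 ms -> background noise signal
    return filtered, noise
```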
102. Windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
for example, a speech signal is time-varying but short-time stationary, so it is usually divided into a number of sound frames before its characteristic parameters are obtained. In this embodiment, a window function is applied to the speech signal; the window function is a Hamming window, given by formula (1) below:
$$w(n)=\begin{cases}0.54-0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0\le n\le N-1\\[4pt] 0, & \text{otherwise}\end{cases} \tag{1}$$
wherein N in formula (1) represents the length of the window;
for example, the frame length of each voice frame corresponding to the voice signal after windowing is 10ms to 30ms, and the frame shift between adjacent voice frames is half of the frame length.
The frame length of each background noise frame corresponding to the background noise signal after windowing is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
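The windowing of step 102 can be sketched as follows; the 20 ms frame length is one choice inside the stated 10-30 ms range, and frame_signal is an illustrative helper name.

```python
# Sketch of step 102: split a signal into 50%-overlapping Hamming-windowed frames.
import numpy as np

def frame_signal(signal, fs=8000, frame_ms=20):
    frame_len = int(fs * frame_ms / 1000)        # samples per frame
    hop = frame_len // 2                         # frame shift = half the frame length
    window = np.hamming(frame_len)               # formula (1): 0.54 - 0.46*cos(2*pi*n/(N-1))
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```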
103. Acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (2) and formula (3), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{2}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{3}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C; for example, C = 1.2 as an empirical value;
in this embodiment, since the windowing in step 102 converts the original speech signal into corresponding sound frames, the background noise is likewise processed with 10 ms as one frame;
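Combining formulas (2) and (3), a sketch of the per-frame short-time energy-zero product and of the initial threshold with the empirical C = 1.2, taking frames as produced by frame_signal above:

```python
# Sketch of step 103: short-time energy-zero product EZ = E * Z and initial threshold.
import numpy as np

def energy_zero_product(frames):
    energy = np.mean(frames ** 2, axis=1)        # average energy E, formula (2)
    signs = np.sign(frames)
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)  # crossings, formula (3)
    return energy * zcr

def initial_threshold(noise_frames, c=1.2):
    return c * np.mean(energy_zero_product(noise_frames))
```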
104. Following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
for example, fig. 3(a) shows a speech signal of a speaker collected in this embodiment, fig. 3(b) shows its short-time energy-zero product, and fig. 3(c) shows the voice endpoint detection result of the short-time energy-zero product method;
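Step 104 then reduces to a scan over the frame-wise energy-zero product values, as in this sketch (detect_segment is an illustrative helper; it returns inclusive start and end frame indices):

```python
# Sketch of step 104: locate the voiced segment from the EZ values and the threshold.
import numpy as np

def detect_segment(ez_values, threshold):
    above = ez_values > threshold
    starts = np.flatnonzero(above)
    if starts.size == 0:
        return None                              # no frame exceeds the threshold
    start = starts[0]                            # first frame above the threshold
    below_after = np.flatnonzero(~above[start:])
    end = start + below_after[0] - 1 if below_after.size else len(ez_values) - 1
    return start, end
```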
105. Acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
as shown in FIG. 4, FIG. 4 is a block diagram illustrating the principle of using speaker recognition in the present embodiment
Specifically, for example, the first judgment area consists of the sound frames corresponding to 1 s-3 s after the start frame of the voiced segment, and its voice characteristics are obtained as the Gaussian mixture model of the first judgment area;
the second judgment area consists of the sound frames corresponding to 1 s-3 s before the end frame of the voiced segment, and its voice characteristics are obtained as the Gaussian mixture model of the second judgment area;
Specifically, as shown in fig. 5, if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
for example, when the voice characteristics of the first judgment area match those of speaker A and the voice characteristics of the second judgment area also match those of speaker A, the endpoints of the voiced segment are taken as the endpoints of the voice signal. The pre-acquired voice characteristics here are only illustrative, not limiting; both judgment areas could equally match the voice characteristics of speaker B.
Otherwise, the threshold is increased by a preset value (for example, 5% of the threshold before updating) to update it, and step 104 is executed to obtain an updated voiced segment according to the updated threshold;
for example, when the voice characteristics of the first judgment area match those of speaker A but the voice characteristics of the second judgment area match those of speaker B, the threshold is updated and the updated voiced segment is acquired according to the updated threshold;
Step 105 is then executed for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and to compare their voice characteristics; the update is repeated up to a preset number of times (for example, 10 times) until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, after which the endpoints of the updated voiced segment are taken as the endpoints of the voice signal.
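Tying the pieces together, a sketch of the whole step-105 loop under the assumptions above: each judgment area is scored against every enrolled GMM, the highest average log-likelihood (GaussianMixture.score) decides the match, and the threshold grows by 5% per round for at most 10 rounds. The feature helper mfcc_of (frame block to MFCC matrix) and the likelihood-based matching rule are assumptions for illustration, not requirements of the patent.

```python
# Sketch of step 105: iterative threshold update driven by speaker matching.
def detect_endpoints(frames, ez, threshold, speaker_models, mfcc_of,
                     judge_frames=100, max_updates=10):   # ~1 s of 10 ms hops
    def best_speaker(region):
        feats = mfcc_of(region)                  # hypothetical frames -> MFCC helper
        return max(speaker_models, key=lambda s: speaker_models[s].score(feats))

    start = end = None
    for _ in range(max_updates):
        seg = detect_segment(ez, threshold)      # step 104 on the current threshold
        if seg is None:
            break
        start, end = seg
        first = best_speaker(frames[start : start + judge_frames])
        second = best_speaker(frames[max(start, end - judge_frames) : end + 1])
        if first == second:                      # both judgment areas match one speaker
            break
        threshold *= 1.05                        # raise the threshold by 5%
    return (start, end) if start is not None else None
```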
The method builds speaker recognition on top of traditional voice endpoint detection: while taking noise into account, it extracts and compares the speakers' characteristics, making voice endpoint detection and multi-speaker recognition more accurate.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A voice endpoint detection method based on speaker voice characteristics is characterized by comprising the following steps:
100. obtaining the voice characteristics of at least two persons in advance from voice information samples;
101. collecting a voice signal containing the speech of at least two persons, preprocessing the voice signal, and taking the 0-100 ms portion of the preprocessed voice signal as the background noise signal;
102. windowing the preprocessed voice signal and the preprocessed background noise signal respectively to obtain at least two sound frames corresponding to the voice signal and at least one background noise frame corresponding to the background noise signal;
103. acquiring the short-time energy-zero product value of each sound frame, the short-time energy-zero product value of each background noise frame, and a threshold;
acquiring the average energy E and the short-time average zero-crossing rate Z by the following formula (1) and formula (2), respectively, for each sound frame and each background noise frame:
$$E = \frac{1}{N}\sum_{k=1}^{N} s_w^2(k) \tag{1}$$

$$Z = \frac{1}{2}\sum_{k=2}^{N}\left|\operatorname{sgn}\left[s_w(k)\right]-\operatorname{sgn}\left[s_w(k-1)\right]\right| \tag{2}$$

where $N$ denotes the length of the window and $s_w(k)$ denotes the windowed speech signal;
wherein the short-time energy-zero product is the product of the average energy E and the short-time average zero-crossing rate Z;
the threshold is the product of the average of the short-time energy-zero product values of all background noise frames and a constant C;
104. following the order of all sound frames in the voice signal, taking the first sound frame whose short-time energy-zero product value is larger than the threshold as the start frame, and taking the first sound frame after the start frame whose short-time energy-zero product value is smaller than the threshold as the end frame;
all sound frames from the start frame to the end frame constitute the voiced segment of the voice signal;
105. acquiring the voice characteristics of a first judgment area and of a second judgment area in the voiced segment, updating the threshold according to these voice characteristics, and acquiring the endpoints of the voice signal;
the first judgment area is at least one sound frame after the start frame of the voiced segment, and the voice characteristics of the first judgment area are obtained;
the second judgment area is at least one sound frame before the end frame of the voiced segment, and the voice characteristics of the second judgment area are obtained;
if the voice characteristics of the first judgment area and of the second judgment area both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, the endpoints of the voiced segment are taken as the endpoints of the voice signal;
otherwise, increasing the threshold by a preset value to update it, and executing step 104 to obtain an updated voiced segment according to the updated threshold;
and executing step 105 for the updated voiced segment to obtain the correspondingly updated first and second judgment areas and compare their voice characteristics, repeating the update up to a preset number of times until the updated first and second judgment areas both match the voice characteristics of the same person among the pre-acquired voice characteristics of at least two persons, and taking the endpoints of the updated voiced segment as the endpoints of the voice signal.
2. The method of claim 1, wherein the speech information samples comprise:
at least two pieces of voice information, wherein each piece lasts more than one minute and each piece is the speech of a different person;
and acquiring a Gaussian mixture model of each piece of voice information to obtain the voice characteristics corresponding to it.
3. The method of claim 2, wherein pre-processing comprises:
filtering the voice signal, with an upper cut-off frequency of 3400 Hz and a lower cut-off frequency of 60-100 Hz.
4. The method of claim 3, wherein the windowing comprises:
in step 102, the speech signal is divided into at least two sound frames using a Hamming window.
5. The method of claim 4,
the frame length of each sound frame corresponding to the voice signal is 10 ms-30 ms, and the frame shift between adjacent sound frames is half of the frame length;
the frame length of each background noise frame corresponding to the background noise signal is 10ms, and the frame shift between adjacent background noise frames is half of the frame length.
6. The method of claim 5,
in step 105, the duration of the first judgment area and the duration of the second judgment area are both 1 s-3 s.
7. The method of claim 6,
in step 105, a gaussian mixture model is obtained for the first judgment area and the second judgment area of the voiced sound segment; the Gaussian mixture model of the first judgment area is the voice characteristic of the first judgment area;
the Gaussian mixture model of the second judgment area is the voice characteristic of the second judgment area.
8. The method of claim 7,
in step 105, the preset number of times of repeated updating is 10 times.
9. The method of claim 8,
in step 105, the preset value by which the threshold is increased is 5% of the threshold before updating.
CN201810887035.XA 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics Active CN108986844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810887035.XA CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810887035.XA CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Publications (2)

Publication Number Publication Date
CN108986844A CN108986844A (en) 2018-12-11
CN108986844B true CN108986844B (en) 2020-08-28

Family

ID=64554966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810887035.XA Active CN108986844B (en) 2018-08-06 2018-08-06 Speech endpoint detection method based on speaker speech characteristics

Country Status (1)

Country Link
CN (1) CN108986844B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616097B (en) * 2019-01-04 2024-05-10 平安科技(深圳)有限公司 Voice data processing method, device, equipment and storage medium
CN111613250B (en) * 2020-07-06 2023-07-18 泰康保险集团股份有限公司 Long voice endpoint detection method and device, storage medium and electronic equipment
CN112820292B (en) * 2020-12-29 2023-07-18 平安银行股份有限公司 Method, device, electronic device and storage medium for generating meeting summary

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100737358B1 (en) * 2004-12-08 2007-07-09 한국전자통신연구원 Method for verifying speech/non-speech and voice recognition apparatus using the same
JP5088741B2 (en) * 2008-03-07 2012-12-05 インターナショナル・ビジネス・マシーンズ・コーポレーション System, method and program for processing voice data of dialogue between two parties
SG189182A1 (en) * 2010-10-29 2013-05-31 Anhui Ustc Iflytek Co Ltd Method and system for endpoint automatic detection of audio record
CN102522081B (en) * 2011-12-29 2015-08-05 北京百度网讯科技有限公司 A kind of method and system detecting sound end
CN103117067B (en) * 2013-01-19 2015-07-15 渤海大学 Voice endpoint detection method under low signal-to-noise ratio
US8923929B2 (en) * 2013-02-01 2014-12-30 Xerox Corporation Method and apparatus for allowing any orientation answering of a call on a mobile endpoint device
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
WO2018100391A1 (en) * 2016-12-02 2018-06-07 Cirrus Logic International Semiconductor Limited Speaker identification
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection

Also Published As

Publication number Publication date
CN108986844A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
CN108986844B (en) Speech endpoint detection method based on speaker speech characteristics
CN101625857A (en) Self-adaptive voice endpoint detection method
JPH08508107A (en) Method and apparatus for speaker recognition
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
JP2003513319A (en) Emphasis of short-term transient speech features
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
CN101625862B (en) Method for detecting voice interval in automatic caption generating system
CN101625860A (en) Method for self-adaptively adjusting background noise in voice endpoint detection
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN101625858A (en) Method for extracting short-time energy frequency value in voice endpoint detection
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
JPH0449952B2 (en)
Sudhakar et al. Automatic speech segmentation to improve speech synthesis performance
Kasap et al. A unified approach to speech enhancement and voice activity detection
Jayan et al. Detection of stop landmarks using Gaussian mixture modeling of speech spectrum
Hahn et al. An improved speech detection algorithm for isolated Korean utterances
Jayan et al. Detection of burst onset landmarks in speech using rate of change of spectral moments
CN108573712B (en) Voice activity detection model generation method and system and voice activity detection method and system
CN112489692A (en) Voice endpoint detection method and device
Kyriakides et al. Isolated word endpoint detection using time-frequency variance kernels
RU2174714C2 (en) Method for separating the basic tone
Kacur et al. ZCPA features for speech recognition
Seman et al. Evaluating endpoint detection algorithms for isolated word from Malay parliamentary speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant