CN108053842A - Shortwave sound end detecting method based on image identification - Google Patents
- Publication number
- CN108053842A (application CN201711330638.1A)
- Authority
- CN
- China
- Prior art keywords
- signal
- spectrum
- voice
- carried out
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Abstract
The invention belongs to the field of speech detection, and in particular relates to a shortwave voice endpoint detection method based on image recognition. The technical scheme is: first pre-process the data to improve the signal-to-noise ratio; then divide the data into frames of a specified length and apply a short-time Fourier transform to obtain the spectrogram; finally, use image recognition to find the voiceprint in the spectrogram and determine the speech segments in the data from the voiceprint distribution. Speech pre-processed by the method of the invention has a similar signal-to-noise ratio, so no parameters need to be adjusted in the subsequent steps; the method can therefore pick out speech segments adaptively under different background noises.
Description
Technical field
The invention belongs to the field of speech detection, and in particular concerns a shortwave voice endpoint detection method based on image recognition.
Background art
Although new radio communication systems continue to emerge, the shortwave radio still receives great attention because of its autonomous communication capability and wide coverage. However, shortwave communication requires the transmitted radio waves to be reflected by the ionosphere, so its noise is relatively strong. The strong background noise prevents monitoring personnel from working for long periods, so noise reduction is necessary, and the segments without speech must be muted. To avoid missing speech during this muting, the performance of the voice endpoint detection method becomes particularly important.
In traditional speech processing there are many endpoint detection methods based on different features, such as endpoint detection based on the correlation function, on the cepstral distance, on the energy-to-zero-crossing ratio, and on wavelet decomposition. For a given speech signal, the speech segments can be selected accurately by tuning the parameters. But in a changing environment that demands real-time communication, adjusting the endpoint detection parameters is impractical, and the traditional speech processing methods no longer apply.
The speech spectrum plot, spectrogram for short, is obtained by a short-time Fourier transform of the speech and shows how the short-time spectrum varies with time. The horizontal axis of the spectrogram is time and the vertical axis is frequency; the gray-scale stripes on it represent the short-time spectrum at each moment. Because the spectrogram reflects the dynamic spectral characteristics of the speech signal, it has important practical value in speech analysis and is called "visible speech".
Summary of the invention
To overcome the defects of the prior art, the present invention proposes an adaptive processing method based on the unique mechanism of human vocalization and on the fact that a noise spectrum contains no voiceprint.
The technical scheme is: first pre-process the data to improve the signal-to-noise ratio; then divide the data into frames of a specified length and apply a short-time Fourier transform to obtain the spectrogram; finally, use image recognition to find the voiceprint in the spectrogram and determine the speech segments in the data from the voiceprint distribution.
A shortwave voice endpoint detection method based on image recognition, with the following steps:
S1. Speech pre-processing. The purpose of pre-processing is to ensure that the voiceprint clarity of the resulting spectrograms is roughly the same, which is the precondition for effective image recognition. The concrete steps are:
S11. During acquisition of the speech signal data, the measurement system introduces, for various reasons, a linear or slowly varying trend error in the time series, which shifts the zero line of the speech signal away from the baseline, possibly by an amount that changes with time; this distorts the correlation function and the power spectrum computed from the speech. A trend term fitted by least squares is removed to eliminate the trend error;
S12. Normalize the amplitude;
S13. Low-pass filter to remove noise above 3500 Hz;
S14. Enhance the speech with spectral subtraction based on the multitaper spectrum.
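The chain S11-S13 can be sketched as follows (the multitaper spectral subtraction of S14 is detailed separately below). The first-order polynomial fit and the 4th-order Butterworth filter are assumptions: the patent specifies only a least-squares trend fit and a 3500 Hz cutoff.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(x, fs, detrend_order=1):
    """Sketch of S11-S13: detrend, normalize, low-pass at 3500 Hz.

    detrend_order and the Butterworth design are assumptions; the patent
    only requires a least-squares trend fit and a 3500 Hz cutoff.
    """
    # S11: remove a least-squares polynomial trend from the time series
    t = np.arange(len(x))
    coeffs = np.polyfit(t, x, detrend_order)
    x = x - np.polyval(coeffs, t)
    # S12: amplitude normalization
    x = x / np.max(np.abs(x))
    # S13: low-pass filter, cutoff 3500 Hz (zero-phase filtering)
    b, a = butter(4, 3500 / (fs / 2), btype="low")
    return filtfilt(b, a, x)
```

Zero-phase filtering (`filtfilt`) is used so the filter does not shift the endpoints in time, which matters for the later detection step.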
S2. Apply image recognition to the obtained spectrogram and build a structure containing the starting point and end point of each voiceprint position in the spectrogram. Specifically:
S21. Divide the speech signal into frames and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra of S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprint in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; collect the resulting starting points and end points of the voiceprint positions into the structure.
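A minimal sketch of S21-S23 follows. The frame length and shift match embodiment 1; the FFT size, the dB threshold, and the run-length test that stands in for the patent's edge detection are all assumptions.

```python
import numpy as np

def spectrogram(x, wlen=200, inc=80, nfft=256):
    """S21-S22 sketch: frame the signal, take a per-frame DFT, and stack
    the magnitudes into a (frequency x time) spectrogram. wlen and inc
    follow embodiment 1; nfft is an assumption."""
    win = np.hamming(wlen)
    nframes = 1 + (len(x) - wlen) // inc
    frames = np.stack([x[i * inc : i * inc + wlen] * win for i in range(nframes)])
    return np.abs(np.fft.rfft(frames, nfft, axis=1)).T

def find_voiceprint_segments(S, thresh_db=-20, min_len=5):
    """S23 sketch: binarize the gray-scale spectrogram and return, per
    frequency row, (start, end) frame indices of bright horizontal runs.
    The threshold and minimum run length are assumptions; the patent uses
    image edge detection to locate the line segments."""
    G = 20 * np.log10(S / S.max() + 1e-12)   # gray-scale (dB) image
    mask = G > thresh_db
    segs = []
    for row in mask:
        d = np.diff(np.concatenate(([0], row.astype(int), [0])))
        starts, ends = np.where(d == 1)[0], np.where(d == -1)[0]
        segs += [(s, e) for s, e in zip(starts, ends) if e - s >= min_len]
    return segs
```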
S3. Endpoint detection. Specifically:
S31. From the structure described in S2, extract the starting-point position vector ST = [st_1, st_2, ..., st_i, ..., st_n] and the end-point position vector EN = [en_1, en_2, ..., en_i, ..., en_n], where st_i is the i-th starting-point position and en_i is the i-th end-point position. Sort ST and EN in ascending order;
S32. Decide the speech segments: a voiceprint is assumed only where three horizontal line segments coexist; the rest is noise. Numerically, when en_i > st_{i+2}, the line segment starting at the i-th point is considered to lie in a speech segment;
S33. For every line segment confirmed to be in a speech segment, search within 100 frames to the left and right for another element st'_i of ST; if one exists, it is also included in the speech segment, replacing the original st_i, and the search within 100 frames to the left and right is repeated until no element of ST remains within 100 frames on either side.
Further, the enhancement of the speech with the multitaper-spectrum spectral subtraction described in S14 proceeds as follows:
Step A. Let the time series of the speech signal be x(n). Apply windowed framing to x(n) with a Hamming window of length wlen; the i-th speech frame is x_i(m), with frame length wlen. Its discrete Fourier transform is X_i(k) = Σ_{m=0}^{wlen-1} x_i(m) e^{-j2πkm/wlen};
Step B. Taking frame i as the center and M frames on each side, 2M+1 frames in total, compute from the X_i(k) of Step A the average amplitude spectrum of each component, |X̄_i(k)| = (1/(2M+1)) Σ_{j=-M}^{M} |X_{i+j}(k)|, and the phase angle of the center frame, ∠X_i(k) = arctan(Im[X_i(k)] / Re[X_i(k)]), where i+j denotes the j-th frame after frame i, Im the imaginary part, and Re the real part;
Step C. The multitaper spectrum averages the spectral estimates obtained by applying several orthogonal data windows to the same data sequence. It is defined as S_mt(k) = (1/L) Σ_{w=1}^{L} S_w(k), where L is the number of data windows and S_w(k) = |Σ_{n=0}^{N-1} a_w(n) x(n) e^{-j2πnk/N}|² is the direct spectrum obtained with data window w; here x(n) is the data sequence, N is the sequence length, and a_w(n) is the w-th data window. The a_w(n) are a set of mutually orthogonal discrete prolate spheroidal sequences, each used to compute a direct spectrum of the same signal, the orthogonality between the data windows meaning Σ_{n=0}^{N-1} a_w(n) a_v(n) = 0 for w ≠ v. Multitaper spectral estimation with the above definition is applied to the framed signal x_i(m), i.e. S_mt^(i)(k) = (1/L) Σ_{w=1}^{L} |Σ_{m=0}^{wlen-1} a_w(m) x_i(m) e^{-j2πmk/wlen}|²;
Step D. Smooth the multitaper power spectral density estimates over adjacent frames to obtain the smoothed power spectral density; average the smoothed density over the leading non-speech frames to obtain the noise average power spectral density; and compute the gain factor from the smoothed power spectral density, the noise average power spectral density, the over-subtraction coefficient, and the gain compensation factor, where NIS denotes the number of frames occupied by the leading non-speech segment;
Step E. From the amplitude spectrum after multitaper spectral subtraction, obtained by applying the gain factor to the average amplitude spectrum of Step B, synthesize the enhanced speech signal using the phase angle of Step B. Multitaper spectral subtraction estimates the noise power from the leading non-speech segment; after the noise component is subtracted from the total power, the speech signal is recovered using the phase relationship. The over-subtraction coefficient determines the degree of enhancement applied to the signal, and the gain compensation factor determines the computation duration.
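The core of Step C can be sketched with SciPy's discrete prolate spheroidal (Slepian) sequences. The number of tapers L and the time-bandwidth product NW are assumptions; the patent only requires a set of mutually orthogonal data windows.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, L=5, NW=3):
    """Sketch of the multitaper estimate of Step C: average the direct
    spectra from L mutually orthogonal Slepian (DPSS) tapers. L and NW
    are assumptions."""
    N = len(frame)
    tapers = dpss(N, NW, Kmax=L)  # L orthogonal data windows a_w(n)
    # direct spectrum per taper, then average: S_mt = (1/L) * sum_w |FFT(a_w * x)|^2
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0)
```

Averaging over orthogonal tapers reduces the variance of the per-frame spectral estimate, which is what makes the subsequent noise subtraction stable at the low SNRs of shortwave speech.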
Further, the over-subtraction coefficient is chosen as follows:
I. Set the initial over-subtraction coefficient to 1 and the initial signal-to-noise ratio snr' = 0;
II. Enhance the speech with the multitaper-spectrum spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If the snr of the processed signal is greater than the initial snr', go to the next step. If it is less than or equal to snr', the speech in the signal is not significant; in that case do no processing, retain the entire speech signal, and output it directly;
IV. If the snr of the processed signal is below 8 dB, increase the over-subtraction coefficient by 0.5, set snr' = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
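Steps I-IV amount to a simple feedback loop, sketched below. `enhance(x, alpha)` and `snr_of(y)` are placeholders standing in for the patent's multitaper spectral subtraction and its SNR estimate.

```python
def choose_over_subtraction(x, enhance, snr_of, target_db=8.0, step=0.5):
    """Sketch of steps I-IV: start with over-subtraction factor 1 and
    raise it by 0.5 until the post-enhancement SNR reaches 8 dB or stops
    improving. enhance/snr_of are assumed callables."""
    alpha, snr_prev = 1.0, 0.0
    while True:
        y = enhance(x, alpha)
        snr = snr_of(y)
        if snr <= snr_prev:       # step III: no significant speech, pass input through
            return x, alpha
        if snr >= target_db:      # step IV stop condition: target reached
            return y, alpha
        alpha += step             # step IV: over-subtract harder and retry
        snr_prev = snr
```

The "stops improving" branch is what lets the method leave a nearly noise-free recording untouched instead of over-subtracting it.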
The beneficial effects of the invention are: speech pre-processed by the method of the invention has a similar signal-to-noise ratio, so no parameters need to be adjusted in the subsequent steps; the method can therefore pick out speech segments adaptively under different background noises.
Description of the drawings
Fig. 1 is a schematic diagram of the improved spectral subtraction based on the multitaper spectrum.
Fig. 2 is the flow chart of the speech enhancement processing.
Fig. 3 is the flow chart of the method of the invention.
Fig. 4 is the time-domain plot of the speech before pre-processing in embodiment 1.
Fig. 5 is the time-domain plot of the speech after pre-processing in embodiment 1.
Fig. 6 shows the spectrum of each speech frame in embodiment 1.
Fig. 7 is the spectrogram after gray-scale processing in embodiment 1.
Fig. 8 shows the horizontal line segments in the gray-scale spectrogram in embodiment 1.
Fig. 9 shows the endpoint detection result on the gray-scale spectrogram in embodiment 1.
Fig. 10 is the time-domain plot of the endpoint detection result in embodiment 1; the left panel shows the original speech and the right panel the pre-processed speech.
Fig. 11 is the time-domain plot of the speech before pre-processing in embodiment 2.
Fig. 12 is the time-domain plot of the speech after pre-processing in embodiment 2.
Fig. 13 shows the spectrum of each speech frame in embodiment 2.
Fig. 14 is the spectrogram after gray-scale processing in embodiment 2.
Fig. 15 shows the horizontal line segments in the gray-scale spectrogram in embodiment 2.
Fig. 16 shows the endpoint detection result on the gray-scale spectrogram in embodiment 2.
Fig. 17 is the time-domain plot of the endpoint detection result in embodiment 2; the left panel shows the original speech and the right panel the pre-processed speech.
Specific embodiments
The present invention is described below with reference to the accompanying drawings.
The method of the invention uses the voiceprint characteristic as the feature of the sound. Because of the unique physiological structure of human vocalization, a voiceprint can be seen in the speech spectrogram. The voiceprint of human speech has a distinctive feature: in a speech segment, the energy distribution over frequency follows a specific pattern, and several horizontal, parallel lines appear in the spectrogram of the speech; these lines are the voiceprint. The voiceprint embodies personal pronunciation characteristics and phoneme features and is widely used in speech recognition.
As shown in Fig. 3, the steps of the method of the invention are as follows:
S1. Speech pre-processing. The purpose of pre-processing is to ensure that the voiceprint clarity of the resulting spectrograms is roughly the same, which is the precondition for effective image recognition. The concrete steps are:
S11. During acquisition of the speech signal data, the measurement system introduces, for various reasons, a linear or slowly varying trend error in the time series, which shifts the zero line of the speech signal away from the baseline, possibly by an amount that changes with time; this distorts the correlation function and the power spectrum computed from the speech. A trend term fitted by least squares is removed to eliminate the trend error;
S12. Normalize the amplitude;
S13. Low-pass filter to remove noise above 3500 Hz;
S14. Enhance the speech with spectral subtraction based on the multitaper spectrum, specifically:
Step A. Let the time series of the speech signal be x(n). Apply windowed framing to x(n) with a Hamming window of length wlen; the i-th speech frame is x_i(m), with frame length wlen. Its discrete Fourier transform is X_i(k) = Σ_{m=0}^{wlen-1} x_i(m) e^{-j2πkm/wlen};
Step B. Taking frame i as the center and M frames on each side, 2M+1 frames in total, compute from the X_i(k) of Step A the average amplitude spectrum of each component, |X̄_i(k)| = (1/(2M+1)) Σ_{j=-M}^{M} |X_{i+j}(k)|, and the phase angle of the center frame, ∠X_i(k) = arctan(Im[X_i(k)] / Re[X_i(k)]), where i+j denotes the j-th frame after frame i, Im the imaginary part, and Re the real part;
Step C. The multitaper spectrum averages the spectral estimates obtained by applying several orthogonal data windows to the same data sequence. It is defined as S_mt(k) = (1/L) Σ_{w=1}^{L} S_w(k), where L is the number of data windows and S_w(k) = |Σ_{n=0}^{N-1} a_w(n) x(n) e^{-j2πnk/N}|² is the direct spectrum obtained with data window w; here x(n) is the data sequence, N is the sequence length, and a_w(n) is the w-th data window. The a_w(n) are a set of mutually orthogonal discrete prolate spheroidal sequences, each used to compute a direct spectrum of the same signal, the orthogonality between the data windows meaning Σ_{n=0}^{N-1} a_w(n) a_v(n) = 0 for w ≠ v. Multitaper spectral estimation with the above definition is applied to the framed signal x_i(m), i.e. S_mt^(i)(k) = (1/L) Σ_{w=1}^{L} |Σ_{m=0}^{wlen-1} a_w(m) x_i(m) e^{-j2πmk/wlen}|²;
Step D. Smooth the multitaper power spectral density estimates over adjacent frames to obtain the smoothed power spectral density; average the smoothed density over the leading non-speech frames to obtain the noise average power spectral density; and compute the gain factor from the smoothed power spectral density, the noise average power spectral density, the over-subtraction coefficient, and the gain compensation factor, where NIS denotes the number of frames occupied by the leading non-speech segment;
Step E. From the amplitude spectrum after multitaper spectral subtraction, obtained by applying the gain factor to the average amplitude spectrum of Step B, synthesize the enhanced speech signal using the phase angle of Step B. Multitaper spectral subtraction estimates the noise power from the leading non-speech segment; after the noise component is subtracted from the total power, the speech signal is recovered using the phase relationship. The over-subtraction coefficient determines the degree of enhancement applied to the signal, and the gain compensation factor determines the computation duration;
The over-subtraction coefficient is chosen as follows:
I. Set the initial over-subtraction coefficient to 1 and the initial signal-to-noise ratio snr' = 0;
II. Enhance the speech with the multitaper-spectrum spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If the snr of the processed signal is greater than the initial snr', go to the next step. If it is less than or equal to snr', the speech in the signal is not significant; in that case do no processing, retain the entire speech signal, and output it directly;
IV. If the snr of the processed signal is below 8 dB, increase the over-subtraction coefficient by 0.5, set snr' = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB;
S2. Apply image recognition to the obtained spectrogram and build a structure containing the starting point and end point of each voiceprint position in the spectrogram. Specifically:
S21. Divide the speech signal into frames and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra of S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprint in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; collect the resulting starting points and end points of the voiceprint positions into the structure.
S3. Endpoint detection. Specifically:
S31. From the structure described in S2, extract the starting-point position vector ST = [st_1, st_2, ..., st_i, ..., st_n] and the end-point position vector EN = [en_1, en_2, ..., en_i, ..., en_n], where st_i is the i-th starting-point position and en_i is the i-th end-point position. Sort ST and EN in ascending order;
S32. Decide the speech segments: a voiceprint is assumed only where three horizontal line segments coexist; the rest is noise. Numerically, when en_i > st_{i+2}, the line segment starting at the i-th point is considered to lie in a speech segment;
S33. For every line segment confirmed to be in a speech segment, search within 100 frames to the left and right for another element st'_i of ST; if one exists, it is also included in the speech segment, replacing the original st_i, and the search within 100 frames to the left and right is repeated until no element of ST remains within 100 frames on either side. The purpose of this arrangement is to prevent broken line segments from degrading the endpoint detection performance.
Specific embodiment 1: pink-noise background
Step 1: read in the file and draw the time-domain plot, Fig. 4; the time-domain plot after speech pre-processing is shown in Fig. 5.
Divide the speech into frames with frame length 200 and frame shift 80; the framed data form a 200×2964 two-dimensional matrix. Applying a Fourier transform to each column of 200 samples (one frame) gives the spectrum of each frame, 2964 spectra in all. Plotting them with time on the horizontal axis and frequency on the vertical axis gives the spectrum plot of Fig. 6; taking the low-frequency part (0 Hz to 3500 Hz) and applying gray-scale processing gives the spectrogram of Fig. 7. (For clarity of display, Figs. 7, 8, and 9 are rotated 90 degrees clockwise.)
In Fig. 7 the white parts contain parallel stripes, i.e. the voiceprint; these are the speech parts. The white parts without stripes are caused by strong noise. The horizontal line segments selected from the figure are shown in Fig. 8.
The starting points and end points are stored and re-sorted by position along the horizontal axis, giving the starting-point vector and the end-point vector. A voiceprint is assumed only where three horizontal line segments coexist; the rest is noise. Numerically this is expressed as en_i > st_{i+2}, i.e. the end position of the i-th line segment exceeds the start position of the (i+2)-th line segment, which serves as the criterion for judging whether there is speech. To ensure no information is missed, the search is then extended to the left and right for possible speech segments. The result is shown in Fig. 9; converted to the time domain it is shown in Fig. 10. With the method of the invention, all speech segments under a pink-noise background are detected.
Specific embodiment 2: strong-noise background
The steps are the same as in embodiment 1; the experimental results are as follows:
It should be noted that under a strong-noise background, considerable noise spectrum remains after speech enhancement, as shown in Fig. 14. In the figure, the regions containing parallel lines have higher energy and are the speech segments; outside the speech segments, because of the strong noise, the remaining energy in the spectrogram is lower and appears as a dotted noise spectrum. When line segments are identified, as in Fig. 15, part of the noise spectrum may be recognized as line segments, which can cause false alarms in endpoint detection. The final detection result is shown in Figs. 16 and 17: all speech segments in the speech are identified, but some parts containing only strong noise are mistaken for speech.
Claims (3)
1. A shortwave voice endpoint detection method based on image recognition, characterized in that its steps are as follows:
S1. Speech pre-processing. The purpose of pre-processing is to ensure that the voiceprint clarity of the resulting spectrograms is roughly the same, which is the precondition for effective image recognition. The concrete steps are:
S11. During acquisition of the speech signal data, the measurement system introduces, for various reasons, a linear or slowly varying trend error in the time series, which shifts the zero line of the speech signal away from the baseline, possibly by an amount that changes with time; this distorts the correlation function and the power spectrum computed from the speech. A trend term fitted by least squares is removed to eliminate the trend error;
S12. Normalize the amplitude;
S13. Low-pass filter to remove noise above 3500 Hz;
S14. Enhance the speech with spectral subtraction based on the multitaper spectrum;
S2. Apply image recognition to the obtained spectrogram and build a structure containing the starting point and end point of each voiceprint position in the spectrogram. Specifically:
S21. Divide the speech signal into frames and apply a short-time Fourier transform frame by frame to obtain the short-time spectra;
S22. Arrange the short-time spectra of S21 in frame order to obtain the spectrogram;
S23. Identify the voiceprint in the spectrogram of S22: convert the color spectrogram to a gray-scale image; extract the image edges of the gray-scale image and identify the positions of the line segments in it; collect the resulting starting points and end points of the voiceprint positions into the structure;
S3. Endpoint detection. Specifically:
S31. From the structure described in S2, extract the starting-point position vector ST = [st_1, st_2, ..., st_i, ..., st_n] and the end-point position vector EN = [en_1, en_2, ..., en_i, ..., en_n], where st_i is the i-th starting-point position and en_i is the i-th end-point position. Sort ST and EN in ascending order;
S32. Decide the speech segments: a voiceprint is assumed only where three horizontal line segments coexist; the rest is noise. Numerically, when en_i > st_{i+2}, the line segment starting at the i-th point is considered to lie in a speech segment;
S33. For every line segment confirmed to be in a speech segment, search within 100 frames to the left and right for another element st'_i of ST; if one exists, it is also included in the speech segment, replacing the original st_i, and the search within 100 frames to the left and right is repeated until no element of ST remains within 100 frames on either side.
2. The shortwave voice endpoint detection method based on image recognition according to claim 1, characterized in that the enhancement of the speech with the multitaper-spectrum spectral subtraction described in S14 proceeds as follows:
Step A. Let the time series of the speech signal be x(n). Apply windowed framing to x(n) with a Hamming window of length wlen; the i-th speech frame is x_i(m), with frame length wlen. Its discrete Fourier transform is X_i(k) = Σ_{m=0}^{wlen-1} x_i(m) e^{-j2πkm/wlen};
Step B. Taking frame i as the center and M frames on each side, 2M+1 frames in total, compute from the X_i(k) of Step A the average amplitude spectrum of each component, |X̄_i(k)| = (1/(2M+1)) Σ_{j=-M}^{M} |X_{i+j}(k)|, and the phase angle of the center frame, ∠X_i(k) = arctan(Im[X_i(k)] / Re[X_i(k)]), where i+j denotes the j-th frame after frame i, Im the imaginary part, and Re the real part;
Step C. The multitaper spectrum averages the spectral estimates obtained by applying several orthogonal data windows to the same data sequence. It is defined as S_mt(k) = (1/L) Σ_{w=1}^{L} S_w(k), where L is the number of data windows and S_w(k) = |Σ_{n=0}^{N-1} a_w(n) x(n) e^{-j2πnk/N}|² is the direct spectrum obtained with data window w; here x(n) is the data sequence, N is the sequence length, and a_w(n) is the w-th data window. The a_w(n) are a set of mutually orthogonal discrete prolate spheroidal sequences, each used to compute a direct spectrum of the same signal, the orthogonality between the data windows meaning Σ_{n=0}^{N-1} a_w(n) a_v(n) = 0 for w ≠ v. Multitaper spectral estimation with the above definition is applied to the framed signal x_i(m), i.e. S_mt^(i)(k) = (1/L) Σ_{w=1}^{L} |Σ_{m=0}^{wlen-1} a_w(m) x_i(m) e^{-j2πmk/wlen}|²;
Step D. Smooth the multitaper power spectral density estimates over adjacent frames to obtain the smoothed power spectral density; average the smoothed density over the leading non-speech frames to obtain the noise average power spectral density; and compute the gain factor from the smoothed power spectral density, the noise average power spectral density, the over-subtraction coefficient, and the gain compensation factor, where NIS denotes the number of frames occupied by the leading non-speech segment;
Step E. From the amplitude spectrum after multitaper spectral subtraction, obtained by applying the gain factor to the average amplitude spectrum of Step B, synthesize the enhanced speech signal using the phase angle of Step B. Multitaper spectral subtraction estimates the noise power from the leading non-speech segment; after the noise component is subtracted from the total power, the speech signal is recovered using the phase relationship. The over-subtraction coefficient determines the degree of enhancement applied to the signal, and the gain compensation factor determines the computation duration.
3. The shortwave voice endpoint detection method based on image recognition according to claim 1, characterized in that the over-subtraction coefficient is chosen as follows:
I. Set the initial over-subtraction coefficient to 1 and the initial signal-to-noise ratio snr' = 0;
II. Enhance the speech with the multitaper-spectrum spectral subtraction and compute the signal-to-noise ratio snr of the processed signal;
III. If the snr of the processed signal is greater than the initial snr', go to the next step. If it is less than or equal to snr', the speech in the signal is not significant; in that case do no processing, retain the entire speech signal, and output it directly;
IV. If the snr of the processed signal is below 8 dB, increase the over-subtraction coefficient by 0.5, set snr' = snr, and repeat steps II-IV until the signal-to-noise ratio exceeds 8 dB.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711330638.1A CN108053842B (en) | 2017-12-13 | 2017-12-13 | Short wave voice endpoint detection method based on image recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711330638.1A CN108053842B (en) | 2017-12-13 | 2017-12-13 | Short wave voice endpoint detection method based on image recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108053842A true CN108053842A (en) | 2018-05-18 |
CN108053842B CN108053842B (en) | 2021-09-14 |
Family
ID=62132480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711330638.1A Active CN108053842B (en) | 2017-12-13 | 2017-12-13 | Short wave voice endpoint detection method based on image recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108053842B (en) |
2017-12-13 CN CN201711330638.1A patent/CN108053842B/en active Active
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1299126A (en) * | 2001-01-16 | 2001-06-13 | 北京大学 | Method for discriminating acoustic figure with base band components and sounding parameters |
US20040260540A1 (en) * | 2003-06-20 | 2004-12-23 | Tong Zhang | System and method for spectrogram analysis of an audio signal |
US20050288923A1 (en) * | 2004-06-25 | 2005-12-29 | The Hong Kong University Of Science And Technology | Speech enhancement by noise masking |
US20100023327A1 (en) * | 2006-11-21 | 2010-01-28 | Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University) | Method for improving speech signal non-linear overweighting gain in wavelet packet transform domain |
CN101727905A (en) * | 2009-11-27 | 2010-06-09 | 江南大学 | Method for acquiring vocal print picture with refined time-frequency structure |
CN102884575A (en) * | 2010-04-22 | 2013-01-16 | 高通股份有限公司 | Voice activity detection |
CN103117066A (en) * | 2013-01-17 | 2013-05-22 | 杭州电子科技大学 | Low signal to noise ratio voice endpoint detection method based on time-frequency instantaneous energy spectrum |
CN105810213A (en) * | 2014-12-30 | 2016-07-27 | 浙江大华技术股份有限公司 | Typical abnormal sound detection method and device |
CN104637497A (en) * | 2015-01-16 | 2015-05-20 | 南京工程学院 | Speech spectrum characteristic extracting method facing speech emotion identification |
CN105489226A (en) * | 2015-11-23 | 2016-04-13 | 湖北工业大学 | Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup |
CN106024010A (en) * | 2016-05-19 | 2016-10-12 | 渤海大学 | Speech signal dynamic characteristic extraction method based on formant curves |
CN106531174A (en) * | 2016-11-27 | 2017-03-22 | 福州大学 | Animal sound recognition method based on wavelet packet decomposition and spectrogram features |
CN106953887A (en) * | 2017-01-05 | 2017-07-14 | 北京中瑞鸿程科技开发有限公司 | A kind of personalized Organisation recommendations method of fine granularity radio station audio content |
CN106971740A (en) * | 2017-03-28 | 2017-07-21 | 吉林大学 | Probability and the sound enhancement method of phase estimation are had based on voice |
Non-Patent Citations (4)
Title |
---|
KUN-CHING WANG ET AL.: "Voice Activity Detection Algorithm with Low Signal-to-Noise Ratios Based on Spectrum Entropy", 《2008 SECOND INTERNATIONAL SYMPOSIUM ON UNIVERSAL COMMUNICATION》 * |
SUN HAIYING: "Research on Speech Endpoint Detection Methods Based on Cepstral Features and Voiced-Speech Characteristics", 《CHINA MASTERS' THESES FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY)》 * |
XIAO CHUNZHI: "A Speech Enhancement Algorithm Based on Spectrogram Analysis", 《SPEECH TECHNOLOGY》 * |
CHEN XIANGMIN ET AL.: "A Speech Endpoint Detection Algorithm Based on Spectrogram", 《SPEECH TECHNOLOGY》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109346105A (en) * | 2018-07-27 | 2019-02-15 | 南京理工大学 | Directly display the pitch period spectrogram method of pitch period track |
CN109346105B (en) * | 2018-07-27 | 2022-04-15 | 南京理工大学 | Pitch period spectrogram method for directly displaying pitch period track |
CN110047470A (en) * | 2019-04-11 | 2019-07-23 | 深圳市壹鸽科技有限公司 | A kind of sound end detecting method |
CN111354378A (en) * | 2020-02-12 | 2020-06-30 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111354378B (en) * | 2020-02-12 | 2020-11-24 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111429905A (en) * | 2020-03-23 | 2020-07-17 | 北京声智科技有限公司 | Voice signal processing method and device, voice intelligent elevator, medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108053842B (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Drugman et al. | Glottal closure and opening instant detection from speech signals | |
CN103854662B (en) | Adaptive voice detection method based on multiple domain Combined estimator | |
CN108735213B (en) | Voice enhancement method and system based on phase compensation | |
CN108053842A (en) | Shortwave sound end detecting method based on image identification | |
CN106486131B (en) | A kind of method and device of speech de-noising | |
CN106971740B (en) | Sound enhancement method based on voice existing probability and phase estimation | |
CN109410977B (en) | Voice segment detection method based on MFCC similarity of EMD-Wavelet | |
CN109545188A (en) | A kind of real-time voice end-point detecting method and device | |
CN105788603A (en) | Audio identification method and system based on empirical mode decomposition | |
CN108899052B (en) | Parkinson speech enhancement method based on multi-band spectral subtraction | |
US9208799B2 (en) | Method and device for estimating a pattern in a signal | |
CN105679312B (en) | The phonetic feature processing method of Application on Voiceprint Recognition under a kind of noise circumstance | |
CN103730126B (en) | Noise suppressing method and noise silencer | |
CN111091833A (en) | Endpoint detection method for reducing noise influence | |
WO2014070139A2 (en) | Speech enhancement | |
CN114242099A (en) | Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network | |
CN111489763B (en) | GMM model-based speaker recognition self-adaption method in complex environment | |
CN110808057A (en) | Voice enhancement method for generating confrontation network based on constraint naive | |
CN107680610A (en) | A kind of speech-enhancement system and method | |
Hsu et al. | Voice activity detection based on frequency modulation of harmonics | |
Amehraye et al. | Perceptual improvement of Wiener filtering | |
CN112233657A (en) | Speech enhancement method based on low-frequency syllable recognition | |
Vetter et al. | Single channel speech enhancement using principal component analysis and MDL subspace selection | |
Xiao et al. | Inventory based speech enhancement for speaker dedicated speech communication systems | |
Li et al. | Robust log-energy estimation and its dynamic change enhancement for in-car speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||