CN105096961B

CN105096961B - Speech separation method and device

Info

Publication number: CN105096961B
Application number: CN201410189386.5A
Authority: CN
Inventors: 杨小洪; 肖玮; 梁山; 刘文举
Original assignee: Huawei Technologies Co Ltd; Institute of Automation of Chinese Academy of Science
Current assignee: Huawei Technologies Co Ltd; Institute of Automation of Chinese Academy of Science
Priority date: 2014-05-06
Filing date: 2014-05-06
Publication date: 2019-02-01
Anticipated expiration: 2034-05-06
Also published as: CN105096961A

Abstract

Embodiments of the present invention provide a speech separation method and device. The speech separation method of this embodiment includes: obtaining a first signal; determining an initial ideal binary masking matrix according to the first signal; performing harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain a harmonically compensated separated speech signal; and filtering the first signal and a second signal according to the harmonically compensated separated speech signal to obtain a target separated speech signal. This reduces energy holes in the target separated speech signal and suppresses its distortion.

Description

Speech separation method and device

Technical field

Embodiments of the present invention relate to the field of signal processing technology, and in particular to a speech separation method and device.

Background

Speech signal processing has attracted considerable research attention in recent years and has produced a series of notable achievements in large-vocabulary continuous speech recognition, speech synthesis, voice communication, and related areas. However, existing speech signal processing techniques were developed for clean speech or for environments with little noise, and they do not always perform satisfactorily in noisier environments. This limits, to some extent, the use of speech-related products in real life. How to suppress or eliminate background noise so as to separate out the target speech signal has therefore become an important research direction in speech signal processing.

Computational auditory scene analysis builds mainly on research in auditory physiology and psychology and performs speech separation with acoustic masking strategies, so that the separated speech better matches the perceptual characteristics of the human ear. In the prior art, computational auditory scene analysis is usually carried out with a threshold-based ideal binary mask (IBM) matrix. The IBM matrix is a 0-1 matrix with the same dimensions as the time-frequency spectrum, in which 1 corresponds to a speech-dominant time-frequency unit and 0 corresponds to a noise-dominant time-frequency unit. In the target speech synthesis stage, the energy of speech-dominant time-frequency units is retained in full and the energy of noise-dominant time-frequency units is discarded in full. However, estimation errors in the threshold-based IBM matrix cause some speech-dominant time-frequency units to be wrongly rejected and some noise-dominant time-frequency units to be wrongly retained. The separated speech signal therefore contains many holes in its speech energy, which substantially distorts the original speech signal.

Summary of the invention

Embodiments of the present invention provide a speech separation method and device that obtain the separated speech signal using computational auditory scene analysis and an ideal floating-value masking strategy, thereby reducing energy holes in the separated speech signal and suppressing its distortion.

In a first aspect, an embodiment of the present invention provides a speech separation method, comprising:

obtaining a first signal, the first signal comprising a speech signal and a noise signal;

determining an initial ideal binary masking matrix according to the first signal, the initial ideal binary masking matrix being used to distinguish the speech signal and the noise signal contained in the first signal;

performing harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain a harmonically compensated separated speech signal; and

filtering the first signal and a second signal according to the harmonically compensated separated speech signal to obtain a target separated speech signal.

In a first possible implementation of the first aspect, determining the initial ideal binary masking matrix according to the first signal comprises:

calculating the average value of the power spectrum of the noise signal;

determining, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units that constitute the initial ideal binary masking matrix; and

determining the initial ideal binary masking matrix according to the values of all time-frequency units that constitute it.

According to the first possible implementation of the first aspect, in a second possible implementation, calculating the average value of the power spectrum of the noise signal comprises:

calculating the average value of the power spectrum of the noise signal according to the number of frames of the first signal used for noise estimation and the power spectral density of the frequency-domain signal in the t-th frame and k-th frequency band of the first signal after Fourier transform, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.

According to the first aspect, or either of the first and second possible implementations of the first aspect, in a third possible implementation, performing harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain the harmonically compensated separated speech signal comprises:

updating the initial ideal binary masking matrix to obtain an updated binary masking matrix, the updated binary masking matrix being used to purify the target separated speech signal; and

performing harmonic compensation on the first signal according to the updated binary masking matrix to obtain the harmonically compensated separated speech signal.

According to the third possible implementation of the first aspect, in a fourth possible implementation, updating the initial ideal binary masking matrix to obtain the updated binary masking matrix comprises:

updating the values of the speech-dominant time-frequency units in the initial ideal binary masking matrix according to the current iteration number and the maximum number of iterations; and

obtaining the updated binary masking matrix according to the result of updating the values of the speech-dominant time-frequency units in the initial ideal binary masking matrix.

According to the third or fourth possible implementation of the first aspect, in a fifth possible implementation, performing harmonic compensation on the first signal according to the updated binary masking matrix to obtain the harmonically compensated separated speech signal comprises:

obtaining an initial separated speech signal of the first signal according to the updated binary masking matrix;

processing the initial separated speech signal to obtain an ideal floating-value masking matrix; and

performing harmonic compensation on the first signal according to the ideal floating-value masking matrix to obtain the harmonically compensated separated speech signal.

According to the fifth possible implementation of the first aspect, in a sixth possible implementation, processing the initial separated speech signal to obtain the ideal floating-value masking matrix comprises:

performing an inverse Fourier transform on the initial separated speech signal to obtain the time-domain signal corresponding to the initial separated speech signal;

performing half-wave rectification on that time-domain signal to obtain a half-wave rectified time-domain signal;

performing a short-time Fourier transform on the half-wave rectified time-domain signal and calculating the power spectral density obtained after the short-time Fourier transform;

smoothing the initial separated speech signal according to the power spectral density obtained after the short-time Fourier transform, to obtain a smoothed result; and

obtaining the ideal floating-value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

According to the sixth possible implementation of the first aspect, in a seventh possible implementation, filtering the first signal and the second signal according to the harmonically compensated separated speech signal to obtain the target separated speech signal comprises:

determining, according to the harmonically compensated separated speech signal, the main-channel filter and the secondary-channel filter to be used when filtering the first signal and the second signal; and

filtering the first signal and the second signal with the main-channel filter and the secondary-channel filter so determined, to obtain the target separated speech signal.

In a second aspect, an embodiment of the present invention provides a speech separation device, comprising:

an obtaining module, configured to obtain a first signal, the first signal comprising a speech signal and a noise signal;

a determining module, configured to determine an initial ideal binary masking matrix according to the first signal, the initial ideal binary masking matrix being used to distinguish the speech signal and the noise signal contained in the first signal;

a harmonic compensation module, configured to perform harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain a harmonically compensated separated speech signal; and

a filtering module, configured to filter the first signal and a second signal according to the harmonically compensated separated speech signal to obtain a target separated speech signal.

In a first possible implementation of the second aspect, the determining module is specifically configured to: calculate the average value of the power spectrum of the noise signal; determine, according to that average value, the values of all time-frequency units that constitute the initial ideal binary masking matrix; and determine the initial ideal binary masking matrix according to the values of all time-frequency units that constitute it.

According to the first possible implementation of the second aspect, in a second possible implementation, the determining module is specifically configured to calculate the average value of the power spectrum of the noise signal according to the number of frames of the first signal used for noise estimation and the power spectral density of the frequency-domain signal in the t-th frame and k-th frequency band of the first signal after Fourier transform, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.

According to the second aspect, or either of the first and second possible implementations of the second aspect, in a third possible implementation, the harmonic compensation module is specifically configured to: update the initial ideal binary masking matrix to obtain an updated binary masking matrix, the updated binary masking matrix being used to purify the target separated speech signal; and perform harmonic compensation on the first signal according to the updated binary masking matrix to obtain the harmonically compensated separated speech signal.

According to the third possible implementation of the second aspect, in a fourth possible implementation, the harmonic compensation module is specifically configured to: update the values of the speech-dominant time-frequency units in the initial ideal binary masking matrix according to the current iteration number and the maximum number of iterations; and obtain the updated binary masking matrix according to the result of that update.

According to the third or fourth possible implementation of the second aspect, in a fifth possible implementation, the harmonic compensation module is specifically configured to: obtain an initial separated speech signal of the first signal according to the updated binary masking matrix; process the initial separated speech signal to obtain an ideal floating-value masking matrix; and perform harmonic compensation on the first signal according to the ideal floating-value masking matrix to obtain the harmonically compensated separated speech signal.

According to the fifth possible implementation of the second aspect, in a sixth possible implementation, the harmonic compensation module is specifically configured to: perform an inverse Fourier transform on the initial separated speech signal to obtain the corresponding time-domain signal; perform half-wave rectification on that time-domain signal to obtain a half-wave rectified time-domain signal; perform a short-time Fourier transform on the half-wave rectified time-domain signal and calculate the power spectral density obtained after the short-time Fourier transform; smooth the initial separated speech signal according to that power spectral density to obtain a smoothed result; and obtain the ideal floating-value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

According to the sixth possible implementation of the second aspect, in a seventh possible implementation, the filtering module is specifically configured to: determine, according to the harmonically compensated separated speech signal, the main-channel filter and the secondary-channel filter to be used when filtering the first signal and the second signal; and filter the first signal and the second signal with the main-channel filter and the secondary-channel filter so determined, to obtain the target separated speech signal.

The speech separation method and device of the embodiments of the present invention obtain a first signal, determine an initial ideal binary masking matrix according to the first signal, perform harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain a harmonically compensated separated speech signal, and filter the first signal and a second signal according to the harmonically compensated separated speech signal to obtain a target separated speech signal. This reduces energy holes in the target separated speech signal and suppresses its distortion.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the speech separation method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the speech separation method provided by Embodiment 2 of the present invention;

Fig. 3 is a schematic structural diagram of the speech separation device 300 provided by Embodiment 3 of the present invention;

Fig. 4 is a schematic structural diagram of the speech separation device 400 provided by Embodiment 4 of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 is a flowchart of the speech separation method provided by Embodiment 1 of the present invention. The method of this embodiment is applicable to reducing distortion of the target separated speech signal on the basis of computational auditory scene analysis. The method is executed by a speech separation device, which is usually implemented in hardware and/or software. The method of this embodiment includes the following steps:

S110: Obtain a first signal, the first signal comprising a speech signal and a noise signal.

S120: Determine an initial ideal binary masking matrix according to the first signal. The initial ideal binary masking matrix is used to distinguish the speech signal and the noise signal contained in the first signal.

S130: Perform harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain a harmonically compensated separated speech signal.

The separated speech signal obtained by a speech separation method based on computational auditory scene analysis often exhibits frequency holes, which distort the separated speech signal. To reduce this phenomenon, this embodiment performs harmonic compensation on the first signal according to the initial ideal binary masking matrix to obtain the harmonically compensated separated speech signal, thereby reducing frequency holes.

S140: Filter the first signal and a second signal according to the harmonically compensated separated speech signal to obtain a target separated speech signal.

Existing single-channel speech separation techniques handle stationary noise very well, but for non-stationary noise such as background music or the voices of non-target speakers they damage the speech. By contrast, filtering the first signal and the second signal according to the harmonically compensated separated speech signal makes full use of the redundant information that arises because the target speech and the noise occupy different spatial positions, and can therefore further suppress distortion of the target separated speech signal.

Specifically, the first signal is obtained, the initial ideal binary masking matrix is determined according to the first signal, harmonic compensation is performed on the first signal according to the initial ideal binary masking matrix to obtain the harmonically compensated separated speech signal, and the first signal and the second signal are filtered according to the harmonically compensated separated speech signal to obtain the target separated speech signal.

With the speech separation method provided in this embodiment, the combination of harmonic compensation guided by the initial ideal binary masking matrix and subsequent filtering of the first signal and the second signal reduces energy holes in the target separated speech signal and suppresses its distortion.

This embodiment is a further optimization based on the above embodiment. Fig. 2 is a flowchart of the speech separation method provided by Embodiment 2 of the present invention. Referring to Fig. 2, the method of this embodiment may include:

S210: Obtain a first signal, the first signal comprising a speech signal and a noise signal.

The first signal may come from a main microphone close to the speaker's mouth.

S220: Calculate the average value of the power spectrum of the noise signal.

The average value of the power spectrum of the noise signal can be calculated from the following quantities:

Y(t, k) denotes the power spectral density of the frequency-domain signal in the t-th frame and k-th frequency band of the first signal after Fourier transform; T denotes the total number of frames of the first signal; D denotes the number of frames at the start of the T-frame mixed speech signal that are used for noise estimation; and D′ denotes the number of frames at the end of the T-frame mixed speech signal that are used for noise estimation. The average is taken over the power spectral densities of those D starting frames and D′ ending frames.

It should be noted that in an actual recording, the frames recorded at the very start and at the very end usually contain no speech energy at all, so speech signal processing often uses these frames to estimate the noise. For example, 20 frames from the start and 20 frames from the end of the recording can be used: the average power spectral density of the first 20 frames and the last 20 frames is taken as the average noise power spectrum, that is, D = 20 and D′ = 20. The values of D and D′ may be the same or different.

S230: Determine, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units that constitute the initial ideal binary masking matrix.

The values of all time-frequency units that constitute the initial ideal binary masking matrix can be determined from the average value of the power spectrum of the noise signal as follows:

Each value is set by a threshold test of the unit's power spectral density against the average noise power spectrum, in which γ denotes a control parameter, 1.5 ≤ γ ≤ 2.5. M(t, k) denotes the value of the time-frequency unit (t, k) corresponding to the t-th frame and k-th frequency band of the first signal: M(t, k) = 1 indicates that the unit is a speech-dominant time-frequency unit, and M(t, k) = 0 indicates that it is a noise-dominant time-frequency unit.

It should be noted that an excessively high control parameter γ loses more target speech energy, whereas an excessively low control parameter leaves more residual noise energy. Setting γ to 2 strikes a good balance between the two.
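A threshold-based mask of this kind might look as follows. The exact comparison is an assumption: the sketch marks a unit as speech-dominant when Y(t, k) exceeds γ times the per-band noise average, which matches the behaviour described for high and low γ but is not confirmed by a formula in this text. The name `initial_binary_mask` is hypothetical.

```python
import numpy as np

def initial_binary_mask(Y, noise_avg, gamma=2.0):
    """Threshold-based initial binary mask M(t, k).

    Assumed rule: M(t, k) = 1 where Y(t, k) > gamma * noise_avg[k], else 0.
    Y is a (T, K) array of power spectral densities; noise_avg has length K.
    """
    if not 1.5 <= gamma <= 2.5:
        raise ValueError("gamma is expected to lie in [1.5, 2.5]")
    # Broadcasting compares every frame against the per-band threshold.
    return (Y > gamma * noise_avg).astype(np.uint8)

# Hypothetical 2-frame, 2-band example with a noise average of 1.0 in each band.
Y_small = np.array([[1.0, 5.0],
                    [3.0, 1.0]])
M_small = initial_binary_mask(Y_small, np.array([1.0, 1.0]), gamma=2.0)
```

With γ = 2, only units whose power exceeds 2.0 are kept, so raising γ discards more units (losing speech energy) and lowering it keeps more (retaining noise).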

S240: Determine the initial ideal binary masking matrix according to the values of all time-frequency units that constitute it.

S250: Update the initial ideal binary masking matrix to obtain an updated binary masking matrix. The updated binary masking matrix is used to purify the target separated speech signal.

Because the power spectral density of some noise-dominant time-frequency units may also be much greater than the average value of the power spectrum of the noise signal, the threshold-based ideal binary masking matrix estimation used in the prior art produces many discrete noise components; for example, noise whose power spectral density is greater than the average noise power spectrum is still retained. In this embodiment, updating the initial ideal binary masking matrix effectively suppresses such discrete binary masking errors, so that the separated speech signal is purer.

For example, updating the initial ideal binary masking matrix to obtain the updated binary masking matrix can be implemented as follows:

The values of the speech-dominant time-frequency units in the initial ideal binary masking matrix are updated according to the current iteration number and the maximum number of iterations, and the updated binary masking matrix is obtained from the result of that update.

Specifically, updating the values of the speech-dominant time-frequency units in the initial ideal binary masking matrix according to the current iteration number and the maximum number of iterations can be implemented as follows:

If the current iteration number i is less than the maximum number of iterations N_iter, a time-frequency unit (t, k) is selected at random from the time-frequency units of the initial ideal binary masking matrix, where N_iter = 3 × K × T, K denotes the number of frequency bands of the first signal after Fourier transform, and 1 ≤ t ≤ T. If the randomly selected time-frequency unit (t, k) is a speech-dominant time-frequency unit, N and N′ are calculated, and the values of p(M(t, k) = 1) and p(M(t, k) = 0) are calculated according to the time-frequency unit distribution function. The ratio r₀ of p(M(t, k) = 1) and p(M(t, k) = 0) is then calculated. If r is less than or equal to r₀, the randomly selected time-frequency unit (t, k) is updated to a noise-dominant time-frequency unit, where r denotes a random number generated by a random function. After the iterations, the initial IBM matrix M corresponding to the first signal becomes the updated matrix M′.

Here, N denotes the total number of time-frequency units in the neighborhood of the time-frequency unit (t, k) corresponding to the t-th frame and k-th frequency band of the first signal, and N′ denotes the number of time-frequency units in that neighborhood that have been marked as speech-dominant. The values of p(M(t, k) = 1) and p(M(t, k) = 0) are determined from the time-frequency unit distribution function.

In the distribution function, α and δ denote different control parameters. A time-frequency unit in the neighborhood of (t, k) is written (t′, k′), and the neighborhood is {(t′, k′) : |t − t′| ≤ N_t, |k − k′| ≤ N_k}, where N_t and N_k are both equal to 1. exp(α × N′) denotes e raised to the power α × N′, and exp(α × (N − N′)) denotes e raised to the power α × (N − N′). p(M(t, k) = 1) denotes the probability, given the neighborhood, of setting the time-frequency unit (t, k) to a speech-dominant time-frequency unit, and p(M(t, k) = 0) denotes the probability, given the neighborhood, of setting it to a noise-dominant time-frequency unit. The distribution function specifies each of these two probabilities up to proportionality.

It should be noted that with the control parameter α equal to 2 and the control parameter δ equal to 0.25, both local and neighborhood energy distribution information are taken into account, which effectively suppresses discrete binary masking errors.

S260: Perform harmonic compensation on the first signal according to the updated binary masking matrix to obtain the harmonically compensated separated speech signal.

For example, performing harmonic compensation on the first signal according to the updated binary masking matrix to obtain the harmonically compensated separated speech signal can be implemented as follows:

An initial separated speech signal of the first signal is obtained according to the updated binary masking matrix; the initial separated speech signal is processed to obtain an ideal floating-value masking matrix; and harmonic compensation is performed on the first signal according to the ideal floating-value masking matrix to obtain the harmonically compensated separated speech signal.

Processing the initial separated speech signal to obtain the ideal floating-value masking matrix can be implemented as follows:

An inverse Fourier transform is performed on the initial separated speech signal to obtain the corresponding time-domain signal. Half-wave rectification is performed on that time-domain signal to obtain a half-wave rectified time-domain signal. A short-time Fourier transform is performed on the half-wave rectified time-domain signal, and the power spectral density obtained after the short-time Fourier transform is calculated. The initial separated speech signal is smoothed according to that power spectral density to obtain a smoothed result. The ideal floating-value masking matrix is then obtained according to the average value of the power spectrum of the noise signal and the smoothed result.

Specifically, the harmonically compensated separated speech signal is obtained as follows:

According to the updated binary masking matrix M′, the initial separated speech signal M′(t, k)y(t, k) of the first signal is obtained, where y(t, k) denotes the frequency-domain signal in the t-th frame and k-th frequency band of the first signal after Fourier transform, and M′ is the matrix obtained by iteratively updating the initial ideal binary masking matrix M. An inverse short-time Fourier transform (ISTFT) is applied to M′(t, k)y(t, k) to obtain the corresponding time-domain signal s(t), where s(t) = ISTFT(M′(t, k)y(t, k)). Half-wave rectification is applied to s(t) to obtain the rectified time-domain signal max(s(t), 0), where max(s(t), 0) takes the larger of s(t) and 0. A short-time Fourier transform (STFT) is applied to the rectified signal, and the power spectral density obtained after the STFT is calculated. Smoothing is then performed according to that power spectral density to obtain the smoothed result, where μ denotes a control parameter, 0.5 ≤ μ ≤ 0.9. The ideal floating-value masking matrix R(t, k) is determined from the average value of the power spectrum of the noise signal and the smoothed result. According to R(t, k), the frequency-domain signal Q(t, k) of the harmonically compensated separated speech signal is obtained, where Q(t, k) = R(t, k)y(t, k). Finally, an ISTFT is applied to Q(t, k), and the resulting time-domain signal ISTFT(Q(t, k)) is determined to be the harmonically compensated separated speech signal.
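The role of the half-wave rectification step max(s(t), 0) can be illustrated with a small numerical experiment: rectifying a pure tone creates energy at its harmonics, which is one plausible reason the step helps refill harmonic components that the binary mask removed. The sketch below shows only this property of rectification, not the full compensation chain; the sampling rate, tone frequency, and helper name `amplitude_at` are hypothetical.

```python
import numpy as np

fs = 8000            # sampling rate in Hz (hypothetical)
f0 = 200             # frequency of the test tone in Hz (hypothetical)
n = np.arange(fs)    # one second of samples, so FFT bin index equals frequency in Hz
s = np.sin(2 * np.pi * f0 * n / fs)

rectified = np.maximum(s, 0.0)   # half-wave rectification: max(s(t), 0)

def amplitude_at(x, freq_hz):
    """Amplitude of the sinusoidal component of x at an integer frequency in Hz."""
    spectrum = np.fft.rfft(x)
    return 2.0 * np.abs(spectrum[freq_hz]) / len(x)

amp_before = amplitude_at(s, 2 * f0)          # second harmonic of the pure tone
amp_after = amplitude_at(rectified, 2 * f0)   # second harmonic after rectification
```

The pure tone has essentially no component at 2·f0, whereas the rectified tone does; its Fourier series puts an amplitude of 2/(3π) at the second harmonic.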

S270, according to the separation voice signal after harmonic compensation, determine when being filtered to the first signal and the second signal The filter of the main channel of use and the filter of subaisle.

According to the separation voice signal after harmonic compensationCalculation formula H when being minimized₁And h₂, and willH when being minimized₁And h₂ It is expressed asWithy₁Indicate the first signal, y₂Indicate second signal, h₁Indicate the spatial filter of main channel, h₂It indicates The spatial filter of secondary channels, λ expression control parameter, 0.0001≤λ≤0.05,It indicatesTwo norms square, | | h₁||₁Indicate h₁A norm, | | h₂||₁Indicate h₂A norm.Its In, second signal can be the signal from the secondary microphone remote apart from speaker mouth.

It should be noted that the filter of main channel can be for close to the corresponding filtering of the close main microphon in speaker mouth The filter of device, subaisle can be for close to the secondary microphone corresponding filtering remoter than the main microphon of distance of speaker mouth Device.

S280, according to the filter of the main channel and the filter of the secondary channel used when filtering the first signal and the second signal, filter the first signal and the second signal to obtain the target separated voice signal.

The first signal and the second signal are filtered according to the optimal h₁ and h₂, and the filtered result is determined as the target separated voice signal. It should be noted that the optimal h₁ and h₂ are, respectively, the filter of the main channel and the filter of the secondary channel used when filtering the first signal and the second signal.
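The filtering in S280 can be sketched as follows, assuming that each filter is a causal FIR filter and that the target separated voice signal is the sum of the two filtered channels, as this embodiment describes. The function names and the truncation of the output to the input length are illustrative assumptions.

```python
def fir_filter(h, y):
    """Causal FIR filtering: out[t] = sum_j h[j] * y[t - j], truncated to len(y)."""
    return [
        sum(h[j] * y[t - j] for j in range(len(h)) if t - j >= 0)
        for t in range(len(y))
    ]


def filter_and_sum(y1, y2, h1, h2):
    """Target estimate as the sum of the filtered main and secondary channels.

    Assumes y1 and y2 have the same length.
    """
    main = fir_filter(h1, y1)
    secondary = fir_filter(h2, y2)
    return [a + b for a, b in zip(main, secondary)]
```

For example, with `h1 = [1.0]` and `h2 = [0.0]` the result is simply the main-channel signal, which is a quick sanity check for any filter estimate.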

It should be noted that the first signal and the second signal may each be regarded as the sum of the target voice signal and the background noise after corresponding time shifts and attenuation, and that the target separated voice signal is equal to the sum of the first signal and the second signal after filtering, where y₁ and y₂ are the first signal and the second signal, respectively. Therefore, the target separated voice signal can be determined simply by calculating appropriate h₁ and h₂. To calculate appropriate h₁ and h₂, the separated voice signal after harmonic compensation can be used as the target that the target separated voice signal should approach. In this embodiment, the h₁ and h₂ that minimize the objective function are used as the filter of the main channel and the filter of the secondary channel when filtering the first signal and the second signal, and the target separated voice signal is calculated with them.

The speech separation method provided in this embodiment calculates the average value of the power spectrum of the noise signal and, according to that average value, determines the values of all time-frequency units constituting the initial ideal two-value masking matrix. The initial ideal two-value masking matrix is updated to obtain an updated two-value masking matrix, and harmonic compensation is performed on the first signal according to the updated two-value masking matrix to obtain the separated voice signal after harmonic compensation. According to the separated voice signal after harmonic compensation, the filter of the main channel and the filter of the secondary channel used when filtering the first signal and the second signal are determined, and the first signal and the second signal are filtered with these filters to obtain the target separated voice signal. In this way, more of the energy of voice-dominated time-frequency units is retained and the energy of noise-dominated time-frequency units is effectively rejected, which makes the target separated voice signal purer, reduces the generation of energy volution in the target separated voice signal, and suppresses distortion of the target separated voice signal.

Fig. 3 is a structural schematic diagram of the speech separation device 300 provided by Embodiment 3 of the present invention. The device of this embodiment is suitable for reducing distortion of the target separated voice signal based on computational auditory scene analysis. The device is usually implemented in hardware and/or software. The device of this embodiment includes the following modules: an obtaining module 310, a determining module 320, a harmonic compensation module 330, and a filter module 340.

The obtaining module 310 is configured to obtain the first signal, the first signal including a voice signal and a noise signal. The determining module 320 is configured to determine the initial ideal two-value masking matrix according to the first signal, the initial ideal two-value masking matrix being used to distinguish the voice signal and the noise signal included in the first signal. The harmonic compensation module 330 is configured to perform harmonic compensation on the first signal according to the initial ideal two-value masking matrix, to obtain the separated voice signal after harmonic compensation. The filter module 340 is configured to filter the first signal and the second signal according to the separated voice signal after harmonic compensation, to obtain the target separated voice signal.
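The data flow between the four modules can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: the class name `SpeechSeparationDevice`, the callable-based design, and the argument order of each callable are hypothetical and are not taken from this embodiment.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechSeparationDevice:
    """Hypothetical wiring of modules 310 to 340 as plain callables."""

    obtain: Callable[[], Any]                     # obtaining module 310
    determine_mask: Callable[[Any], Any]          # determining module 320
    harmonic_compensate: Callable[[Any, Any], Any]  # harmonic compensation module 330
    filter_channels: Callable[[Any, Any, Any], Any]  # filter module 340

    def separate(self, second_signal):
        """Run the four stages in the order described for the device."""
        first_signal = self.obtain()
        mask = self.determine_mask(first_signal)
        compensated = self.harmonic_compensate(first_signal, mask)
        return self.filter_channels(compensated, first_signal, second_signal)
```

Keeping each stage behind its own callable mirrors the module split in the device description and lets each stage be replaced or tested in isolation.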

Further, the determining module 320 is specifically configured to: calculate the average value of the power spectrum of the noise signal; determine, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units constituting the initial ideal two-value masking matrix; and determine the initial ideal two-value masking matrix according to the values of all time-frequency units constituting it.

Further, the determining module 320 is specifically configured to calculate the average value of the power spectrum of the noise signal according to the number of frames used for noise estimation in the first signal and the power spectral density of the frequency-domain signal of the t-th frame and k-th frequency band obtained after Fourier transformation of the first signal, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.
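A minimal sketch of such an average is given below. It assumes that the frames used for noise estimation are the first frames of the first signal and that the average is taken separately for each frequency band; neither detail is stated in this passage, and the function name and the nested-list layout `psd[t][k]` are illustrative.

```python
def noise_power_average(psd, noise_frames):
    """Per-band mean of psd[t][k] over the first `noise_frames` frames.

    psd[t][k] is the power spectral density of frame t, band k (0-based here).
    """
    bands = len(psd[0])
    return [
        sum(psd[t][k] for t in range(noise_frames)) / noise_frames
        for k in range(bands)
    ]
```

The resulting per-band values would then serve as the reference against which each time-frequency unit is judged when the masking matrix is built.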

Further, the harmonic compensation module 330 is specifically configured to: update the initial ideal two-value masking matrix to obtain an updated two-value masking matrix, the updated two-value masking matrix being used to purify the target separated voice signal; and perform harmonic compensation on the first signal according to the updated two-value masking matrix, to obtain the separated voice signal after harmonic compensation.

Further, the harmonic compensation module 330 is specifically configured to: update, according to the current iteration number and the maximum number of iterations, the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix; and obtain the updated two-value masking matrix according to the result of updating the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix.

Further, the harmonic compensation module 330 is specifically configured to: obtain the initially separated voice signal of the first signal according to the updated two-value masking matrix; process the initially separated voice signal to obtain the ideal floating value masking matrix; and perform harmonic compensation on the first signal according to the ideal floating value masking matrix, to obtain the separated voice signal after harmonic compensation.

Further, the harmonic compensation module 330 is specifically configured to: perform an inverse Fourier transform on the initially separated voice signal to obtain the time-domain signal corresponding to the initially separated voice signal; perform half-wave rectification on that time-domain signal to obtain the half-wave-rectified time-domain signal; perform a short-time Fourier transform (STFT) on the half-wave-rectified time-domain signal and calculate the power spectral density obtained after the STFT; smooth the initially separated voice signal according to the power spectral density obtained after the STFT, to obtain the smoothed result; and obtain the ideal floating value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

Further, the filter module 340 is specifically configured to: determine, according to the separated voice signal after harmonic compensation, the filter of the main channel and the filter of the secondary channel to be used when filtering the first signal and the second signal; and filter the first signal and the second signal according to the filter of the main channel and the filter of the secondary channel, to obtain the target separated voice signal.

The speech separation device provided in this embodiment obtains the first signal, determines the initial ideal two-value masking matrix according to the first signal, performs harmonic compensation on the first signal according to the initial ideal two-value masking matrix to obtain the separated voice signal after harmonic compensation, and filters the first signal and the second signal according to the separated voice signal after harmonic compensation to obtain the target separated voice signal. This reduces the generation of energy volution in the target separated voice signal and suppresses distortion of the target separated voice signal.

Correspondingly, referring to Fig. 4, Fig. 4 is a structural schematic diagram of the speech separation device 400 provided by Embodiment 4 of the present invention. The device includes a processor 401, a memory 402, a communication interface 403, and a bus 404, where the processor 401, the memory 402, and the communication interface 403 are connected to one another by the bus 404.

The memory 402 is configured to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions.

The processor 401 executes the program stored in the memory 402 to implement the speech separation method, as follows:

The processor 401 is configured to: obtain the first signal, the first signal including a voice signal and a noise signal; determine the initial ideal two-value masking matrix according to the first signal, the initial ideal two-value masking matrix being used to distinguish the voice signal and the noise signal included in the first signal; perform harmonic compensation on the first signal according to the initial ideal two-value masking matrix, to obtain the separated voice signal after harmonic compensation; and filter the first signal and the second signal according to the separated voice signal after harmonic compensation, to obtain the target separated voice signal.

Further, the processor 401 is specifically configured to: calculate the average value of the power spectrum of the noise signal; determine, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units constituting the initial ideal two-value masking matrix; and determine the initial ideal two-value masking matrix according to the values of all time-frequency units constituting it.

Further, the processor 401 is specifically configured to calculate the average value of the power spectrum of the noise signal according to the number of frames used for noise estimation in the first signal and the power spectral density of the frequency-domain signal of the t-th frame and k-th frequency band obtained after Fourier transformation of the first signal, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.

Further, the processor 401 is specifically configured to: update the initial ideal two-value masking matrix to obtain an updated two-value masking matrix, the updated two-value masking matrix being used to purify the target separated voice signal; and perform harmonic compensation on the first signal according to the updated two-value masking matrix, to obtain the separated voice signal after harmonic compensation.

Further, the processor 401 is specifically configured to: update, according to the current iteration number and the maximum number of iterations, the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix; and obtain the updated two-value masking matrix according to the result of updating the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix.

Further, the processor 401 is specifically configured to: obtain the initially separated voice signal of the first signal according to the updated two-value masking matrix; process the initially separated voice signal to obtain the ideal floating value masking matrix; and perform harmonic compensation on the first signal according to the ideal floating value masking matrix, to obtain the separated voice signal after harmonic compensation.

Further, the processor 401 is specifically configured to: perform an inverse Fourier transform on the initially separated voice signal to obtain the time-domain signal corresponding to the initially separated voice signal; perform half-wave rectification on that time-domain signal to obtain the half-wave-rectified time-domain signal; perform a short-time Fourier transform (STFT) on the half-wave-rectified time-domain signal and calculate the power spectral density obtained after the STFT; smooth the initially separated voice signal according to the power spectral density obtained after the STFT, to obtain the smoothed result; and obtain the ideal floating value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

Further, the processor 401 is specifically configured to: determine, according to the separated voice signal after harmonic compensation, the filter of the main channel and the filter of the secondary channel to be used when filtering the first signal and the second signal; and filter the first signal and the second signal according to the filter of the main channel and the filter of the secondary channel, to obtain the target separated voice signal.

Those of ordinary skill in the art will appreciate that all or some of the steps of the foregoing method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech separation method, characterized by comprising:

determining an initial ideal two-value masking matrix according to the first signal, the initial ideal two-value masking matrix being used to distinguish the voice signal and the noise signal included in the first signal;

performing harmonic compensation on the first signal according to the initial ideal two-value masking matrix, to obtain a separated voice signal after harmonic compensation;

filtering the first signal and the second signal according to the separated voice signal after harmonic compensation, to obtain a target separated voice signal;

wherein the first signal is the signal of the main channel in a mixed voice signal, and the second signal is the signal of the secondary channel in the mixed voice signal.

2. The method according to claim 1, characterized in that the determining an initial ideal two-value masking matrix according to the first signal comprises:

calculating the average value of the power spectrum of the noise signal;

determining, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units constituting the initial ideal two-value masking matrix;

determining the initial ideal two-value masking matrix according to the values of all time-frequency units constituting the initial ideal two-value masking matrix.

3. The method according to claim 2, characterized in that the calculating the average value of the power spectrum of the noise signal comprises:

calculating the average value of the power spectrum of the noise signal according to the number of frames used for noise estimation in the first signal and the power spectral density of the frequency-domain signal of the t-th frame and k-th frequency band obtained after Fourier transformation of the first signal, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.

4. The method according to any one of claims 1 to 3, characterized in that the performing harmonic compensation on the first signal according to the initial ideal two-value masking matrix, to obtain a separated voice signal after harmonic compensation, comprises:

updating the initial ideal two-value masking matrix to obtain an updated two-value masking matrix, the updated two-value masking matrix being used to purify the target separated voice signal;

performing harmonic compensation on the first signal according to the updated two-value masking matrix, to obtain the separated voice signal after harmonic compensation.

5. The method according to claim 4, characterized in that the updating the initial ideal two-value masking matrix to obtain an updated two-value masking matrix comprises:

updating, according to the current iteration number and the maximum number of iterations, the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix;

obtaining the updated two-value masking matrix according to the result of updating the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix.

6. The method according to claim 5, characterized in that the performing harmonic compensation on the first signal according to the updated two-value masking matrix, to obtain the separated voice signal after harmonic compensation, comprises:

performing harmonic compensation on the first signal according to the ideal floating value masking matrix, to obtain the separated voice signal after harmonic compensation.

7. The method according to claim 6, characterized in that the processing the initially separated voice signal to obtain the ideal floating value masking matrix comprises:

performing an inverse Fourier transform on the initially separated voice signal, to obtain the time-domain signal corresponding to the initially separated voice signal;

performing half-wave rectification on the time-domain signal corresponding to the initially separated voice signal, to obtain the half-wave-rectified time-domain signal;

performing a short-time Fourier transform on the half-wave-rectified time-domain signal, and calculating the power spectral density obtained after the short-time Fourier transform;

smoothing the initially separated voice signal according to the power spectral density obtained after the short-time Fourier transform, to obtain the smoothed result;

obtaining the ideal floating value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

8. The method according to claim 7, characterized in that the filtering the first signal and the second signal according to the separated voice signal after harmonic compensation, to obtain the target separated voice signal, comprises:

determining, according to the separated voice signal after harmonic compensation, the filter of the main channel and the filter of the secondary channel to be used when filtering the first signal and the second signal;

filtering the first signal and the second signal according to the filter of the main channel and the filter of the secondary channel used when filtering the first signal and the second signal, to obtain the target separated voice signal.

9. A speech separation device, characterized by comprising:

a determining module, configured to determine an initial ideal two-value masking matrix according to the first signal, the initial ideal two-value masking matrix being used to distinguish the voice signal and the noise signal included in the first signal;

a harmonic compensation module, configured to perform harmonic compensation on the first signal according to the initial ideal two-value masking matrix, to obtain a separated voice signal after harmonic compensation;

a filter module, configured to filter the first signal and the second signal according to the separated voice signal after harmonic compensation, to obtain a target separated voice signal;

10. The device according to claim 9, characterized in that the determining module is specifically configured to: calculate the average value of the power spectrum of the noise signal; determine, according to the average value of the power spectrum of the noise signal, the values of all time-frequency units constituting the initial ideal two-value masking matrix; and determine the initial ideal two-value masking matrix according to the values of all time-frequency units constituting the initial ideal two-value masking matrix.

11. The device according to claim 10, characterized in that the determining module is specifically configured to calculate the average value of the power spectrum of the noise signal according to the number of frames used for noise estimation in the first signal and the power spectral density of the frequency-domain signal of the t-th frame and k-th frequency band obtained after Fourier transformation of the first signal, where t is an integer greater than or equal to 1 and k is an integer greater than or equal to 1.

12. The device according to any one of claims 9 to 11, characterized in that the harmonic compensation module is specifically configured to: update the initial ideal two-value masking matrix to obtain an updated two-value masking matrix, the updated two-value masking matrix being used to purify the target separated voice signal; and perform harmonic compensation on the first signal according to the updated two-value masking matrix, to obtain the separated voice signal after harmonic compensation.

13. The device according to claim 12, characterized in that the harmonic compensation module is specifically configured to: update, according to the current iteration number and the maximum number of iterations, the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix; and obtain the updated two-value masking matrix according to the result of updating the values of the voice-dominated time-frequency units in the initial ideal two-value masking matrix.

14. The device according to claim 13, characterized in that the harmonic compensation module is specifically configured to: obtain the initially separated voice signal of the first signal according to the updated two-value masking matrix; process the initially separated voice signal to obtain the ideal floating value masking matrix; and perform harmonic compensation on the first signal according to the ideal floating value masking matrix, to obtain the separated voice signal after harmonic compensation.

15. The device according to claim 14, characterized in that the harmonic compensation module is specifically configured to: perform an inverse Fourier transform on the initially separated voice signal, to obtain the time-domain signal corresponding to the initially separated voice signal; perform half-wave rectification on the time-domain signal corresponding to the initially separated voice signal, to obtain the half-wave-rectified time-domain signal; perform a short-time Fourier transform on the half-wave-rectified time-domain signal, and calculate the power spectral density obtained after the short-time Fourier transform; smooth the initially separated voice signal according to the power spectral density obtained after the short-time Fourier transform, to obtain the smoothed result; and obtain the ideal floating value masking matrix according to the average value of the power spectrum of the noise signal and the smoothed result.

16. The device according to claim 15, characterized in that the filter module is specifically configured to: determine, according to the separated voice signal after harmonic compensation, the filter of the main channel and the filter of the secondary channel to be used when filtering the first signal and the second signal; and filter the first signal and the second signal according to the filter of the main channel and the filter of the secondary channel used when filtering the first signal and the second signal, to obtain the target separated voice signal.