CN107578784A

CN107578784A - Method and device for extracting a target source from audio

Info

Publication number: CN107578784A
Application number: CN201710816430.4A
Authority: CN
Inventors: 郑羲光; 尚梦宸; 刘飞
Original assignee: Sound Man (beijing) Technology Co Ltd
Current assignee: Suzhou Yinman Technology Co.,Ltd.
Priority date: 2017-09-12
Filing date: 2017-09-12
Publication date: 2018-01-12
Anticipated expiration: 2037-09-12
Also published as: CN107578784B

Abstract

The present invention discloses a method and device for extracting a target source from audio. The method comprises: performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal, and segmenting the frequency-domain signal with a window function to form two signals; traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle of the virtual source corresponding to that frequency of the two signals; comparing the virtual angle with a predetermined angle threshold, taking one of the signals as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal; and converting the stored frequency-domain signal of the target source signal into a time-domain signal by an inverse time-frequency transform and outputting the target source time-domain signal. The invention thereby separates the target source signal from the audio signal.

Description

Method and device for extracting a target source from audio

Technical field

The present invention relates to the technical field of audio signal processing, and in particular to a method and device for extracting a target source from audio.

Background

At present, most singing-scoring systems on the KTV (karaoke) market score a performance by its pitch contour or its volume. They cannot truly score according to the singer's voice, and such low-precision scoring increasingly fails to meet consumer demand. Imagine that a person who sings very well and a person who sings poorly receive the same score, or that a poor singer receives a high score merely because of volume: such scoring greatly dampens people's enthusiasm for being scored. Improving KTV scoring systems has therefore become very important. To make scoring more accurate, the original singer's vocal in the song can be compared with the consumer's singing in the KTV: the better the two match, the higher the score. The first step is to extract the original singer's vocal on its own from the song, which contains both vocal and accompaniment. How to extract the original vocal cleanly from song audio that contains both vocal and accompaniment has, however, become a difficult problem.

Summary of the invention

In view of the technical deficiencies of the prior art, an object of the present invention is to provide a method and device for extracting a target source from audio.

The technical solution adopted to achieve the object of the present invention is as follows:

A method for extracting a target source from audio, comprising the steps of:

Performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal, and segmenting the frequency-domain signal with a window function to form a first signal and a second signal;

Traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle of the virtual source corresponding to that frequency of the first signal and the second signal;

Comparing the virtual angle with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

Converting the stored frequency-domain signal of the target source signal into a time-domain signal by an inverse time-frequency transform, and outputting the target source time-domain signal.

Another aspect of the present invention provides a device for extracting a target source from audio, comprising:

A time-frequency transform and segmentation module, for performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal, and segmenting the frequency-domain signal with a window function to form a first signal and a second signal;

A virtual angle calculation module, for traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle of the virtual source corresponding to that frequency of the first signal and the second signal;

A target source signal storage module, for comparing the virtual angle with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

An inverse transform and output module, for converting the stored frequency-domain signal of the target source signal into a time-domain signal by an inverse time-frequency transform, and outputting the target source time-domain signal.

The method of the invention transforms the audio signal to be separated from the time domain to the frequency domain frame by frame and forms a first signal and a second signal. It then traverses each frame of the frequency-domain signal and calculates, at each frequency, the virtual angle of the virtual source corresponding to the first signal and the second signal. By comparing this virtual angle with a predetermined angle threshold, the target source signal that meets the requirement is separated and stored, and is output after conversion from the frequency domain back to the time domain. The target source signal is thus separated and extracted from the audio signal, which facilitates its subsequent processing and use.

Brief description of the drawings

Fig. 1 is a flow chart of the method for extracting a target source from audio;

Fig. 2 is a schematic diagram of the calculation of the virtual angle of the virtual source;

Fig. 3 is a structural diagram of the device for extracting a target source from audio.

Detailed description

The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.

Referring to Fig. 1, a method for extracting a target source from audio comprises the steps of:

Performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal, and segmenting the frequency-domain signal with a window function to form a first signal a and a second signal b;

Traversing each frame of the frequency-domain signal and calculating, at each given frequency k, the virtual angle θ_ab(k) of the virtual source corresponding to that frequency of the first signal and the second signal;

Comparing the virtual angle θ_ab(k) with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

In a KTV system, the method of the invention can separate the vocal (the target source) from the mixed audio of a song's vocal and accompaniment, store it on its own, and output it. This provides a basis for accurately assessing a singer's actual singing level in a subsequent KTV scoring system. When the method is used in a KTV system to extract the vocal from the mixed audio of vocal and accompaniment, the first signal is, for example, taken as the accompaniment signal and the second signal as the vocal signal. The virtual angle of the virtual source is calculated and compared with the predetermined angle threshold. Because the virtual angles of the virtual sources of vocal and accompaniment differ, a predetermined angle threshold can be set to tell them apart: according to the comparison result, the signal corresponding to the virtual source whose virtual angle meets the requirement is stored separately as the vocal signal. Processing every frame of the audio signal in the same way separates the vocal of the song from the mixed audio of vocal and accompaniment and stores it.

The predetermined angle threshold is determined empirically and may be 5 degrees, 3 degrees, or another angle. Window functions of the same size or of different sizes may be selected as required, so as to minimize spectral energy leakage when the frequency-domain signal is segmented with the window function. When the stored frequency-domain signal of the target source signal is converted into a time-domain signal by the inverse time-frequency transform (ISTFT) and the target source time-domain signal is output, the same window size that was used for the corresponding segmentation must be used for the reconstruction.
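The matched windowing on the forward and inverse transforms described above can be sketched as follows. This is a minimal illustration in pure Python, assuming a periodic Hann window, a naive DFT, and overlap-add resynthesis; the function names (`hann`, `dft`, `stft`, `istft`), the frame length, and the hop size are hypothetical choices rather than values taken from the patent, and the sketch does not show how the first and second signals are formed.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed window; the patent names none)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for tiny demo frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def stft(signal, frame_len, hop):
    """Window each frame and transform it to the frequency domain."""
    win = hann(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [dft([signal[s + i] * win[i] for i in range(frame_len)]) for s in starts]


def istft(frames, frame_len, hop, length):
    """Overlap-add inverse that reuses the window size of the forward pass."""
    win = hann(frame_len)
    out = [0.0] * length
    norm = [0.0] * length
    for m, spec in enumerate(frames):
        frame = dft(spec, inverse=True)
        for i in range(frame_len):
            out[m * hop + i] += frame[i].real * win[i]
            norm[m * hop + i] += win[i] ** 2
    # Samples where the summed squared window is zero cannot be recovered.
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]
```

With an 8-sample frame and a hop of 4, a short sine wave passed through `stft` and then `istft` comes back essentially unchanged, except for the very first sample, where the Hann window is zero. Using a different window length in `istft` than in `stft` would break this round trip, which is the point the paragraph above makes.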

The virtual angle of the virtual source is calculated as follows:

In the formula, θ_ab(k) denotes the virtual angle of the virtual source corresponding to frequency k of the first signal a and the second signal b, A_a(k) and A_b(k) denote the amplitudes at frequency k presented by the first signal a and the second signal b respectively, and the angle term denotes the angle between the first signal a and the second signal b.
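The formula itself is not reproduced in this text. One relation that fits the symbols defined above is the stereophonic tangent panning law, θ_ab(k) = arctan[(A_a(k) − A_b(k)) / (A_a(k) + A_b(k)) · tan φ], where φ stands for the angle between the two signals. The sketch below assumes that law; the symbol φ, the function name `virtual_angle`, the 30° default, and the sign convention (positive toward signal a) are assumptions made for illustration and may differ from the patent's own formula.

```python
import math


def virtual_angle(amp_a, amp_b, phi_deg=30.0):
    """Virtual-source angle in degrees for one frequency bin.

    Assumption: tangent panning law; positive angles lean toward signal a,
    negative angles toward signal b. phi_deg is the angle of each signal
    from the centre line (30 degrees matches the range quoted in the text).
    """
    total = amp_a + amp_b
    if total == 0.0:
        return 0.0  # silent bin: no direction, treat it as centred
    ratio = (amp_a - amp_b) / total
    return math.degrees(math.atan(ratio * math.tan(math.radians(phi_deg))))
```

Under this assumption, equal amplitudes place the virtual source at 0°, and a bin in which only one signal is present places it at +30° or −30°, which agrees with the −30° to 30° range stated for the two signals.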

Owing to the sparsity of audio signals, at the same time and the same frequency, the output of a time-frequency bin of the first signal a and the output of the corresponding time-frequency bin of the second signal b always differ greatly, one being far larger than the other. A time-frequency bin is a point in a coordinate system whose Y axis is frequency (Hz) and whose X axis is time; its value is the amplitude of the signal, in dB. This is shown in the following table:

Frequency (Hz)	20	40	60	……
First signal a	a1	a2	a3	……
Second signal b	b1	b2	b3	……

According to the sparsity principle, when the first signal a and the second signal b are compared, a1 denotes the output of one time-frequency bin of the time-frequency transform STFT (short-time Fourier transform) of the first signal a at a frequency of 20 Hz, and b1 denotes the output of one time-frequency bin of the STFT of the second signal b at 20 Hz, so a1 and b1 are complex numbers. It always holds that either |a1| ≫ |b1| and b1 ≈ 0, or |a1| ≪ |b1| and a1 ≈ 0; the same applies to the remaining bins.
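The sparsity assumption can be checked bin by bin by comparing the magnitudes of the two complex STFT outputs. The following is a small sketch; the function name `dominant_signal` and the dominance ratio of 10 are hypothetical, since the text only says that one magnitude is "far larger" than the other.

```python
def dominant_signal(bin_a, bin_b, ratio=10.0):
    """Return 'a' or 'b' for the dominating complex bin, or None if neither does.

    Assumption: "far larger" is modelled as at least `ratio` times the other
    magnitude; the patent gives no numeric criterion.
    """
    mag_a, mag_b = abs(bin_a), abs(bin_b)
    if mag_a > 0 and mag_a >= ratio * mag_b:
        return "a"
    if mag_b > 0 and mag_b >= ratio * mag_a:
        return "b"
    return None
```

For example, `dominant_signal(3 + 4j, 0.01j)` reports that signal a dominates the bin, while two bins of equal magnitude yield `None`, which would indicate that the sparsity assumption does not hold for that bin.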

Therefore, using this sparsity principle, the virtual angle of the virtual source can be used to distinguish the two signals, so that the desired target signal can be separated and stored.

See Fig. 2 for details. Fig. 2 illustrates how to calculate the virtual angle θ_ab of the virtual source 40 that presents the amplitudes of the first signal a and the second signal b. A_a and A_b are the amplitudes of the first signal a and the second signal b, and the angle between the two signals ranges from -30° to 30°. In Fig. 2, the audio signals emitted by two loudspeakers, a left loudspeaker 10 and a right loudspeaker 20, are delivered to a listener 30 located midway between the two loudspeakers. At frequency k, the sound from the two loudspeakers that reaches the ears of the listener 30 can be represented by a virtual source 40 that presents the amplitudes A_a(k) and A_b(k) of the first signal a and the second signal b.

At frequency k, the virtual source presents the amplitudes A_a(k) and A_b(k) of the first signal a and the second signal b obtained after the time-frequency transform. The invention is of course not limited to processing a virtual source that presents the amplitudes of two signals; multiple signals can also be processed. If there are more than two signals, the known signal (the signal containing the target source) is paired with another signal formed by summing the other signals A_i: Sum(|A_i|) = Σ_i |A_i| = |A_1| + … + |A_I| (1 ≤ i ≤ I), and the virtual source that this pair presents is processed. In effect, this is still the processing of a virtual source presented by two signals.
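The reduction of several signals to a two-signal pair can be sketched in a few lines. The function name `pair_from_many` is hypothetical; the sketch only restates the sum Σ_i |A_i| given above for a single frequency bin.

```python
def pair_from_many(known_amp, other_amps):
    """Reduce one known signal plus I other signals to a two-signal pair.

    known_amp: amplitude at frequency k of the signal containing the target.
    other_amps: amplitudes A_1 .. A_I of the remaining signals at frequency k.
    Returns (known amplitude, summed amplitude of the others).
    """
    combined = sum(abs(a) for a in other_amps)
    return known_amp, combined
```

The returned pair can then be treated exactly like A_a(k) and A_b(k) in the two-signal case.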

Thus, according to the virtual-angle formula given above, the virtual angle θ_ab(k) of the virtual source at a given frequency k can be calculated; its value may be positive or negative. The virtual source is expressed by the amplitudes of the selected first signal a and second signal b together with the virtual angle: {A_a(k), A_b(k), θ_ab}. θ_ab serves as side information, namely the virtual angle of the first signal a and the second signal b, i.e. the virtual source angle. With the aid of this side information, the original signal can be analysed to determine whether it is the target source, namely the vocal.

After the virtual source angle has been calculated, within the fixed range of the angle between the two given signals (-30° to 30°), at a given frequency in a given frame, if the time-frequency bin output of the first signal a is larger than that of the second signal b (assuming the first signal a is the accompaniment and the second signal b is the vocal), the virtual angle θ_ab leans toward the first signal a; otherwise it leans toward the second signal b.

To simplify the decision, a predetermined angle threshold, for example zero degrees, can be chosen empirically between the directions of the two signals. When the virtual angle θ_ab exceeds the zero-degree threshold, the bin is classified as vocal, as shown in Fig. 2. Traversing the frequencies of every frame and classifying them in this way separates the target source from the mixed audio. Finally, the stored frequency-domain signal is converted to the time domain by the inverse time-frequency transform, and the target source signal, namely the vocal, is output.

Specifically, whether a signal is the vocal is decided from the virtual angle as follows: when the virtual angle θ_ab(k) of the virtual source is greater than the predetermined angle threshold, the first signal or second signal corresponding to the virtual source is regarded as the target source signal, and the frequency-domain signal of the target source signal is then extracted and stored.

The target source signal is extracted as follows (assuming the first signal a is the target source signal, i.e. A_a(k) contains the target source signal to be extracted):

S(k) = A_a(k) M(k),

where

M(k) is the target source extraction vector, T is the given threshold, and S(k) is the target source signal.
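The definition of M(k) is not reproduced in this text. Given the thresholding rule stated earlier, a natural reading is a binary mask that is 1 where θ_ab(k) exceeds T and 0 elsewhere; the sketch below assumes exactly that. The function names `extraction_mask` and `extract_target`, and the choice of a strict "greater than" comparison, are illustrative assumptions.

```python
def extraction_mask(angles_deg, threshold_deg):
    """Binary extraction vector M(k): 1 where the virtual angle exceeds T.

    Assumption: M(k) is a hard 0/1 mask; the patent's own definition of
    M(k) is missing from this text and may differ.
    """
    return [1.0 if theta > threshold_deg else 0.0 for theta in angles_deg]


def extract_target(amps_a, angles_deg, threshold_deg=0.0):
    """Apply S(k) = A_a(k) * M(k) to one frame of frequency bins."""
    mask = extraction_mask(angles_deg, threshold_deg)
    return [a * m for a, m in zip(amps_a, mask)]
```

For one frame with amplitudes `[1.0, 2.0, 3.0]` and virtual angles `[12.0, -4.0, 0.5]` degrees, a 3-degree threshold keeps only the first bin, while a zero-degree threshold keeps the first and third. The kept bins are what would be stored and later passed to the inverse transform.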

The conversion between the time domain and the frequency domain may use, without limitation, the Fourier transform, the wavelet transform, the MDCT, or similar methods.

Referring to Fig. 3, the present invention also aims to provide a device for extracting a target source from audio, comprising:

A virtual angle calculation module, for traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle θ_ab(k) of the virtual source corresponding to that frequency of the first signal and the second signal;

A target source signal storage module, for comparing the virtual angle θ_ab(k) with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

In a KTV system, the device of the invention can separate the vocal (the target source) from the mixed audio of a song's vocal and accompaniment, store it on its own, and output it. This provides a basis for accurately assessing a singer's actual singing level in the KTV system. When the device is used in a KTV system to extract the vocal from the mixed audio of vocal and accompaniment, the first signal is, for example, taken as the accompaniment signal and the second signal as the vocal signal. The virtual angle of the virtual source is calculated and compared with the predetermined angle threshold. Because the virtual angles of the virtual sources of vocal and accompaniment differ, a predetermined angle threshold can be set to tell them apart: according to the comparison result, the signal corresponding to the virtual source whose virtual angle meets the requirement is stored separately as the vocal signal. Processing every frame of the audio signal in the same way separates the original singer's vocal from the mixed audio of vocal and accompaniment and stores it.

The predetermined angle threshold is determined empirically and may be 5 degrees, 3 degrees, or another angle. Window functions of the same size or of different sizes may be selected as required, so as to minimize spectral energy leakage when the frequency-domain signal is segmented with the window function. When the stored frequency-domain signal of the target source signal is converted into a time-domain signal by the inverse time-frequency transform (for example the inverse short-time Fourier transform, ISTFT) and the target source time-domain signal is output, the same window size that was used for the corresponding segmentation must be used for the reconstruction.

θ_ab(k) denotes the virtual angle of the virtual source, A_a(k) and A_b(k) denote the amplitudes at frequency k presented by the first signal and the second signal respectively, and the angle term denotes the angle between the first signal and the second signal.

Specifically, whether a signal is the vocal is decided from the virtual angle as follows: when the virtual angle θ_ab(k) of the virtual source is greater than the predetermined angle threshold, the first signal or second signal corresponding to the virtual source is regarded as the target source signal, and the frequency-domain signal of the target source signal is then extracted and stored.

For an explanation of the virtual source and the virtual angle, refer to the foregoing description of how to calculate the virtual angle of the virtual source that presents the amplitudes of the first signal a and the second signal b, and to Fig. 2.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for extracting a target source from audio, characterized by comprising the steps of:

Performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal, and segmenting the frequency-domain signal with a window function to form a first signal and a second signal;

Traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle of the virtual source corresponding to that frequency of the first signal and the second signal;

Comparing the virtual angle with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

Converting the stored frequency-domain signal of the target source signal into a time-domain signal by an inverse time-frequency transform, and outputting the target source time-domain signal.
2. The method for extracting a target source from audio according to claim 1, characterized in that the virtual angle of the virtual source is calculated as follows:

θ_ab(k) denotes the virtual angle of the virtual source corresponding to frequency k of the first signal a and the second signal b, A_a(k) and A_b(k) denote the amplitudes at frequency k presented by the first signal a and the second signal b respectively, and the angle term denotes the angle between the first signal a and the second signal b.
3. The method for extracting a target source from audio according to claim 2, characterized in that when the virtual angle of the virtual source is greater than the predetermined angle threshold, the first signal or second signal corresponding to the virtual source is regarded as the target source signal, and the frequency-domain signal of the target source signal is then extracted and stored.
4. The method for extracting a target source from audio according to claim 2, characterized in that if the first signal a is the target source signal, the target source signal is extracted as follows:

S(k) = A_a(k) M(k),

where

M(k) is the target source extraction vector, T is the given threshold, and S(k) is the target source signal.
5. A device for extracting a target source from audio, characterized by comprising:

A time-frequency transform and segmentation module, for performing a time-frequency transform on the collected audio signal frame by frame to convert the time-domain signal into a frequency-domain signal and form a first signal and a second signal;

A virtual angle calculation module, for traversing each frame of the frequency-domain signal and calculating, at each given frequency, the virtual angle of the virtual source corresponding to that frequency of the first signal and the second signal;

A target source signal storage module, for comparing the virtual angle with a predetermined angle threshold, taking the first signal or the second signal as the target source signal according to the comparison result, and extracting and storing the frequency-domain signal of the target source signal;

An inverse transform and output module, for converting the stored frequency-domain signal of the target source signal into a time-domain signal by an inverse time-frequency transform, and outputting the target source time-domain signal.
6. The device for extracting a target source from audio according to claim 5, characterized in that the virtual angle of the virtual source is calculated as follows:

θ_ab(k) denotes the virtual angle of the virtual source corresponding to frequency k of the first signal a and the second signal b, A_a(k) and A_b(k) denote the amplitudes at frequency k presented by the first signal a and the second signal b respectively, and the angle term denotes the angle between the first signal a and the second signal b.
7. The device for extracting a target source from audio according to claim 6, characterized in that when the virtual angle of the virtual source is greater than the predetermined angle threshold, the first signal or second signal corresponding to the virtual source is regarded as the target source signal, and the frequency-domain signal of the target source signal is then extracted and stored.
8. The device for extracting a target source from audio according to claim 6, characterized in that if the first signal a is the target source signal, the target source signal is extracted as follows:

S(k) = A_a(k) M(k),

where

M(k) is the target source extraction vector, T is the given threshold, and S(k) is the target source signal.