CN113782047A - Voice separation method, device, equipment and storage medium


Info

Publication number: CN113782047A
Authority: CN (China)
Prior art keywords: channel, signal, angle deviation, time domain, noise reduction
Legal status: Granted
Application number: CN202111040658.1A
Other languages: Chinese (zh)
Other versions: CN113782047B (en)
Inventor: Dai Wei (戴玮), Guan Haixin (关海欣), Liang Jiaen (梁家恩)
Current assignee: Unisound Intelligent Technology Co Ltd
Original assignee: Unisound Intelligent Technology Co Ltd
Application filed by Unisound Intelligent Technology Co Ltd
Priority to CN202111040658.1A
Publication of CN113782047A
Application granted
Publication of CN113782047B
Current status: Active


Classifications

    • G10L 21/0272: Voice signal separating (speech enhancement, e.g. noise reduction or echo cancellation)
    • G01S 3/14: Systems for determining direction or deviation from predetermined direction (direction-finders)
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224: Processing in the time domain
    • G10L 2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166: Microphone arrays; beamforming


Abstract

The invention relates to a voice separation method, device, equipment and storage medium. A time-domain mixed voice signal is separated to obtain a time domain signal of a first channel and a time domain signal of a second channel. In order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel are selected and their mode is taken to obtain the direction estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel are selected and their mode is taken to obtain the direction estimation information of the second channel. The pitch angle deviation and the azimuth angle deviation of the first channel are calculated according to the direction estimation information of the first channel, and the pitch angle deviation and the azimuth angle deviation of the second channel are calculated according to the direction estimation information of the second channel. The deviations of the first channel and the second channel are compared, and the target sound source corresponding to each channel is determined according to the comparison result.

Description

Voice separation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice separation method, a voice separation device, voice separation equipment and a storage medium.
Background
In recent years, with the rapid development of speech recognition technology, there is an urgent demand for real-time speech separation in multi-channel speech recognition scenarios. For example, in one-on-one teaching, the voice of the student and the voice of the teacher need to be separated.
In the related art, blind source separation is usually adopted to separate mixed speech. However, the order of the output channels corresponding to the separated speech signals is uncertain, so a user is required to further determine the speech signal corresponding to each channel, which reduces speech separation efficiency.
Disclosure of Invention
The invention provides a voice separation method, device, equipment and storage medium to solve the technical problem in the prior art that the order of the output channels corresponding to the speech signals obtained by blind source separation is uncertain, requiring a user to further determine the speech signal corresponding to each channel and thereby reducing speech separation efficiency.
The technical scheme for solving the technical problems is as follows:
a method of speech separation comprising:
carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
respectively carrying out short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
in order of signal energy from high to low, selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation information of the second channel;
calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source;
and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
Further, in the voice separation method, before respectively performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further includes:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, respectively performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel includes:
respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel.
Further, in the voice separation method, before respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further includes:
respectively performing single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to eliminate background noise, obtaining a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;
correspondingly, respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel includes:
respectively performing short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel.
Further, the voice separation method further includes:
and when the pitch angle deviation is greater than the angular deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angular deviation threshold of the azimuth angle, updating the weight of the filter corresponding to the adaptive filtering algorithm.
Further, the voice separation method further includes:
and when the pitch angle deviation is smaller than or equal to the angle deviation threshold value and the azimuth angle deviation is smaller than or equal to the angle deviation threshold value of the azimuth angle, maintaining the weight value of the filter corresponding to the adaptive filtering algorithm unchanged.
Further, in the voice separation method, the adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized LMS (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
The invention also provides a voice separation device, comprising:
the first transformation module is used for carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
the separation module is used for separating the time-frequency domain mixed voice signal to obtain a separation signal of a first channel and a separation signal of a second channel;
the second transformation module is used for respectively carrying out short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
the direction estimation module is used for selecting, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and for selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation information of the second channel;
the deviation estimation module is used for calculating the pitch angle deviation and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and for calculating the pitch angle deviation and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
a determining module, configured to determine that the first channel is voice information of a first target sound source and the second channel is voice information of a second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
Further, in the above speech separation apparatus, the separation module is further configured to:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, the second transformation module is further configured to perform short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain the time domain signal of the first channel and the time domain signal of the second channel.
The present invention also provides a voice separating apparatus, comprising: a processor and a memory;
the processor is configured to execute the program of the voice separation method stored in the memory to implement the voice separation method described in any one of the above.
The present invention also provides a storage medium, wherein the storage medium stores one or more programs that when executed implement any of the above-described speech separation methods.
The invention has the beneficial effects that:
After the time-domain mixed voice signal is separated and the time domain signal of the first channel and the time domain signal of the second channel are obtained, energy judgment is incorporated: in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel are selected and their mode is taken to obtain the direction estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel are selected and their mode is taken to obtain the direction estimation information of the second channel. Then the pitch angle deviation and the azimuth angle deviation of the first channel are calculated according to the direction estimation information of the first channel, and the pitch angle deviation and the azimuth angle deviation of the second channel are calculated according to the direction estimation information of the second channel. If the pitch angle deviation of the first channel is not greater than that of the second channel, and/or the azimuth angle deviation of the first channel is not greater than that of the second channel, the first channel is determined to be the voice information of the first target sound source and the second channel the voice information of the second target sound source; if the pitch angle deviation of the first channel is greater than that of the second channel and the azimuth angle deviation of the first channel is greater than that of the second channel, the first channel is determined to be the voice information of the second target sound source and the second channel the voice information of the first target sound source. The voice signals are thus output in a determined channel order, which spares the user from further identifying the voice signal corresponding to each channel and improves voice separation efficiency.
Drawings
FIG. 1 is a flow chart of an embodiment of a speech separation method of the present invention;
FIG. 2 is a schematic diagram of a microphone array according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a speech separation apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of the voice separation device of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of an embodiment of a speech separation method of the present invention, and as shown in fig. 1, the speech separation method of the present embodiment may specifically include the following steps:
100. Carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
Fig. 2 is a schematic diagram of a microphone array according to the present invention. As shown in fig. 2, a reference direction may be set for the first sound source signal of the time-domain mixed voice signal received by the microphone array, for example a pitch angle θ of 30 degrees and an azimuth angle φ of 60 degrees. The second sound source signal of the time-domain mixed voice signal received by the microphone array may come from any direction.
In a specific implementation process, the microphone array can receive a time-domain mixed voice signal. Because the voice signal has a short-time stationary characteristic, it is generally converted into the short-time frequency domain for analysis and processing, so short-time Fourier transform is performed on the time-domain mixed voice signal to obtain a time-frequency domain mixed voice signal, which can be expressed as X(t, k), where t represents the frame index and k represents the frequency bin.
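For illustration, a minimal sketch of this step follows, assuming a 16 kHz recording from a 4-microphone array; the frame and hop sizes, and scipy as the STFT implementation, are illustrative assumptions rather than values taken from the patent.

```python
# Sketch of step 100: STFT of the multichannel time-domain mixture.
import numpy as np
from scipy.signal import stft

def stft_multichannel(x, fs=16000, nperseg=512, noverlap=384):
    """x: (num_mics, num_samples) time-domain mixture from the microphone array.
    Returns X(t, k) per mic, with shape (num_mics, num_freqs, num_frames)."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap, axis=-1)
    return X

mix = np.random.randn(4, 16000)   # stand-in for the received time-domain mixture
X = stft_multichannel(mix)
print(X.shape)                    # (4, 257, num_frames)
```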
101. Separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
in a specific implementation process, the time-frequency domain mixed speech signal may be separated by using a blind source separation algorithm to obtain a first channel separation signal and a second channel separation signal. For a specific separation method, reference may be made to the related art, which is not described herein again.
102. Respectively carrying out short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
In a specific implementation process, the separation signal of the first channel and the separation signal of the second channel may be subjected to short-time inverse Fourier transform, respectively, to obtain the time domain signal of the first channel and the time domain signal of the second channel.
103. In order of signal energy from high to low, selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation information of the second channel;
in one specific implementation, the pitch angle of each frame of each channel can be obtained by two-dimensional direction-of-arrival estimation
Figure BDA0003249086180000071
And azimuth angle
Figure BDA0003249086180000072
The capability of the speech signal of each frame can be obtained according to the calculation formula of the speech signal energy. Wherein the speech signal energy is calculated by
Figure BDA0003249086180000073
EiRepresenting speech signal energy, xi(t) represents a time domain signal of each channel of the current frame, and N represents a frame number.
In a specific implementation process, the two-dimensional direction-of-arrival estimates corresponding to the top 30% of frames of the time domain signal of the first channel may be selected and their mode taken to obtain the direction estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to the specified number of frames of the time domain signal of the second channel may be selected and their mode taken to obtain the direction estimation information of the second channel.
Specifically, after the two-dimensional direction-of-arrival estimates (pitch angle and azimuth angle) of all frames are obtained, the frames are sorted by energy from high to low, and the pitch angles and azimuth angles of the top 30% of frames with the highest energy are selected, yielding an array of pitch angles and an array of azimuth angles. Several angle ranges can be set in advance, such as 0-50, 50-100, and 100-180 degrees, and the mode is the range into which the most values in the array fall. For example, if values in the range 0-50 occur most often in the pitch angle array, the mode of the pitch angle of the channel is taken to lie in the range 0-50 degrees.
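A sketch of this selection-and-mode logic follows; the per-frame pitch and azimuth estimates are assumed to come from an external two-dimensional DOA estimator, which the patent does not specify, and the bins and the 30% ratio follow the example above.

```python
# Sketch of step 103: per-frame energy, top-30% selection, and modal angle bin.
import numpy as np

def modal_angle(angles, bins=(0, 50, 100, 180)):
    """Return the angle range (bin) into which most values fall: the 'mode'."""
    hist, edges = np.histogram(angles, bins=bins)
    i = int(np.argmax(hist))
    return float(edges[i]), float(edges[i + 1])

def channel_direction(frames, pitch, azimuth, ratio=0.3):
    """frames: (num_frames, frame_len) time domain frames of one channel;
    pitch/azimuth: per-frame 2-D direction-of-arrival estimates in degrees."""
    energy = np.sum(frames ** 2, axis=1)                   # E_i = sum_t x_i(t)^2
    top = np.argsort(energy)[::-1][: max(1, int(ratio * len(energy)))]
    return modal_angle(pitch[top]), modal_angle(azimuth[top])

frames = np.random.randn(200, 512)                         # synthetic frames
pitch = np.random.uniform(0, 180, 200)
azimuth = np.random.uniform(0, 180, 200)
print(channel_direction(frames, pitch, azimuth))
```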
104. Calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
in one implementation, the position estimation information of the first channel can be recorded as
Figure BDA0003249086180000081
The second channel of position estimation information may be written as
Figure BDA0003249086180000082
The pitch angle deviation of the first passage is
Figure BDA0003249086180000083
The azimuth deviation of the first channel is
Figure BDA0003249086180000084
Wherein, theta represents a reference pitch angle,
Figure BDA0003249086180000085
a reference azimuth is indicated.
105. Detecting whether the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and whether the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel; if both are greater, executing step 106; otherwise, executing step 107;
106. Determining that the first channel is the voice information of the second target sound source and the second channel is the voice information of the first target sound source.
That is, if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, it is determined that the first channel is the voice information of the second target sound source and the second channel is the voice information of the first target sound source.
107. Determining that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source.
And if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source.
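The decision in steps 105 to 107 reduces to comparing the deviations of the two channels. A minimal sketch follows, assuming the reference direction of the first target sound source from fig. 2 (pitch angle 30 degrees, azimuth angle 60 degrees):

```python
# Sketch of steps 104-107: compute angle deviations and assign channels.
def assign_channels(dir1, dir2, theta_ref=30.0, phi_ref=60.0):
    """dir1/dir2: (pitch, azimuth) estimates of channels 1 and 2 in degrees.
    Returns the target-source label of each channel."""
    d_theta1, d_phi1 = abs(dir1[0] - theta_ref), abs(dir1[1] - phi_ref)
    d_theta2, d_phi2 = abs(dir2[0] - theta_ref), abs(dir2[1] - phi_ref)
    # Swap only when channel 1 is farther away in BOTH angles (step 106);
    # otherwise keep the original order (step 107).
    if d_theta1 > d_theta2 and d_phi1 > d_phi2:
        return ("second target source", "first target source")
    return ("first target source", "second target source")

print(assign_channels((35.0, 70.0), (120.0, 150.0)))
```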
In the voice separation method of this embodiment, after the time-domain mixed voice signal is separated and the time domain signal of the first channel and the time domain signal of the second channel are obtained, energy judgment is incorporated: the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel are selected and their mode is taken to obtain the direction estimation information of the first channel, and the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel are selected and their mode is taken to obtain the direction estimation information of the second channel. Then the pitch angle deviation and the azimuth angle deviation of each channel are calculated according to its direction estimation information. If the pitch angle deviation of the first channel is not greater than that of the second channel, and/or the azimuth angle deviation of the first channel is not greater than that of the second channel, the first channel is determined to be the voice information of the first target sound source and the second channel the voice information of the second target sound source; if both deviations of the first channel are greater than those of the second channel, the first channel is determined to be the voice information of the second target sound source and the second channel the voice information of the first target sound source. The voice signals are thus output in a determined channel order, which spares the user from further identifying the voice signal corresponding to each channel and improves voice separation efficiency.
In a specific implementation process, before step 102 in the above embodiment (respectively performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel), the following steps may also be performed:
(1) Processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
(2) Comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
Specifically, after the primary noise reduction signal of the first channel is obtained, its energy may be compared with that of the time-domain mixed voice signal and the higher-energy signal selected; if the energy of the primary noise reduction signal of the first channel is lower than that of the time-domain mixed voice signal, the time-domain mixed voice signal is taken as the higher-energy signal. With the time-domain mixed voice signal as the reference, filtering is then performed using the adaptive filtering algorithm to obtain the primary noise reduction signal of the second channel. The adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized LMS (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
Correspondingly, performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain the time domain signal of the first channel and the time domain signal of the second channel includes: performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain the time domain signal of the first channel and the time domain signal of the second channel.
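As one concrete instance of the named adaptive filters, a minimal NLMS sketch follows; the filter length, step size, and the choice of reference and desired signals are illustrative assumptions.

```python
# Sketch of an NLMS adaptive filter (one of the LMS/NLMS/RLS options named above).
import numpy as np

def nlms(reference, desired, taps=64, mu=0.5, eps=1e-8):
    """Subtract the component of `desired` predictable from `reference`;
    the residual is the noise-reduced signal."""
    w = np.zeros(taps)
    out = np.zeros_like(desired, dtype=float)
    for n in range(taps, len(desired)):
        x = reference[n - taps:n][::-1]        # most recent samples first
        e = desired[n] - w @ x                 # a-priori error
        w += mu * e * x / (x @ x + eps)        # normalized LMS update
        out[n] = e
    return out

# e.g. reduce leakage of the second channel into the first:
# y1_denoised = nlms(reference=y2, desired=y1)
```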
In a specific implementation process, before performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to obtain the time domain signal of the first channel and the time domain signal of the second channel, the following step may also be performed:
(11) Performing single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to eliminate background noise, obtaining a final noise reduction signal of the first channel and a final noise reduction signal of the second channel.
Correspondingly, performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel respectively to obtain the time domain signal of the first channel and the time domain signal of the second channel includes: performing short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel respectively to obtain the time domain signal of the first channel and the time domain signal of the second channel.
In this embodiment, the energy determination and the adaptive filtering technology are combined to further perform denoising processing on the separated voice signals of each channel, so that the separated voice is cleaner.
In one specific implementation procedure, after step 104 (calculating the pitch angle deviation and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and those of the second channel according to the direction estimation information of the second channel), the following steps may also be performed: when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle, the weight of the filter corresponding to the adaptive filtering algorithm is updated; when the pitch angle deviation is less than or equal to the angle deviation threshold of the pitch angle and the azimuth angle deviation is less than or equal to the angle deviation threshold of the azimuth angle, the weight of the filter corresponding to the adaptive filtering algorithm is kept unchanged.
In a specific implementation process, a weight-update fitting function of the filter can be obtained by fitting the historical weights produced by the adaptive filtering algorithm's update rule. Before the filter is used, its weights are set according to the fitting function until the number of updates reaches a preset count m. At the m-th update, the actual weight computed by the update rule is compared with the weight given by the fitting function; if the error between them is within a preset range, the fitting function continues to be used to set the filter weights from the m-th to the 2m-th update. Otherwise, the weights are set by the update rule itself (updating the filter weights when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle), and after n further updates the fitting function is revised according to the n computed values and then used again to set the filter weights. In this way, repeatedly computing the pitch angle deviation between the pitch angle in the time-frequency domain mixed voice signal and the target direction, and the azimuth angle deviation between the azimuth angle and the target direction, can be avoided, improving efficiency and accuracy.
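A minimal sketch of the angle-gated update rule; the thresholds and the helper name nlms_update are illustrative assumptions:

```python
# Sketch of the gate deciding whether the adaptive filter adapts or freezes.
def should_update(d_theta, d_phi, theta_thresh=15.0, phi_thresh=15.0):
    """True -> update the adaptive filter weights; False -> freeze them."""
    return d_theta > theta_thresh or d_phi > phi_thresh

# Inside the per-frame processing loop:
# if should_update(d_theta1, d_phi1):
#     w = nlms_update(w, ...)   # e.g. the NLMS update sketched above
# else:
#     pass                      # keep the filter weights unchanged
```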
It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one device of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.
Fig. 3 is a schematic structural diagram of an embodiment of the speech separation apparatus of the present invention. As shown in fig. 3, the speech separation apparatus of this embodiment may include a first transformation module 20, a separation module 21, a second transformation module 22, a direction estimation module 23, a deviation estimation module 24, and a determination module 25.
The first transformation module 20 is used for carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
a separation module 21, configured to separate the time-frequency domain mixed voice signal to obtain a separation signal of a first channel and a separation signal of a second channel;
a second transformation module 22, configured to perform short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel, respectively, to obtain a time domain signal of the first channel and a time domain signal of the second channel;
the direction estimation module 23 is configured to select, in order of signal energy from high to low, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and take their mode to obtain the direction estimation information of the first channel, and to select the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and take their mode to obtain the direction estimation information of the second channel;
a deviation estimation module 24, configured to calculate the pitch angle deviation and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and to calculate the pitch angle deviation and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
a determining module 25, configured to determine that the first channel is the voice information of the first target sound source and the second channel is the voice information of the second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
In a specific implementation process, the separation module 21 is further configured to:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel. The adaptive filtering algorithm is any one of the least mean square (LMS) algorithm, the normalized LMS (NLMS) algorithm, and the recursive least squares (RLS) algorithm.
Correspondingly, the second transformation module 22 is further configured to perform short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel, respectively, to obtain the time domain signal of the first channel and the time domain signal of the second channel.
In a specific implementation process, the separation module 21 is further configured to: perform single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel respectively to eliminate background noise, obtaining a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;
correspondingly, the second transformation module 22 is further configured to perform short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel, respectively, to obtain the time domain signal of the first channel and the time domain signal of the second channel.
In a specific implementation process, the deviation estimation module 24 is further configured to update the weight of the filter corresponding to the adaptive filtering algorithm when the pitch angle deviation is greater than the angle deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angle deviation threshold of the azimuth angle; and to keep the weight of the filter corresponding to the adaptive filtering algorithm unchanged when the pitch angle deviation is less than or equal to the angle deviation threshold of the pitch angle and the azimuth angle deviation is less than or equal to the angle deviation threshold of the azimuth angle.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and specific implementation schemes thereof may refer to the method described in the foregoing embodiment and relevant descriptions in the method embodiment, and have beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 4 is a schematic structural diagram of the voice separation device of the present invention. As shown in fig. 4, the device of this embodiment may include: a processor 1010 and a memory 1020. Those skilled in the art will appreciate that the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively coupled to one another within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown in the figure) or may be external to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
In one specific implementation, the processor 1010 is configured to execute the program for speech separation stored in the memory 1020 to implement the speech separation method of the above-described embodiment.
The present invention also provides a storage medium storing one or more programs that when executed implement the voice separation method of the above embodiments.
Computer-readable media of the present embodiments, including both transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech separation, comprising:
carrying out short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
separating the mixed voice signals of the time-frequency domain to obtain a separation signal of a first channel and a separation signal of a second channel;
respectively carrying out short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain a time domain signal of the first channel and a time domain signal of the second channel;
in order of signal energy from high to low, selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the first channel and taking their mode to obtain the direction estimation information of the first channel, and selecting the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time domain signal of the second channel and taking their mode to obtain the direction estimation information of the second channel;
calculating the pitch angle deviation of the first channel and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and calculating the pitch angle deviation of the second channel and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel, and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the first target sound source, and the second channel is the voice information of the second target sound source;
and if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel, determining that the first channel is the voice information of the second target sound source, and the second channel is the voice information of the first target sound source.
2. The speech separation method according to claim 1, wherein before respectively performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further comprises:
processing the separation signal of the first channel and the separation signal of the second channel through an adaptive filtering algorithm to obtain a primary noise reduction signal of the first channel;
comparing the energy of the primary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing the higher-energy signal and the time-domain mixed voice signal through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a primary noise reduction signal of the second channel;
correspondingly, respectively performing short-time inverse Fourier transform on the separation signal of the first channel and the separation signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel comprises:
respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel.
3. The speech separation method according to claim 2, wherein before respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel, the method further comprises:
respectively performing single-channel noise reduction on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to eliminate background noise, obtaining a final noise reduction signal of the first channel and a final noise reduction signal of the second channel;
correspondingly, respectively performing short-time inverse Fourier transform on the primary noise reduction signal of the first channel and the primary noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel comprises:
respectively performing short-time inverse Fourier transform on the final noise reduction signal of the first channel and the final noise reduction signal of the second channel to obtain the time domain signal of the first channel and the time domain signal of the second channel.
4. The speech separation method of claim 1, further comprising:
and when the pitch angle deviation is greater than the angular deviation threshold of the pitch angle, or the azimuth angle deviation is greater than the angular deviation threshold of the azimuth angle, updating the weight of the filter corresponding to the adaptive filtering algorithm.
5. The speech separation method of claim 4, further comprising:
and when the pitch angle deviation is smaller than or equal to the angle deviation threshold value and the azimuth angle deviation is smaller than or equal to the angle deviation threshold value of the azimuth angle, maintaining the weight value of the filter corresponding to the adaptive filtering algorithm unchanged.
6. The speech separation method of claim 1, wherein the adaptive filtering algorithm is any one of a least mean square (LMS) algorithm, a normalized LMS (NLMS) algorithm, and a recursive least squares (RLS) algorithm.
7. A speech separation apparatus, comprising:
a first transformation module, configured to perform a short-time Fourier transform on the time-domain mixed voice signal received by the microphone array to obtain a time-frequency domain mixed voice signal;
a separation module, configured to separate the time-frequency domain mixed voice signal to obtain a separated signal of a first channel and a separated signal of a second channel;
a second transformation module, configured to perform an inverse short-time Fourier transform on the separated signal of the first channel and the separated signal of the second channel respectively to obtain a time-domain signal of the first channel and a time-domain signal of the second channel;
a direction estimation module, configured to select, in descending order of signal energy, the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the first channel and take their mode to obtain direction estimation information of the first channel, and to select the two-dimensional direction-of-arrival estimates corresponding to a specified number of frames of the time-domain signal of the second channel and take their mode to obtain direction estimation information of the second channel;
a deviation estimation module, configured to calculate the pitch angle deviation and the azimuth angle deviation of the first channel according to the direction estimation information of the first channel, and to calculate the pitch angle deviation and the azimuth angle deviation of the second channel according to the direction estimation information of the second channel;
a determining module, configured to determine that the first channel carries the voice information of a first target sound source and the second channel carries the voice information of a second target sound source if the pitch angle deviation of the first channel is not greater than the pitch angle deviation of the second channel and/or the azimuth angle deviation of the first channel is not greater than the azimuth angle deviation of the second channel; and to determine that the first channel carries the voice information of the second target sound source and the second channel carries the voice information of the first target sound source if the pitch angle deviation of the first channel is greater than the pitch angle deviation of the second channel and the azimuth angle deviation of the first channel is greater than the azimuth angle deviation of the second channel.
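A compact sketch of the direction estimation and determining modules of claim 7: take the mode of the per-frame two-dimensional DOA estimates over the highest-energy frames, then compare the per-channel angle deviations. All names are hypothetical, and how each deviation is measured (e.g. against an expected target direction) is not fixed by the claim:

```python
import numpy as np
from collections import Counter

def doa_mode(doa_per_frame, frame_energies, num_frames):
    """doa_per_frame: sequence of (pitch, azimuth) estimates, one per frame.
    Selects the num_frames highest-energy frames and returns the mode of
    their DOA estimates, as the direction estimation module describes."""
    top = np.argsort(frame_energies)[::-1][:num_frames]
    picked = [tuple(doa_per_frame[i]) for i in top]
    return Counter(picked).most_common(1)[0][0]

def assign_channels(dev1, dev2):
    """dev1, dev2: (pitch_deviation, azimuth_deviation) of each channel.
    Swap the channel-to-source mapping only when channel 1 deviates more
    than channel 2 in *both* angles (second branch of claim 7)."""
    if dev1[0] > dev2[0] and dev1[1] > dev2[1]:
        return ("source2", "source1")   # first channel -> second source
    return ("source1", "source2")
```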
8. The speech separation device of claim 7, wherein the separation module is further configured to:
processing the separated signal of the first channel and the separated signal of the second channel through an adaptive filtering algorithm to obtain a preliminary noise reduction signal of the first channel;
comparing the energy of the preliminary noise reduction signal of the first channel with the energy of the time-domain mixed voice signal, and processing whichever of the two signals has the higher energy, together with the time-domain mixed voice signal, through an adaptive filtering algorithm and a nonlinear noise reduction algorithm to obtain a preliminary noise reduction signal of the second channel;
correspondingly, the second transformation module is further configured to perform an inverse short-time Fourier transform on the preliminary noise reduction signal of the first channel and the preliminary noise reduction signal of the second channel respectively to obtain the time-domain signal of the first channel and the time-domain signal of the second channel.
9. A speech separation device, comprising: a processor and a memory;
the processor is configured to execute a program of the speech separation method stored in the memory, to implement the speech separation method of any one of claims 1 to 6.
10. A storage medium storing one or more programs which, when executed, implement the speech separation method of any one of claims 1 to 6.
CN202111040658.1A 2021-09-06 2021-09-06 Voice separation method, device, equipment and storage medium Active CN113782047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111040658.1A CN113782047B (en) 2021-09-06 2021-09-06 Voice separation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111040658.1A CN113782047B (en) 2021-09-06 2021-09-06 Voice separation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113782047A true CN113782047A (en) 2021-12-10
CN113782047B CN113782047B (en) 2024-03-08

Family ID=78841275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111040658.1A Active CN113782047B (en) 2021-09-06 2021-09-06 Voice separation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113782047B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103308889A (en) * 2013-05-13 2013-09-18 辽宁工业大学 Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment
US20150063069A1 (en) * 2013-08-30 2015-03-05 Honda Motor Co., Ltd. Sound processing device, sound processing method, and sound processing program
US20170251320A1 (en) * 2016-02-29 2017-08-31 Electronics And Telecommunications Research Institute Apparatus and method of creating multilingual audio content based on stereo audio signal
US20170251319A1 (en) * 2016-02-29 2017-08-31 Electronics And Telecommunications Research Institute Method and apparatus for synthesizing separated sound source
CN106373589A (en) * 2016-09-14 2017-02-01 东南大学 Binaural mixed voice separation method based on iteration structure
KR20180079975A (en) * 2017-01-03 2018-07-11 한국전자통신연구원 Sound source separation method using spatial position of the sound source and non-negative matrix factorization and apparatus performing the method
CN106847301A (en) * 2017-01-03 2017-06-13 东南大学 Binaural speech separation method based on compressed sensing and attitude information
CN107346664A (en) * 2017-06-22 2017-11-14 河海大学常州校区 Binaural speech separation method based on critical bands
WO2020042708A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Time-frequency masking and deep neural network-based sound source direction estimation method
CN110931036A (en) * 2019-12-07 2020-03-27 杭州国芯科技股份有限公司 Microphone array beam forming method
US11064294B1 (en) * 2020-01-10 2021-07-13 Synaptics Incorporated Multiple-source tracking and voice activity detections for planar microphone arrays
KR20210091034A (en) * 2020-01-10 2021-07-21 시냅틱스 인코포레이티드 Multiple-source tracking and voice activity detections for planar microphone arrays
CN113050035A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Two-dimensional directional pickup method and device
CN113053406A (en) * 2021-05-08 2021-06-29 北京小米移动软件有限公司 Sound signal identification method and device
CN113225441A (en) * 2021-07-09 2021-08-06 北京中电慧声科技有限公司 Conference telephone system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
时洁: "Research on near-field high-resolution localization and identification methods for underwater noise sources based on vector arrays", China Doctoral Dissertations Full-text Database, Engineering Science & Technology II, 15 February 2011 (2011-02-15), pages 028-12 *
李万龙: "Research on speech enhancement and separation methods based on microphone arrays", China Masters' Theses Full-text Database, Information Science & Technology, 15 January 2009 (2009-01-15), pages 136-92 *

Also Published As

Publication number Publication date
CN113782047B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
EP3347894B1 (en) Arbitration between voice-enabled devices
EP3703052A1 (en) Echo cancellation method and apparatus based on time delay estimation
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
CN109473118B (en) Dual-channel speech enhancement method and device
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
US11289109B2 (en) Systems and methods for audio signal processing using spectral-spatial mask estimation
US10347272B2 (en) De-reverberation control method and apparatus for device equipped with microphone
CN113053365B (en) Voice separation method, device, equipment and storage medium
KR20170053623A (en) Method and apparatus for enhancing sound sources
CN106558315B (en) Heterogeneous microphone automatic gain calibration method and system
CN110363748B (en) Method, device, medium and electronic equipment for processing dithering of key points
EP3839949A1 (en) Audio signal processing method and device, terminal and storage medium
CN107240396B (en) Speaker self-adaptation method, device, equipment and storage medium
CN110837758B (en) Keyword input method and device and electronic equipment
CN111048118B (en) Voice signal processing method and device and terminal
CN112951263B (en) Speech enhancement method, apparatus, device and storage medium
WO2023000444A1 (en) Method and apparatus for detecting noise of loudspeaker, and electronic device and storage medium
CN113160846B (en) Noise suppression method and electronic equipment
CN110992975B (en) Voice signal processing method and device and terminal
CN110890099A (en) Sound signal processing method, device and storage medium
US10650839B2 (en) Infinite impulse response acoustic echo cancellation in the frequency domain
CN113782047B (en) Voice separation method, device, equipment and storage medium
CN113496698B (en) Training data screening method, device, equipment and storage medium
CN111048096B (en) Voice signal processing method and device and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant