CN113990344A - Voiceprint feature-based multi-user voice separation method, equipment and medium - Google Patents

Voiceprint feature-based multi-user voice separation method, equipment and medium

Info

Publication number
CN113990344A
Authority
CN
China
Prior art keywords
audio
target speaker
mixed audio
spectrum
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111004878.9A
Other languages
Chinese (zh)
Inventor
沈莹
程诗丹
周子怡
张�林
赵生捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202111004878.9A priority Critical patent/CN113990344A/en
Publication of CN113990344A publication Critical patent/CN113990344A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a method, device, and medium for multi-speaker voice separation based on voiceprint features. The method comprises the following steps. S1: obtain the voiceprint feature X_ref of the target speaker, and extract the spectral feature X_mix of the mixed audio using the short-time Fourier transform. S2: concatenate the spectral feature X_mix of the mixed audio with the voiceprint feature X_ref of the target speaker to obtain the spectral feature X'_mix that references the voiceprint feature, and feed X'_mix into a dilated convolution layer that captures low-level audio features to obtain the input feature X_input of the speech separation model. S3: obtain a spectral mask from the speech separation model, multiply it with the spectral feature X_mix of the mixed audio to obtain the predicted clean audio spectrum of the target speaker, and recover the predicted clean audio of the target speaker in the time domain by referring to the phase spectrum of the mixed audio and applying the inverse short-time Fourier transform. Compared with the prior art, the method has the advantage of high voice separation accuracy.

Description

Voiceprint feature-based multi-user voice separation method, equipment and medium
Technical Field
The invention relates to the field of intelligent voice separation, and in particular to a voiceprint-feature-based multi-user voice separation method, device, and medium.
Background
Humans can listen selectively in the presence of multiple sound sources, but computers lack this ability. In everyday life, when our attention is focused on a conversation with a particular person or object, we usually ignore conversations among other people nearby and noise in the environment; this is known as the cocktail party effect. The cocktail party effect typically arises in two situations: ① when human attention is focused on a particular sound, for example on the sound source of a movie while watching it; ② when the human auditory organs are strongly stimulated, for example when the roar of an explosion makes people ignore other surrounding sounds.
In 1953, E. Colin Cherry posed the well-known Cocktail Party Problem: how to accurately track and recognize the speech of a specific speaker when multiple speakers are talking simultaneously against other background noise in the same space. The cocktail party problem can also be understood intuitively as an auditory version of the figure-ground problem in computer vision, where the sound of interest is the figure and the other sounds are the background. Currently, the cocktail party problem poses two challenging questions:
(1) How can the target speech signal be separated from the mixed speech signal?
(2) How can attention to a target sound source be tracked and maintained, and how can attention be switched between different sound sources?
In most cases these two challenges interact: tracking a target sound source can benefit from good speech separation, and speech separation can in turn benefit from tracking of the target sound source. In practice, current research aimed at solving the cocktail party problem focuses mainly on the first challenge, namely speech separation.
In real-world applications, voice interaction is usually one-to-one: the smart device only needs to attend to the voice signal emitted by the target speaker's sound source and can ignore other sound sources. Faced with the speech separation problem, the fundamental goal is therefore to separate the speech signal of a target speaker from a mixed speech signal composed of the speech of several speakers. However, most existing speech separation methods based on deep neural networks use only the spectral features of the mixed audio as model input and do not consider other speech features of the target speaker.
With the rise of multimodal machine learning, researchers have proposed speaker-independent audio-visual joint models to separate a target speech signal from a mixed speech signal, in which visual features are used to track the target speaker in the scene. Although these multimodal speech separation methods achieve good results, they require audio and visual information to be available at the same time, and visual information is difficult to obtain in many real-world voice interaction scenarios; multimodal methods therefore place high demands on the types of information available in the application scene, which limits their range of application.
The technical problem to be solved by the invention is thus: accurately separating the target speech signal from the mixed speech signal without visual or other non-speech information.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method, an apparatus, and a medium for separating a multi-person voice based on voiceprint features, which can accurately separate a target voice signal from a mixed voice signal based on voiceprint features.
The purpose of the invention can be realized by the following technical scheme:
According to a first aspect of the present invention, there is provided a method for separating multi-user voice based on voiceprint features, the method comprising the steps of:
step S1: voiceprint feature extraction, including obtaining the voiceprint feature X_ref of the target speaker and extracting the spectral feature X_mix of the mixed audio by short-time Fourier transform;
step S2: voiceprint feature fusion, including concatenating the spectral feature X_mix of the mixed audio with the voiceprint feature X_ref of the target speaker to obtain the spectral feature X'_mix referencing the voiceprint feature, and feeding X'_mix into a dilated convolution layer for capturing low-level audio features to obtain the input feature X_input of the speech separation model;
step S3: voice separation, including obtaining a spectral mask based on the speech separation model and multiplying the spectral mask with the spectral feature X_mix of the mixed audio to obtain the predicted clean audio spectrum of the target speaker; the predicted clean audio of the target speaker in the time domain is then obtained by referring to the phase spectrum of the mixed audio and applying the inverse short-time Fourier transform.
Preferably, obtaining the voiceprint feature X_ref of the target speaker in step S1 specifically comprises: inputting the reference audio of the target speaker into a voiceprint feature extractor, acquiring the Mel frequency cepstrum coefficients (MFCCs) of the target speaker, and taking the MFCCs as the voiceprint feature X_ref of the target speaker, which specifically comprises the following steps:
step S11: simultaneously, performing mute segment pruning on the reference audio and the mixed audio of the target speaker;
step S12: processing the reference audio without the mute section and the mixed audio without the mute section to ensure that the length of the reference audio is consistent with that of the mixed audio;
step S13: extracting Mel frequency cepstrum coefficients (MFCCs) from the silence-removed reference audio, and taking the first P dimensions as the voiceprint feature X_ref of the target speaker.
Preferably, step S12 is as follows: if the length of the silence-removed reference audio is smaller than that of the silence-removed mixed audio, the reference audio is cyclically spliced; if the length of the silence-removed reference audio is larger than that of the silence-removed mixed audio, the reference audio is trimmed, so that the length of the reference audio is consistent with that of the mixed audio; the silent segments are speech segments below 20 dB.
Preferably, in step S1, the short-time Fourier transform is applied to the mixed audio to extract the spectral feature X_mix of the mixed audio, which specifically comprises the following steps:
step S14: carrying out short-time Fourier transform on the mixed audio without the mute section by using the window size of 256 and the frame shift of 64, and simultaneously obtaining the amplitude spectrum and the phase spectrum of the mixed audio;
step S15: using the amplitude spectrum as the spectral feature X_mix of the mixed audio, and using the phase spectrum as the phase spectrum with which the separation model recovers the predicted clean audio of the target speaker.
Preferably, the dilated convolution layer in step S2 comprises a convolutional neural network CNN.
Preferably, the process by which the speech separation model obtains the spectral mask specifically comprises: using the deep clustering model DPCL to obtain embedding vectors based on the input feature X_input, and clustering the obtained embedding vectors with the K-Means algorithm to obtain the spectral mask.
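For illustration only, a minimal sketch of this clustering step is given below (assuming NumPy and scikit-learn; the array shapes and the helper name embeddings_to_masks are hypothetical, not taken from the patent):

```python
import numpy as np
from sklearn.cluster import KMeans

def embeddings_to_masks(embeddings, num_speakers=2):
    """Cluster per-bin embeddings into binary spectral masks.

    embeddings: array of shape (T, F, D), one D-dimensional vector per
    time-frequency bin, as produced by a DPCL-style network.
    Returns a list of num_speakers binary masks of shape (T, F).
    """
    T, F, D = embeddings.shape
    flat = embeddings.reshape(T * F, D)
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(flat)
    labels = labels.reshape(T, F)
    # K-Means assigns each time-frequency bin to exactly one cluster
    # (speaker), so the resulting masks are binary.
    return [(labels == k).astype(np.float32) for k in range(num_speakers)]
```

Because each bin is assigned to exactly one cluster, the masks produced this way are binary, matching the binary spectral mask described below.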
Preferably, the spectral mask is a binary spectral mask, i.e. each time-frequency bin in each spectrogram belongs to only one speaker.
Preferably, the deep clustering model DPCL comprises a bidirectional long short-term memory network BiLSTM.
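A possible shape of such a BiLSTM embedding network is sketched in PyTorch below; the hidden size, number of layers, and embedding dimension are illustrative assumptions rather than the patent's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCLEmbedder(nn.Module):
    """Maps an input feature sequence to unit-norm embeddings per T-F bin."""

    def __init__(self, num_freq_bins, embed_dim=20, hidden_size=300, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=num_freq_bins, hidden_size=hidden_size,
            num_layers=num_layers, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_size, num_freq_bins * embed_dim)
        self.num_freq_bins = num_freq_bins
        self.embed_dim = embed_dim

    def forward(self, x):              # x: (batch, T, num_freq_bins)
        h, _ = self.blstm(x)           # (batch, T, 2 * hidden_size)
        e = self.proj(h)               # (batch, T, num_freq_bins * embed_dim)
        e = e.view(x.size(0), x.size(1), self.num_freq_bins, self.embed_dim)
        return F.normalize(e, dim=-1)  # unit-norm embedding per T-F bin
```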
According to a second aspect of the present invention, there is provided an electronic device comprising a memory having stored thereon a computer program and a processor implementing the method when executing the program.
According to a third aspect of the invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method described above.
Compared with the prior art, the invention has the following advantages:
1) The invention provides a voiceprint-feature-based multi-speaker voice separation method that concatenates the voiceprint features of the target speaker, extracted from the target speaker's reference audio, with the spectral features of the mixed audio as supplementary features, and separates the clean audio of the target speaker from the mixed audio through a deep neural network model, improving to a certain extent the accuracy with which the speech separation model predicts the target speaker's clean audio;
2) Compared with the d-vector feature, the Mel frequency cepstrum coefficients (MFCCs) used here are faster to extract and generalize better, whereas the features extracted by a d-vector depend to a great extent on the choice of the training data set;
3) Compared with audio-visual joint models that rely on visual information, the proposed method needs only audio; such models place high demands on visual information, and acquiring high-quality visual information in daily life is more complicated than acquiring audio information.
Drawings
FIG. 1 is a flowchart of the multi-user voice separation method based on voiceprint features according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
As shown in Fig. 1, which is a workflow diagram of the voiceprint-feature-based multi-user voice separation method, the method comprises the following steps:
step S1: inputting the reference audio of the target speaker into the voiceprint feature extractor and obtaining the Mel frequency cepstrum coefficients of the target speaker as the voiceprint feature X_ref of the target speaker; at the same time, applying the short-time Fourier transform to the mixed audio to extract the spectral feature X_mix of the mixed audio;
Step S11: simultaneously performing mute segment (less than 20db) pruning on the reference audio and the mixed audio of the target speaker;
step S12: if the length of the silence-removed reference audio is smaller than that of the silence-removed mixed audio, cyclically splicing the reference audio; if the length of the silence-removed reference audio is larger than that of the silence-removed mixed audio, trimming the reference audio, so as to ensure that the length of the reference audio is consistent with the length of the mixed audio;
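As an illustration of steps S11 and S12 above, a minimal sketch follows (assuming librosa; interpreting the 20 dB silence threshold as librosa's top_db parameter is an assumption about the original implementation):

```python
import numpy as np
import librosa

def remove_silence(audio, top_db=20):
    """Keep only the non-silent intervals; segments more than top_db below
    the peak level are treated as silence and dropped."""
    intervals = librosa.effects.split(audio, top_db=top_db)
    return np.concatenate([audio[s:e] for s, e in intervals])

def match_length(reference, mixed):
    """Loop-splice or trim the silence-removed reference audio so that its
    length equals the length of the silence-removed mixed audio (step S12)."""
    if len(reference) < len(mixed):
        repeats = int(np.ceil(len(mixed) / len(reference)))
        reference = np.tile(reference, repeats)
    return reference[:len(mixed)]
```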
step S13: extracting Mel frequency cepstrum coefficients (MFCCs) from the silence-removed reference audio with a window size of 256, a frame shift of 64, and 40 mel triangular filters, and taking the first 13 dimensions as the voiceprint feature X_ref of the target speaker. The Mel frequency cepstrum coefficients are a linear transform of the logarithmic energy spectrum on the nonlinear mel scale of sound frequency; on the mel scale, the human ear's perception of pitch is approximately linear in frequency, i.e. when the mel frequencies of two signals stand in a given ratio, the pitch difference perceived by the human ear is approximately in the same ratio. The mapping between the ordinary frequency scale f and the mel frequency scale mel(f) is:
mel(f) = 2595 · log10(1 + f / 700)
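A sketch of the MFCC extraction in step S13, using the window size of 256, frame shift of 64, 40 mel filters, and 13 retained dimensions stated above (librosa is assumed as the feature-extraction library; the patent does not name a specific toolkit):

```python
import librosa

def extract_voiceprint(reference_audio, sr=16000):
    """Extract MFCCs from the silence-removed reference audio and keep the
    first 13 dimensions as the target speaker's voiceprint feature X_ref."""
    mfccs = librosa.feature.mfcc(
        y=reference_audio, sr=sr,
        n_mfcc=13,          # keep the first 13 coefficients
        n_mels=40,          # 40 mel triangular filters
        n_fft=256,          # window size 256
        hop_length=64)      # frame shift 64
    return mfccs.T          # shape: (num_frames, 13)
```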
step S14: carrying out short-time Fourier transform on the mixed audio without the mute section by using the window size of 256 and the frame shift of 64, and simultaneously obtaining the amplitude spectrum and the phase spectrum of the mixed audio;
step S15: using the amplitude spectrum as the spectral feature X_mix of the mixed audio, and keeping the phase spectrum as the phase spectrum with which the separation model recovers the predicted clean audio of the target speaker;
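Steps S14 and S15 can be sketched as follows (librosa assumed again); the magnitude spectrum becomes X_mix and the phase spectrum is kept for step S4:

```python
import numpy as np
import librosa

def extract_spectrum(mixed_audio):
    """Short-time Fourier transform of the silence-removed mixed audio
    with window size 256 and frame shift 64."""
    stft = librosa.stft(mixed_audio, n_fft=256, hop_length=64, win_length=256)
    magnitude = np.abs(stft)      # spectral feature X_mix
    phase = np.angle(stft)        # phase spectrum, reused in step S4
    return magnitude, phase
```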
step S2: firstly, concatenating the spectral feature X_mix of the mixed audio with the voiceprint feature X_ref of the target speaker to obtain the spectral feature X'_mix referencing the voiceprint feature; then feeding X'_mix into the dilated convolution layer for capturing low-level audio features to obtain the input feature X_input of the speech separation model. The dilated convolution layer consists of 8 convolutional neural network (CNN) layers; its parameter settings are shown in Table 1 and follow Google's VoiceFilter (Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking).
TABLE 1 Dilated convolution layer parameter settings
[Table 1 is provided only as an image in the original publication and is not reproduced here.]
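Since Table 1 is only available as an image, the following PyTorch sketch of the dilated convolution front end uses VoiceFilter-style kernel sizes and dilation rates as placeholder assumptions; only the depth of 8 convolutional layers is taken from the text:

```python
import torch
import torch.nn as nn

class DilatedConvFrontEnd(nn.Module):
    """8-layer dilated CNN turning the concatenated feature X'_mix into the
    separation model's input feature X_input. The kernel sizes and dilation
    rates below are assumptions modelled on VoiceFilter, not the patent's
    actual Table 1 values."""

    def __init__(self, channels=64):
        super().__init__()
        layers = []
        in_ch = 1
        # Growing dilation along the time axis, a common VoiceFilter-style pattern.
        dilations = [1, 1, 2, 4, 8, 16, 1, 1]
        for d in dilations:
            layers += [
                nn.Conv2d(in_ch, channels, kernel_size=(5, 5),
                          dilation=(1, d), padding=(2, 2 * d)),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, x_mix_prime):       # (batch, 1, freq, time)
        return self.net(x_mix_prime)      # X_input: (batch, channels, freq, time)
```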
Step S3: multiplying the binary spectrum mask obtained by the DPCL model with the spectrum of the mixed audio to obtain the spectrum of the predicted pure audio of the target speaker;
step S4: and (3) recovering the pure audio frequency of the predicted target speaker in the time domain by referring to the phase spectrum of the mixed audio frequency and combining short-time Fourier inverse transformation.
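Steps S3 and S4 together can be sketched as follows (librosa assumed; target_mask stands for the binary mask selected for the target speaker and has the same shape as the magnitude spectrogram):

```python
import numpy as np
import librosa

def reconstruct_target(magnitude, phase, target_mask):
    """Multiply the binary spectral mask with the mixed-audio magnitude
    spectrum, then recover the time-domain audio by combining the masked
    magnitude with the mixed-audio phase and applying the inverse STFT."""
    predicted_magnitude = magnitude * target_mask          # predicted clean spectrum
    complex_spec = predicted_magnitude * np.exp(1j * phase)
    return librosa.istft(complex_spec, hop_length=64, win_length=256)
```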
The method for separating the voices of multiple persons according to the voice print characteristics of the target speaker will be further described below with reference to specific experimental data.
Experimental conditions and scoring criteria:
the invention adopts a LibriSpeech data set provided by Daniel Povey, and the data set contains English reading speeches with sampling rate of 16000Hz of about 1000 hours; the model presented herein was trained using the library of library's train-clean-100 data subset, which contains about 100 hours of 251 speakers in total, with 126 for males and 125 for females; the proposed model was tested using the library of test-clear data, which contains approximately 5 hours of audio for 40 speakers, 20 bits for each of the male and female. The mixed audio required for training is generated by adopting a method of fusing pure audio, and in order to ensure the reliability of training, the audio in the LibriSpeech is normalized by using ffmpeg-normaze.
The invention adopts the Signal-to-Distortion Ratio (SDR) to evaluate the performance of the model. The SDR reflects the degree of similarity between the separated audio signal and the original audio signal, and is calculated as:
SDR = 10 · log10( ||s||² / ||s − ŝ||² )
where s denotes the target speech signal and ŝ denotes the output of the speech separation model; both are time-domain speech signals. The higher the similarity between the separated audio signal and the original audio signal, the smaller ||s − ŝ||² becomes, so the larger the SDR and the better the performance of the model.
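The SDR defined above can be computed directly from the time-domain signals, for example:

```python
import numpy as np

def sdr(target, estimate, eps=1e-8):
    """Signal-to-distortion ratio between the clean target signal s and the
    separation model's output s_hat (both 1-D time-domain arrays)."""
    target = np.asarray(target, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise = target - estimate
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + eps))
```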
The experimental results are as follows:
the method takes three Voice separation models of DPCL, PITNet and naked Voice Filter for removing a speaker recognition network as basic models, takes MFCCs and d-vector as voiceprint characteristics of a target speaker, and adopts a direct splicing method (DC), an expansion convolution after splicing method (EC) and an expansion convolution after splicing method (CE) to carry out comparison experiments by three methods of applying the voiceprint characteristics.
TABLE 2 SDR comparison before and after applying the reference voiceprint features
[Table 2 is provided only as an image in the original publication and is not reproduced here.]
(1) Horizontal comparison: no matter which voiceprint feature is applied and no matter which of the proposed methods of applying it is adopted, the approach clearly improves the performance of the speech separation model. The speech separation models that use MFCCs as the voiceprint feature of the target speaker outperform those that use the d-vector. MFCCs extract the voiceprint features of the target speaker from the perspective of traditional signal processing, so the obtained information is relatively raw and more complete, whereas the features extracted by a d-vector depend to a great extent on the training data set; the MFCC voiceprint features therefore generalize better and give better separation results.
(2) Among the speech separation models that apply voiceprint features: ① the DPCL model using MFCC voiceprint features with the concatenation-then-dilated-convolution method (CE) performs best, with an SDR of 12.430, an improvement of about 50% over the DPCL model without voiceprint features; ② the Voice Filter model applying the d-vector voiceprint feature with the dilated-convolution-then-concatenation method (EC) performs worst, with an SDR of only 8.159.
(3) Among the speech separation models that use MFCCs as the voiceprint feature of the target speaker: the DPCL model using the concatenation-then-dilated-convolution method (CE) performs best, with an SDR of 12.430; the Voice Filter model using the direct concatenation method performs worst, with an SDR of only 10.420, which is 2.01 lower than that of the best DPCL model. Except for the DPCL model, there is no obvious performance difference between the models using the EC and CE methods, and the presence of the dilated convolution layer improves the performance of the speech separation model by a small margin.
In summary, the proposed DPCL model using MFCC voiceprint features with the concatenation-then-dilated-convolution method (CE) performs best.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The electronic device of the present invention includes a Central Processing Unit (CPU) that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) or computer program instructions loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device can also be stored. The CPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
A plurality of components in the device are connected to the I/O interface, including: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; storage units such as magnetic disks, optical disks, and the like; and a communication unit such as a network card, modem, wireless communication transceiver, etc. The communication unit allows the device to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit performs the various methods and processes described above, such as methods S1-S3. For example, in some embodiments, the methods S1-S3 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more of the steps of methods S1-S3 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform methods S1-S3 in any other suitable manner (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a complex programmable logic device (CPLD), and the like.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-person voice separation method based on voiceprint features is characterized by comprising the following steps:
step S1: voiceprint feature extraction, including obtaining a voiceprint feature X_ref of a target speaker and extracting a spectral feature X_mix of the mixed audio by short-time Fourier transform;
step S2: voiceprint feature fusion, including concatenating the spectral feature X_mix of the mixed audio with the voiceprint feature X_ref of the target speaker to obtain a spectral feature X'_mix referencing the voiceprint feature, and feeding X'_mix into a dilated convolution layer for capturing low-level audio features to obtain an input feature X_input of a speech separation model;
step S3: voice separation, including obtaining a spectral mask based on the speech separation model and multiplying the spectral mask with the spectral feature X_mix of the mixed audio to obtain a predicted clean audio spectrum of the target speaker; and obtaining the predicted clean audio of the target speaker in the time domain by referring to the phase spectrum of the mixed audio and applying the inverse short-time Fourier transform.
2. The method for separating multi-user voice based on voiceprint features as claimed in claim 1, wherein obtaining the voiceprint feature X_ref of the target speaker in step S1 specifically comprises: inputting the reference audio of the target speaker into a voiceprint feature extractor, acquiring Mel frequency cepstrum coefficients (MFCCs) of the target speaker, and taking the MFCCs as the voiceprint feature X_ref of the target speaker, which specifically comprises the following steps:
step S11: simultaneously, performing mute segment pruning on the reference audio and the mixed audio of the target speaker;
step S12: processing the reference audio without the mute section and the mixed audio without the mute section to ensure that the length of the reference audio is consistent with that of the mixed audio;
step S13: extracting Mel frequency cepstrum coefficients (MFCCs) from the reference audio without the mute segments, and taking the first P dimensions as the voiceprint feature X_ref of the target speaker.
3. The method for separating multi-user voice based on voiceprint features according to claim 2, wherein said step S12 is: if the length of the reference audio without the mute section is smaller than that of the mixed audio without the mute section, circularly splicing the reference audio; if the length of the reference audio without the mute section is larger than that of the mixed audio without the mute section, pruning the reference audio to ensure that the length of the reference audio is consistent with that of the mixed audio; wherein the silence segments are speech segments below 20 dB.
4. The method for separating multi-user voice based on voiceprint features according to claim 1, wherein the spectral feature X_mix of the mixed audio is extracted by applying the short-time Fourier transform to the mixed audio in step S1, which specifically comprises the following steps:
step S14: carrying out short-time Fourier transform on the mixed audio without the mute section by using the window size of 256 and the frame shift of 64, and simultaneously obtaining the amplitude spectrum and the phase spectrum of the mixed audio;
step S15: using the amplitude spectrum as the spectral feature X_mix of the mixed audio, and using the phase spectrum as the phase spectrum with which the separation model recovers the predicted clean audio of the target speaker.
5. The method for separating multi-user voice based on voiceprint features according to claim 1, wherein the dilated convolution layer in step S2 comprises a convolutional neural network CNN.
6. The method for separating multi-user voice based on voiceprint features as claimed in claim 1, wherein the process by which the speech separation model obtains the spectral mask specifically comprises: using the deep clustering model DPCL to obtain embedding vectors based on the input feature X_input, and clustering the obtained embedding vectors with the K-Means algorithm to obtain the spectral mask.
7. The method as claimed in claim 6, wherein the spectral mask is a binary spectral mask, that is, each time-frequency bin in each spectrogram belongs to only one speaker.
8. The method of claim 6, wherein the deep clustering model DPCL comprises a bidirectional long short-term memory network BiLSTM.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the program, implements the method of any one of claims 1 to 8.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202111004878.9A 2021-08-30 2021-08-30 Voiceprint feature-based multi-user voice separation method, equipment and medium Pending CN113990344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111004878.9A CN113990344A (en) 2021-08-30 2021-08-30 Voiceprint feature-based multi-user voice separation method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111004878.9A CN113990344A (en) 2021-08-30 2021-08-30 Voiceprint feature-based multi-user voice separation method, equipment and medium

Publications (1)

Publication Number Publication Date
CN113990344A true CN113990344A (en) 2022-01-28

Family

ID=79735218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111004878.9A Pending CN113990344A (en) 2021-08-30 2021-08-30 Voiceprint feature-based multi-user voice separation method, equipment and medium

Country Status (1)

Country Link
CN (1) CN113990344A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024114303A1 (en) * 2022-11-30 2024-06-06 腾讯科技(深圳)有限公司 Phoneme recognition method and apparatus, electronic device and storage medium


Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
Al-Ali et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions
US9792898B2 (en) Concurrent segmentation of multiple similar vocalizations
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
Singh et al. Modulation spectral features for speech emotion recognition using deep neural networks
Chenchah et al. A bio-inspired emotion recognition system under real-life conditions
Soe Naing et al. Discrete Wavelet Denoising into MFCC for Noise Suppressive in Automatic Speech Recognition System.
Xia et al. Audiovisual speech recognition: A review and forecast
Salekin et al. Distant emotion recognition
Chao et al. Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR
CN113990344A (en) Voiceprint feature-based multi-user voice separation method, equipment and medium
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
Su et al. Robust audio copy-move forgery detection on short forged slices using sliding window
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
Матиченко et al. The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space
CN113782005B (en) Speech recognition method and device, storage medium and electronic equipment
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN113724697A (en) Model generation method, emotion recognition method, device, equipment and storage medium
CN113889081A (en) Speech recognition method, medium, device and computing equipment
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
Mon et al. Developing a speech corpus from web news for Myanmar (Burmese) language
Pandharipande et al. Front-end feature compensation for noise robust speech emotion recognition
Waghmare et al. A Comparative Study of the Various Emotional Speech Databases

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination