CN114333852A - Multi-speaker voice and human voice separation method, terminal device and storage medium - Google Patents

Info

Publication number: CN114333852A
Application number: CN202210017047.3A
Authority: CN (China)
Prior art keywords: voice, audio, speakers, model, separation
Legal status: Pending
Priority date / Filing date: 2022-01-07
Publication date: 2022-04-12
Other languages: Chinese (zh)
Inventors: 陈剑超, 肖龙源, 李稀敏, 叶志坚
Current Assignee: Xiamen Kuaishangtong Technology Co Ltd
Original Assignee: Xiamen Kuaishangtong Technology Co Ltd
Application filed by Xiamen Kuaishangtong Technology Co Ltd

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a multi-speaker voice and human voice separation method, a terminal device and a storage medium, wherein the method comprises the following steps: S1: collecting the voices of different speakers, superposing the voices of different speakers to generate mixed audio, taking the single audio clips used to form the mixed audio as the label audio for model training, and forming a training set from all the mixed audio and the corresponding label audio; S2: constructing a voice separation model that separates input mixed audio into the single audio corresponding to each speaker, and training the voice separation model on the training set so that the difference between the separated audio output by the model and the single audio used to form the input mixed audio is minimized; S3: separating audio containing multiple speakers with the trained voice separation model. The invention can separate the voices of multiple speakers even where they overlap, uses only one model, and does not require separately training a voiceprint extraction model or an audio clustering model.

Description

Multi-speaker voice and human voice separation method, terminal device and storage medium
Technical Field
The invention relates to the field of speech recognition, and in particular to a multi-speaker voice separation method, a terminal device and a storage medium.
Background
With the continuous development of speech recognition technology, more and more intelligent devices, such as smart speakers and smart phones, implement human-computer interaction through speech-related technologies such as speech recognition; through these devices, people can operate equipment more conveniently by voice.
In a conference scene, a recording pen is usually used to record the voices of all speakers, and after the conference the recording is converted into text for storage. However, because the same recording contains the voices of multiple speakers, it is impossible to tell which speaker uttered each sentence when converting the speech into text. A voice separation technique is therefore needed to distinguish the voices of the different speakers within the same audio segment, so that each speaker's speech can then be recognized separately.
Traditional voice separation adopts a clustering method based on voiceprint information: the audio is cut into equal-length segments, the speaker voiceprint information of each segment is extracted, the voiceprints of all segments are then classified, and segments belonging to the same speaker are spliced together to achieve speaker separation. This clustering-based approach has two problems: it cannot separate the parts where the voices of several speakers overlap, and the separation accuracy depends on the voiceprint extraction system; if the accuracy of the voiceprint extraction system is not high, the audio classification result suffers.
Disclosure of Invention
In order to solve the above problems, the present invention provides a multi-speaker voice and human voice separation method, a terminal device and a storage medium.
The specific scheme is as follows:
A multi-speaker voice and human voice separation method comprises the following steps:
S1: collecting the voices of different speakers, superposing the voices of different speakers to generate mixed audio, taking the single audio clips used to form the mixed audio as the label audio for model training, and forming a training set from all the mixed audio and the corresponding label audio;
S2: constructing a voice separation model that separates input mixed audio into the single audio corresponding to each speaker, and training the voice separation model on the training set so that the difference between the separated audio output by the model and the single audio used to form the input mixed audio is minimized;
S3: separating audio containing multiple speakers with the trained voice separation model.
Further, the specific method for collecting the voices of different speakers in step S1 is as follows: the voices of M speakers are collected with a recording pen; each speaker records L fixed-length audio clips, and the recorded content is the speaker reading fixed text.
Further, the method for superposing the voices of different speakers to generate mixed audio in step S1 is: randomly select N speakers from all speakers, randomly select one audio clip from the clips of each of the N speakers, and superpose the N selected clips by linear addition to generate a mixed audio.
Further, step S1 includes: the mixed audio in the training set is converted into one-dimensional audio data.
Further, the network structure of the human voice Separation model comprises three modules, namely an Encoder module, a Separation module and a Decoder module, wherein the Encoder module is used for encoding input mixed audio and converting one-dimensional audio data into a two-dimensional matrix structure; the Separation module is used for separating mixed audio of the two-dimensional matrix structure to generate a mask for separating the audio; the Decoder module is used for decoding the separated audio output by the Separation module and restoring the two-dimensional matrix structure into one-dimensional audio data.
Furthermore, the network structure of the Encoder module consists of a 1-D Conv network, and the output result of the input mixed audio after passing through the 1-D Conv network is combined with the input mixed audio to be used as the output result of the Encoder module; the 1-D Conv network is a convolutional network for encoding audio data.
Further, the 1-D Conv network comprises, in order, a 1 × 1 convolution layer, a PReLU activation function layer, a normalization layer, a convolution layer, a PReLU activation function layer, a normalization layer, and a 1 × 1 convolution layer.
Further, the input of the Separation module is the output of the Encoder module. The output of the Encoder module passes in turn through a normalization layer, a 1 × 1 convolution layer and several 1-D Conv networks; the outputs of all the 1-D Conv networks are superposed, and the superposed result passes in turn through a PReLU activation function layer, a 1 × 1 convolution layer and a Sigmoid activation function layer. The product of the output of the Sigmoid activation function layer and the output of the Encoder module is the output of the Separation module.
Further, the network structure of the Decoder module is composed of a 1-D Conv network, and the output result of the input Separation module after passing through the 1-D Conv network is used as the output result of the Decoder module.
The terminal equipment for separating the voice and the human voice of the multi-speaker comprises a processor, a memory and a computer program which is stored in the memory and can run on the processor, wherein the processor executes the computer program to realize the steps of the method of the embodiment of the invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above for an embodiment of the invention.
The invention adopts the above technical scheme: the constructed voice separation model separates multi-speaker audio in an end-to-end manner, can separate the voices of multiple speakers even where they overlap, uses only one model, and does not require separately training a voiceprint extraction model or an audio clustering model.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Fig. 2 is a schematic diagram showing a network structure of the model in this embodiment.
Fig. 3 is a schematic diagram showing a specific network structure of the model in this embodiment.
Fig. 4 is a schematic diagram of a network structure of the Encoder module in this embodiment.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
Embodiment one:
The embodiment of the invention provides a multi-speaker voice and human voice separation method, which comprises the following steps:
S1: collecting the voices of different speakers, superposing the voices of different speakers to generate mixed audio, taking the single audio used to form the mixed audio as the label audio for model training, and forming a training set from all the mixed audio and the corresponding label audio.
The collected speaker voices should be voices from the real scene. The specific method for collecting the voices of different speakers in this embodiment is as follows: the voices of M speakers are collected with a recording pen; each speaker records L fixed-length audio clips, and the recorded content is the speaker reading fixed text. M, L and the fixed length can be set as needed by those skilled in the art and are not limited here; in this embodiment M and L are both 100 and the fixed length is 10 seconds. For convenience of subsequent use, all collected voices are stored as single-channel 16 kHz audio, and the audio of each speaker is stored in its own folder.
In this embodiment, the method for superposing the voices of different speakers to generate mixed audio is: randomly select N speakers from all speakers, randomly select one audio clip from the clips of each of the N speakers, and superpose the N selected clips by linear addition to generate a mixed audio. The value of N can be set as needed by those skilled in the art and is not limited here; in this embodiment N is set to 10, so the single audio used to generate one mixed audio consists of the ten 10-second clips of the 10 selected speakers, and these single clips serve as the label audio for the subsequent model training. A total of 10,000 mixed audio clips are generated in this embodiment.
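The following is a minimal sketch, in Python, of the mixing step described above, assuming the per-speaker recordings have already been loaded as equal-length one-dimensional numpy arrays; the names `speaker_clips` and `make_mixture` are illustrative, while the default of 10 speakers follows this embodiment.

```python
import random
import numpy as np

def make_mixture(speaker_clips, n_speakers=10):
    """Pick n_speakers at random, one clip each, and superpose them by linear addition.

    speaker_clips: dict mapping speaker id -> list of equal-length 1-D numpy arrays.
    Returns (mixture, labels), where labels are the single-speaker clips used,
    i.e. the label audio for training.
    """
    chosen = random.sample(list(speaker_clips), n_speakers)
    labels = [random.choice(speaker_clips[spk]) for spk in chosen]
    mixture = np.sum(np.stack(labels, axis=0), axis=0)   # linear addition of the N clips
    return mixture, labels
```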
Further, since the input of the model needs to be one-dimensional sample data, this embodiment also converts the mixed audio in the training set into one-dimensional audio data before model training. In this embodiment the conversion is done by reading the audio with Python's Pydub library.
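As a sketch of this conversion, the Pydub reading step might look as follows; forcing the clip to mono 16 kHz and scaling the samples to [-1, 1] are assumptions consistent with the storage format described above, not steps spelled out in the patent.

```python
import numpy as np
from pydub import AudioSegment

def load_audio_1d(path):
    """Read an audio file with Pydub and return it as one-dimensional sample data."""
    seg = AudioSegment.from_file(path).set_channels(1).set_frame_rate(16000)
    samples = np.array(seg.get_array_of_samples(), dtype=np.float32)
    return samples / float(1 << (8 * seg.sample_width - 1))   # scale integer samples to [-1, 1]
```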
S2: constructing a voice separation model that separates input mixed audio into the single audio corresponding to each speaker, and training the voice separation model on the training set so that the difference between the separated audio output by the model and the single audio used to form the input mixed audio is minimized.
In this embodiment, the voice separation model is a Conv-TasNet model. As shown in figs. 2 and 3, the network structure of the Conv-TasNet model comprises three modules: an Encoder module, a Separation module and a Decoder module. The Encoder module encodes the mixed audio data, converting the one-dimensional audio data into a two-dimensional matrix structure; the Separation module separates the mixed audio in its two-dimensional matrix form and generates the masks used to separate the audio; the Decoder module decodes the separated audio output by the Separation module, restoring the two-dimensional matrix structure to one-dimensional audio data.
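A high-level sketch of this three-module structure in TensorFlow is shown below. The channel count, kernel size and stride of the Encoder and Decoder are illustrative (the patent does not give them), and the mask network here is a single-layer placeholder standing in for the full Separation module detailed in the following paragraphs.

```python
import tensorflow as tf

class ConvTasNetSketch(tf.keras.Model):
    """Minimal Encoder / Separation / Decoder wiring for the model described above."""

    def __init__(self, num_speakers=10, enc_filters=128, kernel=16, stride=8):
        super().__init__()
        self.num_speakers = num_speakers
        # Encoder: a 1-D convolution turns the waveform into a 2-D (frames x channels) matrix.
        self.encoder = tf.keras.layers.Conv1D(enc_filters, kernel, strides=stride, padding="same")
        # Separation (placeholder): predicts one sigmoid mask per speaker over the encoded matrix.
        self.mask_net = tf.keras.layers.Conv1D(enc_filters * num_speakers, 1, activation="sigmoid")
        # Decoder: a transposed 1-D convolution maps each masked matrix back to a waveform.
        self.decoder = tf.keras.layers.Conv1DTranspose(1, kernel, strides=stride, padding="same")

    def call(self, mix):                                    # mix: (batch, samples, 1)
        enc = self.encoder(mix)                             # (batch, frames, filters)
        masks = self.mask_net(enc)                          # (batch, frames, speakers * filters)
        masks = tf.reshape(masks, (tf.shape(enc)[0], tf.shape(enc)[1],
                                   self.num_speakers, -1))  # (batch, frames, speakers, filters)
        separated = []
        for s in range(self.num_speakers):                  # mask, then decode, each speaker
            separated.append(self.decoder(masks[:, :, s, :] * enc))
        return tf.stack(separated, axis=1)                  # (batch, speakers, samples, 1)
```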
In the Encoder module, the input data is the one-dimensional mixed audio, and the mixed audio is encoded (i.e. the one-dimensional structure is converted into a two-dimensional matrix structure) by a convolutional network. In this embodiment this convolutional network is called the 1-D Conv network. As shown in fig. 4, the mixed audio input to the 1-D Conv network first passes through a convolution layer of size 1 × 1, then a PReLU activation function layer, then a normalization layer used to normalize the data, then in turn a convolution layer, a PReLU activation function layer and a normalization layer, and finally a convolution layer of size 1 × 1. The output of this last 1 × 1 convolution layer is the output of the 1-D Conv network, and the result of passing the input mixed audio through the 1-D Conv network is combined with the input mixed audio to form the output of the Encoder module.
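A sketch of this 1-D Conv block in TensorFlow follows. The middle kernel size, the use of layer normalization, and reading "combined with the input" as a residual addition are assumptions; the patent only fixes the order of the layers.

```python
import tensorflow as tf

def conv_1d_block(x, channels, kernel_size=3):
    """1-D Conv network as described above: 1x1 conv, PReLU, norm, conv, PReLU, norm, 1x1 conv,
    with the block input added back at the end (one reading of 'combined with the input')."""
    y = tf.keras.layers.Conv1D(channels, 1)(x)
    y = tf.keras.layers.PReLU(shared_axes=[1])(y)
    y = tf.keras.layers.LayerNormalization()(y)
    y = tf.keras.layers.Conv1D(channels, kernel_size, padding="same")(y)
    y = tf.keras.layers.PReLU(shared_axes=[1])(y)
    y = tf.keras.layers.LayerNormalization()(y)
    y = tf.keras.layers.Conv1D(x.shape[-1], 1)(y)          # project back to the input channel count
    return tf.keras.layers.Add()([x, y])
```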
The input of the Separation module is the output of the Encoder module. The output of the Encoder module passes in turn through a normalization layer, a convolution layer of size 1 × 1 and several 1-D Conv networks; the outputs of all the 1-D Conv networks are then superposed, and the superposed result passes in turn through a PReLU activation function layer, a convolution layer of size 1 × 1 and a Sigmoid activation function layer. The product of the output of the Sigmoid activation function layer and the output of the Encoder module is the output of the Separation module. In this embodiment, the output of the Sigmoid activation function layer is a mask (filter) for each of the 10 speakers predicted by the network; multiplying a mask with the mixed-audio feature data output by the Encoder module yields the speech feature data of the corresponding speaker, so the output of the whole Separation module contains the speech feature data of each of the 10 speakers.
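The Separation module might then be sketched as below, reusing `conv_1d_block` from the previous sketch. The number of stacked blocks and the hidden channel width are illustrative, and chaining the blocks before superposing their outputs is also an assumption, since the patent only says the outputs of all 1-D Conv networks are superposed.

```python
import tensorflow as tf

def separation_module(enc, num_speakers=10, num_blocks=8, hidden=128):
    """Normalization, 1x1 conv, stacked 1-D Conv blocks whose outputs are superposed,
    then PReLU, 1x1 conv and Sigmoid to obtain one mask per speaker, which is multiplied
    with the Encoder output to give per-speaker speech feature data."""
    filters = enc.shape[-1]
    x = tf.keras.layers.LayerNormalization()(enc)
    x = tf.keras.layers.Conv1D(hidden, 1)(x)
    block_outputs = []
    for _ in range(num_blocks):
        x = conv_1d_block(x, hidden)                       # block sketched after the Encoder description
        block_outputs.append(x)
    y = tf.keras.layers.Add()(block_outputs)               # superpose all 1-D Conv network outputs
    y = tf.keras.layers.PReLU(shared_axes=[1])(y)
    y = tf.keras.layers.Conv1D(filters * num_speakers, 1)(y)
    masks = tf.keras.layers.Activation("sigmoid")(y)
    masks = tf.keras.layers.Reshape((-1, num_speakers, filters))(masks)
    # per-speaker feature data = speaker mask x Encoder output
    return tf.keras.layers.Lambda(lambda t: t[0] * t[1][:, :, None, :])([masks, enc])
```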
The network structure of the Decoder module consists of a 1-D Conv network; the output of the Separation module, after passing through this 1-D Conv network, is the output of the Decoder module. In this embodiment, the speech feature data of each of the 10 speakers output by the Separation module is the input of the Decoder module; after all the speaker feature data pass through the 1-D Conv network, the output is the one-dimensional sample data of the speech audio of each of the 10 speakers, and the separation of the mixed audio by the model is complete.
The separation result of the model is the single-speaker speech audio data predicted by the model. For model training, this predicted single-speaker audio is compared with the original single-speaker audio, i.e. the label data: the difference between the two is computed, and this difference is used as the loss value for the iterative optimization of the model during training.
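The patent does not specify how the difference is computed; a minimal sketch using mean squared error is shown below, with scale-invariant SNR (often combined with permutation-invariant training) being another common choice for models of this kind.

```python
import tensorflow as tf

def separation_loss(labels, predictions):
    """Difference between the label audio and the separated audio, here simply the mean
    squared error averaged over speakers and samples.
    labels, predictions: tensors of shape (batch, num_speakers, samples)."""
    return tf.reduce_mean(tf.square(labels - predictions))
```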
During model training, the loss value is optimized iteratively with the AdamOptimizer in TensorFlow. In this embodiment, every 64 audio clips form a training batch, one epoch contains 100 batches, and training runs for 50 epochs, until the loss value settles into a stable range.
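A training-loop sketch under these settings might look as follows. The shapes of `train_mixtures` and `train_labels`, the squeeze of the trailing channel axis, and the use of the model and loss from the earlier sketches are assumptions rather than details given in the patent.

```python
import tensorflow as tf

# Assumed to exist from the sketches above:
#   model = ConvTasNetSketch(), separation_loss,
#   train_mixtures: float32 array of shape (num_clips, samples, 1),
#   train_labels:   float32 array of shape (num_clips, num_speakers, samples).
optimizer = tf.keras.optimizers.Adam()
dataset = (tf.data.Dataset.from_tensor_slices((train_mixtures, train_labels))
           .shuffle(10000)
           .batch(64))                                     # 64 clips per training batch

for epoch in range(50):                                    # 50 epochs
    for mix, labels in dataset.take(100):                  # 100 batches per epoch
        with tf.GradientTape() as tape:
            preds = tf.squeeze(model(mix, training=True), axis=-1)   # drop trailing channel axis
            loss = separation_loss(labels, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(f"epoch {epoch}: loss {loss.numpy():.4f}")       # watch the loss settle into a stable range
```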
S3: separating audio containing multiple speakers with the trained voice separation model.
After the fitted model is obtained from training, the trained model is taken as the final voice separation model; audio containing multiple speakers is input to the final model, and the output of the model is the separated speech audio data of each speaker.
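As a usage sketch, separating a new recording could look like this; `load_audio_1d` is the Pydub helper sketched earlier, the input file name is hypothetical, and `soundfile` is just one convenient way to write the separated tracks.

```python
import soundfile as sf

mix = load_audio_1d("meeting_recording.wav")                        # hypothetical multi-speaker recording
separated = model(mix.reshape(1, -1, 1), training=False).numpy()    # (1, num_speakers, samples, 1)
for i, wave in enumerate(separated[0, :, :, 0]):
    sf.write(f"speaker_{i}.wav", wave, 16000)                       # one file per separated speaker
```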
In the voice separation model adopted in this embodiment, all RNN structures are replaced by CNNs (a TCN takes the place of the LSTM), and depthwise separable convolution is used for the convolution operations to reduce the number of parameters and the amount of computation: the original single convolution is split into two convolution operations, which greatly reduces the parameter count and the computation.
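To illustrate the parameter saving, the sketch below compares a standard 1-D convolution with its depthwise-separable counterpart in TensorFlow; the 128-channel input and the 256-filter, kernel-3 settings are arbitrary example values.

```python
import tensorflow as tf

x = tf.keras.Input(shape=(1000, 128))                               # example: 128-channel input

standard = tf.keras.layers.Conv1D(256, 3, padding="same")           # one full convolution
separable = tf.keras.layers.SeparableConv1D(256, 3, padding="same") # depthwise conv + 1x1 pointwise conv

print(tf.keras.Model(x, standard(x)).count_params())    # 128*3*256 + 256 = 98,560 parameters
print(tf.keras.Model(x, separable(x)).count_params())   # 128*3 + 128*256 + 256 = 33,408 parameters
```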
This embodiment employs a network framework that addresses the shortcomings of STFT-domain speech separation, including the separate handling of phase and magnitude, the sub-optimal representation of mixed audio, and the high latency of computing the STFT. Furthermore, the network used in this embodiment has a smaller model size and a shorter minimum delay, which makes it suitable for low-resource, low-latency applications.
The end-to-end voice separation method adopted by the embodiment of the invention has good scene adaptability: audio recorded in the scene to be separated can be used to train the model, and the trained model can then separate other audio from the same kind of scene; for example, a model trained on data from conference room A can be used to separate mixed audio data from conference room B.
Embodiment two:
the invention also provides a terminal device for separating the voice and the human voice of the multiple speakers, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor executes the computer program to realize the steps of the method embodiment of the first embodiment of the invention.
Further, as an executable scheme, the multi-speaker voice and human voice separation terminal device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device, and may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is only an example of the multi-speaker voice and human voice separation terminal device and does not limit it; the device may include more or fewer components than listed, combine certain components, or use different components; for example, it may further include input/output devices, network access devices, a bus and so on, which is not limited in this embodiment of the invention.
Further, as an executable solution, the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, and the like. The general processor may be a microprocessor or the processor may be any conventional processor, etc., and the processor is a control center of the multi-speaker voice and voice separation terminal device, and various interfaces and lines are used to connect various parts of the whole multi-speaker voice and voice separation terminal device.
The memory can be used to store the computer program and/or modules, and the processor implements the various functions of the multi-speaker voice and human voice separation terminal device by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the device, and the like. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.
The integrated modules/units of the multi-speaker voice and human voice separation terminal device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method of the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the above method embodiments can be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (11)

1. A method for separating voice and human voice of multiple speakers, characterized by comprising the following steps:
S1: collecting the voices of different speakers, superposing the voices of different speakers to generate mixed audio, taking the single audio clips used to form the mixed audio as the label audio for model training, and forming a training set from all the mixed audio and the corresponding label audio;
S2: constructing a voice separation model that separates input mixed audio into the single audio corresponding to each speaker, and training the voice separation model on the training set so that the difference between the separated audio output by the model and the single audio used to form the input mixed audio is minimized;
S3: separating audio containing multiple speakers with the trained voice separation model.
2. The method of claim 1, wherein the method further comprises: the specific method for collecting the voices of different speakers in step S1 is as follows: the voices of M speakers are collected with a recording pen; each speaker records L fixed-length audio clips, and the recorded content is the speaker reading fixed text.
3. The method of claim 1, wherein the method further comprises: the method for superposing the voices of different speakers to generate mixed audio in step S1 is: randomly select N speakers from all speakers, randomly select one audio clip from the clips of each of the N speakers, and superpose the N selected clips by linear addition to generate a mixed audio.
4. The method of claim 1, wherein the method further comprises: step S1 further includes: the mixed audio in the training set is converted into one-dimensional audio data.
5. The method of claim 1, wherein the method further comprises: the network structure of the human voice Separation model comprises three modules, namely an Encoder module, a Separation module and a Decoder module, wherein the Encoder module is used for coding input mixed audio and converting one-dimensional audio data into a two-dimensional matrix structure; the Separation module is used for separating mixed audio of the two-dimensional matrix structure to generate a mask for separating the audio; the Decoder module is used for decoding the separated audio output by the Separation module and restoring the two-dimensional matrix structure into one-dimensional audio data.
6. The method of claim 5, wherein the method further comprises: the network structure of the Encoder module consists of a 1-D Conv network, and the output result of the input mixed audio after passing through the 1-D Conv network is combined with the input mixed audio to be used as the output result of the Encoder module; the 1-D Conv network is a convolutional network for encoding audio data.
7. The method of claim 6, wherein the method further comprises: the 1-D Conv network comprises a 1 × 1 sized convolution layer, a PReLU activation function layer, a normalization layer, a convolution layer, a PReLU activation function layer, a normalization layer and a 1 × 1 sized convolution layer in sequence.
8. The method of claim 5, wherein the method further comprises: the input of the Separation module is the output result of the Encoder module, the output result of the Encoder module sequentially passes through a normalization layer, a layer of convolution layer with the size of 1 x 1 and a plurality of 1-D Conv networks, then the output results of all the 1-D Conv networks are superposed, the superposed result sequentially passes through a PReLU activation function layer, a layer of convolution layer with the size of 1 x 1 and a Sigmoid activation function layer, and the result obtained by multiplying the output result of the Sigmoid activation function layer and the output result of the Encoder module is used as the output result of the Separation module.
9. The method of claim 5, wherein the method further comprises: the network structure of the Decoder module consists of a 1-D Conv network, and the output result of the input Separation module after passing through the 1-D Conv network is used as the output result of the Decoder module.
10. A multi-speaker voice and human voice separation terminal device, characterized in that: it comprises a processor, a memory and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
CN202210017047.3A (filed 2022-01-07) Multi-speaker voice and human voice separation method, terminal device and storage medium, Pending, published as CN114333852A

Priority Applications (1)

Application Number: CN202210017047.3A; Priority Date: 2022-01-07; Filing Date: 2022-01-07; Title: Multi-speaker voice and human voice separation method, terminal device and storage medium

Publications (1)

Publication Number: CN114333852A; Publication Date: 2022-04-12

Family ID: 81024404

Country Status (1)

CN: CN114333852A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743561A (en) * 2022-05-06 2022-07-12 广州思信电子科技有限公司 Voice separation device and method, storage medium and computer equipment
CN115132183A (en) * 2022-05-25 2022-09-30 腾讯科技(深圳)有限公司 Method, apparatus, device, medium, and program product for training audio recognition model
CN115132183B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and program product of audio recognition model
CN115171702A (en) * 2022-05-30 2022-10-11 青岛海尔科技有限公司 Digital twin voiceprint feature processing method, storage medium and electronic device
CN115579022A (en) * 2022-12-09 2023-01-06 南方电网数字电网研究院有限公司 Superposition sound detection method and device, computer equipment and storage medium
CN115810364A (en) * 2023-02-07 2023-03-17 海纳科德(湖北)科技有限公司 End-to-end target sound signal extraction method and system in sound mixing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination