CN116230001A - Mixed voice separation method, device, equipment and storage medium - Google Patents

Mixed voice separation method, device, equipment and storage medium

Info

Publication number
CN116230001A
CN116230001A (application CN202310246413.7A)
Authority
CN
China
Prior art keywords
voice
amplitude information
mixed
determining
separated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310246413.7A
Other languages
Chinese (zh)
Inventor
张雪 (Zhang Xue)
杨俊祥 (Yang Junxiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202310246413.7A
Publication of CN116230001A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G10L 21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for separating mixed voice. The method comprises the following steps: acquiring mixed voice to be separated; decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated; inputting the mixed amplitude information into a target voice separation model for voice separation processing, and obtaining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model; based on the time-frequency mask and the mixed voice to be separated, determining voice amplitude information and noise amplitude information; and determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information, so that voice separation accuracy can be improved, voice separation smoothness is optimized, and user experience is improved.

Description

Mixed voice separation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for separating mixed speech.
Background
With social progress and the advance of science and technology, intelligent voice systems can save manpower and material resources and make the separation and recognition of voice signals more convenient and rapid. In real life, however, voice signals inevitably mix with surrounding noise, which greatly reduces the voice recognition performance of the entire system.
Common voice separation methods include those based on statistical modeling, computational auditory scene analysis, blind source separation, and deep learning. However, existing voice separation methods extract limited features, achieve low separation accuracy and poor separation results, and cannot meet actual requirements.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for separating mixed voice, which are used for improving voice separation precision, optimizing voice separation smoothness and improving user experience.
According to an aspect of the present invention, there is provided a mixed speech separation method. The method comprises the following steps:
acquiring mixed voice to be separated;
decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated;
Inputting the mixed amplitude information into a target voice separation model for voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model;
based on the time-frequency mask and the mixed voice to be separated, determining voice amplitude information and noise amplitude information;
and determining a target voice and a target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
According to another aspect of the present invention, there is provided a mixed speech separation apparatus. The device comprises:
the mixed voice acquisition module is used for acquiring mixed voice to be separated;
the mixed information acquisition module is used for carrying out decomposition and transformation processing on the mixed voice to be separated to acquire mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated;
the time-frequency mask acquisition module is used for inputting the mixed amplitude information into a target voice separation model to perform voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model;
the amplitude information determining module is used for determining voice amplitude information and noise amplitude information based on the time-frequency mask and the mixed voice to be separated;
And the target voice determining module is used for determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of mixed speech separation according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute the method for mixed speech separation according to any one of the embodiments of the present invention.
According to the technical scheme, the mixed voice to be separated is obtained. And decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated. And inputting the mixed amplitude information into a target voice separation model to perform voice separation processing, and automatically determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model. And determining the voice amplitude information and the noise amplitude information based on the time-frequency mask and the mixed voice to be separated. And determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information, so that voice separation accuracy can be improved, voice separation smoothness is optimized, and user experience is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for separating mixed speech according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a method for separating mixed speech according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of a generative adversarial network according to the second embodiment of the present invention;
Fig. 4 is a schematic diagram of a generator according to the second embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a discriminator according to the second embodiment of the invention;
Fig. 6 is a block diagram of a mixed speech separation device according to a third embodiment of the present invention;
Fig. 7 is a block diagram of an electronic device implementing a mixed speech separation method according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a method for separating mixed speech according to an embodiment of the present invention, where the method may be performed by a mixed speech separation device, and the mixed speech separation device may be implemented in hardware and/or software, and the mixed speech separation device may be configured in an electronic device. As shown in fig. 1, the method includes:
s101, obtaining mixed voice to be separated.
S102, decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated.
The mixed speech to be separated may refer to mixed speech in which a plurality of speech signals are superimposed. The mixed amplitude information may refer to information corresponding to the amplitude spectrum of the mixed speech to be separated. The mixed phase information may refer to information corresponding to the phase spectrum of the mixed speech to be separated.
In particular, since a speech signal is short-term stationary, time-frequency-domain analysis combines the advantages of the time domain and the frequency domain, and converting the speech signal into the time-frequency domain is one of the common ways to enhance its sparseness. In the method, the mixed voice to be separated is divided into time frames using a short-time Fourier transform technique, a short-time Fourier transform is performed on each frame of the mixed voice signal to obtain short-time Fourier coefficients, and the modulus and phase angle of these coefficients are taken, yielding the mixed amplitude information and the mixed phase information corresponding to the mixed voice to be separated.
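For illustration, the decomposition step can be sketched in Python as follows; the sampling rate and the frame parameters are assumptions chosen for the example, not values specified in the patent.

```python
import numpy as np
from scipy.signal import stft

def decompose_mixed_speech(mixed_speech, fs=16000):
    """Frame the mixed speech, take the short-time Fourier transform, and split the
    coefficients into mixed amplitude information (modulus) and mixed phase
    information (angle). Frame length and overlap are assumptions."""
    _, _, coeffs = stft(mixed_speech, fs=fs, nperseg=512, noverlap=256)
    mixed_amplitude = np.abs(coeffs)   # modulus of the short-time Fourier coefficients
    mixed_phase = np.angle(coeffs)     # phase kept for the final reconstruction
    return mixed_amplitude, mixed_phase
```

The phase is carried through unchanged and reused in step S105, which is why it is returned alongside the amplitude.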
S103, inputting the mixed amplitude information into a target voice separation model for voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model.
The target speech separation model may be obtained by training a generative adversarial network. The target speech separation model may be used to obtain time-frequency masks corresponding to various types of mixed speech. The time-frequency mask may refer to the proportion of the voice amplitude information in the mixed speech.
Specifically, the obtained mixed amplitude information is input into a trained target voice separation model to perform voice separation processing. And determining the time-frequency mask corresponding to the mixed voice to be separated according to the output result of the target voice separation model.
Illustratively, the inputting the mixed amplitude information into a target speech separation model for speech separation processing, and determining, based on the output of the target speech separation model, a time-frequency mask corresponding to the mixed speech to be separated includes:
inputting the mixed amplitude information into the target voice separation model for voice separation processing, and obtaining predicted voice amplitude information and predicted noise amplitude information based on the output of the target voice separation model; and determining a time-frequency mask corresponding to the mixed voice to be separated according to the predicted voice amplitude information and the predicted noise amplitude information.
The predicted voice amplitude information may refer to the amplitude information of the voice amplitude spectrum that the target voice separation model predicts when performing voice separation processing on the mixed amplitude information. The predicted noise amplitude information may refer to the amplitude information of the noise amplitude spectrum predicted in the same way. The predicted voice amplitude information and the predicted noise amplitude information may be used to determine the time-frequency mask corresponding to the mixed speech to be separated.
Specifically, the mixed amplitude information is input into a target speech separation model for speech separation processing. According to the output of the target voice separation model, the predicted voice amplitude information and the predicted noise amplitude information can be obtained. According to the predicted voice amplitude information and the predicted noise amplitude information, a time-frequency mask corresponding to the mixed voice to be separated can be calculated.
Illustratively, the determining the time-frequency mask corresponding to the mixed speech to be separated according to the predicted voice amplitude information and the predicted noise amplitude information includes: determining a summation result of the predicted voice amplitude information and the predicted noise amplitude information; and determining the quotient result of the predicted voice amplitude information and the summation result as a time-frequency mask corresponding to the mixed voice to be separated.
Specifically, a result of summing the predicted voice amplitude information and the predicted noise amplitude information is calculated. And determining a quotient result between the predicted voice amplitude information and the summation result as a time-frequency mask corresponding to the mixed voice to be separated.
The time-frequency mask corresponding to the mixed voice to be separated can be computed as follows:

m = V_pred / (V_pred + N_pred)

where m is the time-frequency mask, V_pred may refer to the predicted voice amplitude information, and N_pred may refer to the predicted noise amplitude information.
S104, based on the time-frequency mask and the mixed voice to be separated, voice amplitude information and noise amplitude information are determined.
The voice amplitude information may refer to a voice amplitude signal obtained by performing separation processing on a mixed voice to be separated. The noise amplitude information may refer to a noise amplitude signal obtained by subjecting the mixed speech to be separated to separation processing.
Specifically, according to the time-frequency mask and the mixed speech to be separated, the voice amplitude information and the noise amplitude information can be respectively calculated and obtained.
Illustratively, the determining the voice amplitude information and the noise amplitude information based on the time-frequency mask and the mixed speech to be separated includes: determining the product of the time-frequency mask and the mixed voice to be separated as the voice amplitude information; and determining the difference value between the mixed voice to be separated and the voice amplitude information as the noise amplitude information.
Specifically, the product of the time-frequency mask and the mixed speech to be separated is calculated, and the result of the product is determined as the voice amplitude information. And calculating a difference value between the mixed voice to be separated and the voice amplitude information, and determining the difference value result as noise amplitude information.
Illustratively, the voice amplitude information and the noise amplitude information may be computed as follows:

V = m · X
N = X - V

where V may refer to the voice amplitude information, N may refer to the noise amplitude information, m may refer to the time-frequency mask, and X may refer to the mixed speech to be separated.
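For illustration, the mask and the separated amplitudes follow directly from the formulas above; the small constant eps is an assumption added to avoid division by zero.

```python
import numpy as np

def separate_amplitudes(pred_voice, pred_noise, mixed_amplitude, eps=1e-8):
    """m = pred_voice / (pred_voice + pred_noise); voice = m * X; noise = X - voice."""
    m = pred_voice / (pred_voice + pred_noise + eps)        # time-frequency mask
    voice_amplitude = m * mixed_amplitude                   # separated voice amplitude
    noise_amplitude = mixed_amplitude - voice_amplitude     # separated noise amplitude
    return m, voice_amplitude, noise_amplitude
```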
Optionally, in the embodiment of the present invention, the predicted voice amplitude information and the predicted noise amplitude information output by the target voice separation model may also be used directly as the voice amplitude information and the noise amplitude information, respectively. Preferably, determining the time-frequency mask from the predicted voice amplitude information and the predicted noise amplitude information, and then determining the voice amplitude information and the noise amplitude information from the time-frequency mask and the mixed voice to be separated, further smooths the separation result and improves the user experience.
S105, determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
The target voice may refer to the pure human voice obtained by separation. The target noise voice may refer to the noise in the mixed speech to be separated other than the target human voice.
Specifically, applying an inverse short-time Fourier transform to the voice amplitude information together with the mixed phase information yields the target voice, and applying it to the noise amplitude information together with the mixed phase information yields the target noise voice. Inverting the amplitude information with the mixed phase information converts it back into audible speech, which further improves the user experience.
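For illustration, the reconstruction step can be sketched as follows; the frame parameters must match the forward transform and repeat the assumptions of the decomposition sketch above.

```python
import numpy as np
from scipy.signal import istft

def reconstruct_waveforms(voice_amplitude, noise_amplitude, mixed_phase, fs=16000):
    """Recombine each separated amplitude with the mixed phase and invert the STFT
    to obtain the target voice and the target noise voice."""
    voice_spectrum = voice_amplitude * np.exp(1j * mixed_phase)
    noise_spectrum = noise_amplitude * np.exp(1j * mixed_phase)
    _, target_voice = istft(voice_spectrum, fs=fs, nperseg=512, noverlap=256)
    _, target_noise = istft(noise_spectrum, fs=fs, nperseg=512, noverlap=256)
    return target_voice, target_noise
```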
According to the technical scheme, the mixed voice to be separated is obtained. And decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated. And inputting the mixed amplitude information into a target voice separation model to perform voice separation processing, and automatically determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model. And determining the voice amplitude information and the noise amplitude information based on the time-frequency mask and the mixed voice to be separated. And determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information, so that voice separation accuracy can be improved, voice separation smoothness is optimized, and user experience is improved.
Example two
Fig. 2 is a flowchart of a method for separating mixed speech according to a second embodiment of the present invention, and the present embodiment discloses a training process of a target speech separation model based on the above embodiment. As shown in fig. 2, the method includes:
s201, acquiring expected voice amplitude information and expected noise amplitude information corresponding to a mixed amplitude information sample.
The mixed amplitude information sample may include a pure voice amplitude information sample, a pure noise amplitude information sample, and a voice and noise mixed amplitude information sample. Specifically, the mixed amplitude information sample, and the expected human voice amplitude information and the expected noise amplitude information corresponding to the mixed amplitude information sample can be obtained through a recording or synthesizing method.
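For illustration, one way to synthesize such a training sample is sketched below; the signal-to-noise ratio and the STFT parameters are assumptions, and the patent does not prescribe a particular recording or synthesis procedure.

```python
import numpy as np
from scipy.signal import stft

def make_training_sample(clean_voice, clean_noise, snr_db=5.0, fs=16000):
    """Scale the noise to an assumed SNR, mix it with the clean voice, and keep the
    STFT magnitudes as the mixed amplitude sample plus its expected voice and
    expected noise amplitude information."""
    gain = np.linalg.norm(clean_voice) / (np.linalg.norm(clean_noise) * 10 ** (snr_db / 20))
    mixture = clean_voice + gain * clean_noise

    def magnitude(x):
        return np.abs(stft(x, fs=fs, nperseg=512, noverlap=256)[2])

    return magnitude(mixture), magnitude(clean_voice), magnitude(gain * clean_noise)
```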
S202, inputting the mixed amplitude information sample into the generator, performing voice separation processing based on a voice dictionary, and obtaining a model separation result according to the output of the generator.
It should be noted that the target speech separation model may be obtained by training a generative adversarial network. Fig. 3 is a schematic diagram of a generative adversarial network according to an embodiment of the present invention. As shown in fig. 3, the generative adversarial network includes a generator and a discriminator. The generator is used for generating predicted amplitude information, and the discriminator is used for determining whether the predicted amplitude information meets actual requirements.
The speech dictionary may refer to the model parameters of the target speech separation model. The initial speech dictionary may be learned from clean human voice and clean noise speech using non-negative matrix factorization techniques. Fig. 4 is a schematic diagram of a generator according to an embodiment of the present invention. As shown in fig. 4, the generator may include a non-negative matrix factorization layer, a masking layer, and a reconstruction layer.
Specifically, the mixed amplitude information sample is input into the generator, and the generator performs a speech separation process on the mixed amplitude information sample based on a training function and a speech dictionary. From the output of the generator, a model separation result can be obtained.
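For illustration, a minimal Python sketch of such an NMF-based generator is given below. The multiplicative-update rule, the iteration count, and the exact arrangement of the masking and reconstruction layers are assumptions about how the structure of fig. 4 might be realized; they are not details taken from the patent.

```python
import numpy as np

def nmf_activations(V, W, n_iter=50, eps=1e-8):
    """Solve V ≈ W @ H for H with the dictionary W held fixed, using standard
    multiplicative updates (the iteration count is an assumption)."""
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    return H

def generator_separate(mixed_amplitude, W_voice, W_noise, eps=1e-8):
    """Hypothetical generator forward pass: an NMF layer over the joint speech
    dictionary, then a masking layer and a reconstruction layer that yield the
    predicted voice and noise amplitude information."""
    W = np.concatenate([W_voice, W_noise], axis=1)        # joint speech dictionary
    H = nmf_activations(mixed_amplitude, W)               # NMF layer
    k = W_voice.shape[1]
    voice_part = W_voice @ H[:k]                          # voice contribution
    noise_part = W_noise @ H[k:]                          # noise contribution
    mask = voice_part / (voice_part + noise_part + eps)   # masking layer
    pred_voice = mask * mixed_amplitude                   # reconstruction layer
    pred_noise = mixed_amplitude - pred_voice
    return pred_voice, pred_noise
```

Adjusting the speech dictionary during training then amounts to updating W_voice and W_noise.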
S203, inputting the expected voice amplitude information, the expected noise amplitude information, the model separation result and the mixed amplitude information sample into the discriminator, and determining a model discrimination result corresponding to the discriminator.
The model separation result comprises predicted voice amplitude information and predicted noise amplitude information. Fig. 5 is a schematic structural diagram of a discriminator according to the embodiment of the invention. As shown in fig. 5, the discriminator may include an input layer, a convolution layer, a feature layer, and a result discrimination layer.
Specifically, desired voice amplitude information, desired noise amplitude information, a model separation result, and a mixed amplitude information sample are input into a discriminator. The discriminator can determine the model discrimination result corresponding to the current training through comparison and judgment.
Illustratively, the determining the model discrimination result corresponding to the discriminator includes: splicing the predicted voice amplitude information, the predicted noise amplitude information and the mixed amplitude information sample to obtain spliced predicted amplitude information; and inputting the spliced prediction amplitude information into the discriminator to judge a prediction result, and determining the model discrimination result based on the output of the discriminator.
The splicing prediction amplitude information can be obtained by splicing prediction voice amplitude information, prediction noise amplitude information and mixed amplitude information samples.
Specifically, the predicted voice amplitude information, the predicted noise amplitude information and the mixed amplitude information sample are spliced to obtain the spliced predicted amplitude information. The spliced predicted amplitude information is input into the discriminator for prediction judgment, and the model discrimination result is determined based on the output of the discriminator.
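For illustration, the splicing and discrimination steps might look as follows in PyTorch. The concatenation axis, the channel counts and the kernel sizes are assumptions, since the patent only names the layer types of fig. 5.

```python
import torch
import torch.nn as nn

def splice(pred_voice, pred_noise, mixed_amplitude):
    """Splice predicted voice, predicted noise and mixed amplitude spectrograms;
    stacking along the frequency axis is an assumed splicing scheme."""
    return torch.cat([pred_voice, pred_noise, mixed_amplitude], dim=2)

class Discriminator(nn.Module):
    """Sketch of fig. 5: input layer, convolution layer, feature layer and
    result discrimination layer; all sizes below are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(                                              # convolution layer
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.feature = nn.AdaptiveAvgPool2d(1)                                  # feature layer
        self.result = nn.Linear(32, 1)                                          # result discrimination layer

    def forward(self, spliced_amplitude):                                       # input layer: (batch, 1, freq, frames)
        h = self.feature(self.conv(spliced_amplitude)).flatten(1)
        return torch.sigmoid(self.result(h))
```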
Illustratively, the determining the model discrimination result based on the output of the discriminator includes: calculating a comparison amplitude error between the spliced predicted amplitude information output by the discriminator and spliced expected amplitude information; under the condition that the contrast amplitude error is smaller than or equal to a preset amplitude error, determining that the model discrimination result meets a preset end condition; and under the condition that the comparison amplitude error is larger than a preset amplitude error, determining that the model judging result does not meet a preset ending condition.
The spliced expected amplitude information is obtained by splicing expected voice amplitude information, the expected noise amplitude information and the mixed amplitude information sample. The model discrimination result may include satisfaction of a preset end condition and non-satisfaction of the preset end condition. The preset amplitude error may be determined according to actual situations, and is not limited herein.
Specifically, a comparison amplitude error between the splicing predicted amplitude information and the splicing expected amplitude information is calculated, the comparison amplitude error is compared with a preset amplitude error, and under the condition that the comparison amplitude error is smaller than or equal to the preset amplitude error, the model discrimination result is determined to meet the preset end condition. And under the condition that the comparison amplitude error is larger than the preset amplitude error, determining that the model judgment result does not meet the preset end condition.
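For illustration, the end-condition test can be sketched as follows; taking the comparison amplitude error as a mean absolute difference, and the default threshold value, are assumptions.

```python
import numpy as np

def meets_end_condition(spliced_predicted, spliced_expected, preset_amplitude_error=0.05):
    """Compare the spliced predicted amplitude information with the spliced expected
    amplitude information; the error measure and default threshold are assumptions."""
    comparison_error = np.mean(np.abs(spliced_predicted - spliced_expected))
    return comparison_error <= preset_amplitude_error
```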
The generator and the discriminator are updated alternately by a batch gradient descent algorithm; while one is updated, the parameters of the other are fixed. The goal of the generative adversarial network is to minimize the divergence between the distributions of natural and generated voices. In the present invention, the mixed voice to be separated is additionally provided to the discriminator during training, which improves the smoothness of the separated voice and thus the separation effect.
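For illustration, the alternating update can be sketched as follows. The binary cross-entropy loss, the optimizers and the per-batch update order are assumptions, and `generator` and `discriminator` are assumed to be differentiable torch modules whose inputs and outputs follow the sketches above.

```python
import torch
import torch.nn as nn

def train_epoch(generator, discriminator, g_opt, d_opt, loader):
    """One epoch of alternating batch-gradient-descent updates: while one network is
    updated, the other's parameters are held fixed. Note that the mixed amplitude is
    spliced into both the real and the generated discriminator inputs."""
    bce = nn.BCELoss()
    for mixed_amp, expected_voice, expected_noise in loader:
        # 1) update the discriminator with the generator fixed
        pred_voice, pred_noise = generator(mixed_amp)
        real = discriminator(torch.cat([expected_voice, expected_noise, mixed_amp], dim=2))
        fake = discriminator(torch.cat([pred_voice.detach(), pred_noise.detach(), mixed_amp], dim=2))
        d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # 2) update the generator (i.e. the speech dictionary) with the discriminator fixed
        pred_voice, pred_noise = generator(mixed_amp)
        fake = discriminator(torch.cat([pred_voice, pred_noise, mixed_amp], dim=2))
        g_loss = bce(fake, torch.ones_like(fake))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```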
S204, adjusting the voice dictionary in the generator and continuously training the generator when the model judging result is that the preset ending condition is not met, or determining the generator obtained through training as a target voice separation model when the model judging result is that the preset ending condition is met.
Specifically, when the model discrimination result is that the preset end condition is not satisfied, it is required to adjust the speech dictionary in the generator and continue the training of the generator, so that the training result satisfies the preset end condition. Or under the condition that the model discrimination result meets the preset end condition, the prediction result of the generator can reach the expected level, and the generator obtained through training can be determined as the target voice separation model.
S205, obtaining the mixed voice to be separated.
S206, carrying out decomposition and transformation processing on the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated.
S207, inputting the mixed amplitude information into a target voice separation model for voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model.
S208, based on the time-frequency mask and the mixed voice to be separated, voice amplitude information and noise amplitude information are determined.
S209, determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
According to the technical scheme, the mixed amplitude information sample and the expected voice amplitude information and the expected noise amplitude information corresponding to the mixed amplitude information sample are obtained. And inputting the mixed amplitude information sample into the generator, performing voice separation processing based on a voice dictionary, and obtaining a model separation result according to the output of the generator. And inputting the expected voice amplitude information, the expected noise amplitude information, the model separation result and the mixed amplitude information sample into the discriminator, and determining a model discrimination result corresponding to the discriminator. And under the condition that the model judging result does not meet the preset ending condition, adjusting the voice dictionary in the generator, and continuously training the generator, or under the condition that the model judging result meets the preset ending condition, determining the generator obtained by training as a target voice separation model. By utilizing the mixed amplitude information sample and the expected human voice amplitude information and the expected noise amplitude information corresponding to the mixed amplitude information sample to carry out model training, the accuracy of voice separation of the target voice separation model can be ensured, and the user experience is improved.
Example III
Fig. 6 is a schematic structural diagram of a mixed-speech separation device according to a third embodiment of the present invention. As shown in fig. 6, the apparatus includes: a mixed speech acquisition module 301, a mixed information acquisition module 302, a time-frequency mask acquisition module 303, an amplitude information determination module 304, and a target speech determination module 305.
Wherein,
a mixed voice acquisition module 301, configured to acquire a mixed voice to be separated; the mixed information obtaining module 302 is configured to perform decomposition and transformation processing on the mixed speech to be separated, so as to obtain mixed amplitude information and mixed phase information corresponding to the mixed speech to be separated; a time-frequency mask obtaining module 303, configured to input the mixed amplitude information into a target speech separation model to perform speech separation processing, and determine a time-frequency mask corresponding to the mixed speech to be separated based on an output of the target speech separation model; an amplitude information determining module 304, configured to determine, based on the time-frequency mask and the mixed speech to be separated, voice amplitude information and noise amplitude information; the target voice determining module 305 is configured to determine a target voice and a target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
According to the technical scheme, the mixed voice to be separated is obtained. And decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated. And inputting the mixed amplitude information into a target voice separation model to perform voice separation processing, and automatically determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model. And determining the voice amplitude information and the noise amplitude information based on the time-frequency mask and the mixed voice to be separated. And determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information, so that voice separation accuracy can be improved, voice separation smoothness is optimized, and user experience is improved.
Optionally, the time-frequency mask acquisition module 303 includes a predicted amplitude information acquisition unit and a time-frequency mask determination unit. Wherein,
the predicted amplitude information acquisition unit is used for inputting the mixed amplitude information into the target voice separation model to perform voice separation processing, and acquiring predicted voice amplitude information and predicted noise amplitude information based on the output of the target voice separation model;
And the time-frequency mask determining unit is used for determining the time-frequency mask corresponding to the mixed voice to be separated according to the predicted voice amplitude information and the predicted noise amplitude information.
Alternatively, the time-frequency mask determining unit may be specifically configured to: determining a summation result of the predicted voice amplitude information and the predicted noise amplitude information; and determining the quotient result of the predicted voice amplitude information and the summation result as a time-frequency mask corresponding to the mixed voice to be separated.
Optionally, the target speech separation model is obtained by training a generative adversarial network, wherein the generative adversarial network comprises a generator and a discriminator; the apparatus includes a model training module. Wherein,
the model training module comprises: the device comprises an amplitude information sample acquisition unit, a generator training unit, a discriminator discriminating unit and a discriminating result determining unit. Wherein,,
the amplitude information sample acquisition unit is used for acquiring a mixed amplitude information sample and expected voice amplitude information and expected noise amplitude information corresponding to the mixed amplitude information sample;
the generator training unit is used for inputting the mixed amplitude information sample into the generator, performing voice separation processing based on a voice dictionary and obtaining a model separation result according to the output of the generator;
The discriminator discriminating unit is used for inputting the expected voice amplitude information, the expected noise amplitude information, the model separation result and the mixed amplitude information sample into the discriminator and determining a model discriminating result corresponding to the discriminator;
and the judging result determining unit is used for adjusting the voice dictionary in the generator and continuously training the generator under the condition that the model judging result does not meet the preset ending condition, or determining the generator obtained by training as a target voice separation model under the condition that the model judging result meets the preset ending condition.
Optionally, the model separation result includes predicted voice amplitude information and predicted noise amplitude information; the discriminator discriminating unit comprises a predicted amplitude information splicing subunit and a model discriminating result determining subunit. Wherein,
the predicted amplitude information splicing subunit is used for carrying out splicing processing on the predicted voice amplitude information, the predicted noise amplitude information and the mixed amplitude information sample to obtain spliced predicted amplitude information;
and the model discrimination result determining subunit is used for inputting the spliced prediction amplitude information into the discriminator to judge the prediction result and determining the model discrimination result based on the output of the discriminator.
Optionally, the model discrimination result determining subunit is specifically configured to:
calculating a comparison amplitude error between the spliced predicted amplitude information and the spliced expected amplitude information output by the discriminator, wherein the spliced expected amplitude information is obtained by splicing the expected voice amplitude information, the expected noise amplitude information and the mixed amplitude information sample;
under the condition that the contrast amplitude error is smaller than or equal to a preset amplitude error, determining that the model discrimination result meets a preset end condition;
and under the condition that the comparison amplitude error is larger than a preset amplitude error, determining that the model judging result does not meet a preset ending condition.
Optionally, the amplitude information determining module 304 is specifically configured to:
determining the product of the time-frequency mask and the mixed voice to be separated as the voice amplitude information;
and determining the difference value between the mixed voice to be separated and the voice amplitude information as the noise amplitude information.
The mixed voice separation device provided by the embodiment of the invention can execute the mixed voice separation method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 7 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 7, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the mixed speech separation method.
In some embodiments, the mixed speech separation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the mixed speech separation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the mixed speech separation method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of mixed speech separation comprising:
acquiring mixed voice to be separated;
decomposing and transforming the mixed voice to be separated to obtain mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated;
inputting the mixed amplitude information into a target voice separation model for voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model;
Based on the time-frequency mask and the mixed voice to be separated, determining voice amplitude information and noise amplitude information;
and determining a target voice and a target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
2. The method according to claim 1, wherein the inputting the mixed amplitude information into a target speech separation model for speech separation processing, and determining a time-frequency mask corresponding to the mixed speech to be separated based on an output of the target speech separation model, comprises:
inputting the mixed amplitude information into the target voice separation model for voice separation processing, and obtaining predicted voice amplitude information and predicted noise amplitude information based on the output of the target voice separation model;
and determining a time-frequency mask corresponding to the mixed voice to be separated according to the predicted voice amplitude information and the predicted noise amplitude information.
3. The method according to claim 2, wherein the determining the time-frequency mask corresponding to the mixed speech to be separated according to the predicted voice amplitude information and the predicted noise amplitude information includes:
Determining a summation result of the predicted voice amplitude information and the predicted noise amplitude information;
and determining the quotient result of the predicted voice amplitude information and the summation result as a time-frequency mask corresponding to the mixed voice to be separated.
4. The method of claim 1, wherein the target speech separation model is obtained by training a generative adversarial network, wherein the generative adversarial network comprises a generator and a discriminator; before the mixed voice to be separated is obtained, the method comprises the following steps:
acquiring expected voice amplitude information and expected noise amplitude information corresponding to a mixed amplitude information sample;
inputting the mixed amplitude information sample into the generator, performing voice separation processing based on a voice dictionary, and obtaining a model separation result according to the output of the generator;
inputting the expected voice amplitude information, the expected noise amplitude information, the model separation result and the mixed amplitude information sample into the discriminator, and determining a model discrimination result corresponding to the discriminator;
and under the condition that the model judging result does not meet the preset ending condition, adjusting the voice dictionary in the generator, and continuously training the generator, or under the condition that the model judging result meets the preset ending condition, determining the generator obtained by training as a target voice separation model.
5. The method of claim 4, wherein the model separation result comprises predicted voice amplitude information and predicted noise amplitude information;
the determining the model discrimination result corresponding to the discriminator comprises the following steps:
splicing the predicted voice amplitude information, the predicted noise amplitude information and the mixed amplitude information sample to obtain spliced predicted amplitude information;
and inputting the spliced prediction amplitude information into the discriminator to judge a prediction result, and determining the model discrimination result based on the output of the discriminator.
6. The method of claim 5, wherein determining the model discrimination result based on the output of the discriminator comprises:
calculating a comparison amplitude error between the spliced predicted amplitude information and the spliced expected amplitude information output by the discriminator, wherein the spliced expected amplitude information is obtained by splicing the expected voice amplitude information, the expected noise amplitude information and the mixed amplitude information sample;
under the condition that the contrast amplitude error is smaller than or equal to a preset amplitude error, determining that the model discrimination result meets a preset end condition;
And under the condition that the comparison amplitude error is larger than a preset amplitude error, determining that the model judging result does not meet a preset ending condition.
7. The method of claim 1, wherein the determining the voice amplitude information and the noise amplitude information based on the time-frequency mask and the mixed speech to be separated comprises:
determining the product of the time-frequency mask and the mixed voice to be separated as the voice amplitude information;
and determining the difference value between the mixed voice to be separated and the voice amplitude information as the noise amplitude information.
8. A mixed speech separation apparatus, comprising:
the mixed voice acquisition module is used for acquiring mixed voice to be separated;
the mixed information acquisition module is used for carrying out decomposition and transformation processing on the mixed voice to be separated to acquire mixed amplitude information and mixed phase information corresponding to the mixed voice to be separated;
the time-frequency mask acquisition module is used for inputting the mixed amplitude information into a target voice separation model to perform voice separation processing, and determining a time-frequency mask corresponding to the mixed voice to be separated based on the output of the target voice separation model;
The amplitude information determining module is used for determining voice amplitude information and noise amplitude information based on the time-frequency mask and the mixed voice to be separated;
and the target voice determining module is used for determining target voice and target noise voice based on the mixed phase information, the voice amplitude information and the noise amplitude information.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the mixed speech separation method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the mixed speech separation method of any one of claims 1-7.
CN202310246413.7A 2023-03-10 2023-03-10 Mixed voice separation method, device, equipment and storage medium Pending CN116230001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310246413.7A CN116230001A (en) 2023-03-10 2023-03-10 Mixed voice separation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310246413.7A CN116230001A (en) 2023-03-10 2023-03-10 Mixed voice separation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116230001A true CN116230001A (en) 2023-06-06

Family

ID=86576784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310246413.7A Pending CN116230001A (en) 2023-03-10 2023-03-10 Mixed voice separation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116230001A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756637A (en) * 2023-08-10 2023-09-15 暨南大学 Wireless signal intelligent detection and identification method and computer readable storage medium
CN116756637B (en) * 2023-08-10 2023-12-05 暨南大学 Wireless signal intelligent detection and identification method and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN112466288A (en) Voice recognition method and device, electronic equipment and storage medium
CN114282670A (en) Neural network model compression method, device and storage medium
JP7414907B2 (en) Pre-trained model determination method, determination device, electronic equipment, and storage medium
CN116230001A (en) Mixed voice separation method, device, equipment and storage medium
CN114781650B (en) Data processing method, device, equipment and storage medium
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN113657468A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN112634880A (en) Speaker identification method, device, equipment, storage medium and program product
CN117056728A (en) Time sequence generation method, device, equipment and storage medium
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN114998649A (en) Training method of image classification model, and image classification method and device
CN112598078B (en) Hybrid precision training method and device, electronic equipment and storage medium
CN115878783B (en) Text processing method, deep learning model training method and sample generation method
CN117251295B (en) Training method, device, equipment and medium of resource prediction model
CN115035911B (en) Noise generation model training method, device, equipment and medium
CN116776926B (en) Optimized deployment method, device, equipment and medium for dialogue model
CN116151215B (en) Text processing method, deep learning model training method, device and equipment
CN117198272B (en) Voice processing method and device, electronic equipment and storage medium
CN113838450B (en) Audio synthesis and corresponding model training method, device, equipment and storage medium
CN113807413B (en) Object identification method and device and electronic equipment
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
CN117351975A (en) Signal processing method, device, equipment and medium
CN116992150A (en) Research and development component recommendation method, device, equipment and storage medium
CN115662463A (en) Voice separation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination