CN114598914A - Human voice separation method based on video, terminal device and storage medium

Human voice separation method based on video, terminal device and storage medium

Info

Publication number
CN114598914A
CN114598914A
Authority
CN
China
Prior art keywords
audio
video
voice separation
feature
model
Prior art date
Legal status
Pending
Application number
CN202210146711.4A
Other languages
Chinese (zh)
Inventor
陈剑超
肖龙源
李稀敏
叶志坚
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202210146711.4A priority Critical patent/CN114598914A/en
Publication of CN114598914A publication Critical patent/CN114598914A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a video-based human voice separation method, a terminal device and a storage medium. The method comprises the following steps: combining the audio corresponding to the video clips of any two different speakers with random noise into a mixed audio, and taking the mixed audio and the two groups of face information corresponding to the two video clips as training data; constructing a human voice separation model which performs feature extraction and feature processing on the mixed audio and the face information, combines the results into a joint feature, converts the joint feature through dimension conversion and a fully connected layer into two speaker features, multiplies each speaker feature with the mixed audio feature to obtain a feature spectrogram, and restores the feature spectrogram to audio data; after the human voice separation model is trained on the training set, human voice separation is performed on video clips containing face information and audio information by the trained model. The invention can extract the clean speech of a designated speaker from a video.

Description

Human voice separation method based on video, terminal device and storage medium
Technical Field
The present invention relates to the field of human voice separation, and in particular, to a video-based human voice separation method, a terminal device, and a storage medium.
Background
With the continuous development of video media technology, more and more information and content are presented in video form; on Internet video platforms, for example, countless videos are uploaded every day. To obtain the information in a video, a user can listen to what the speakers say, but the speech may not be heard clearly because the speaker may be in a noisy environment or several speakers may talk at the same time, which degrades the listening experience.
At present, most video platforms do not process the speaker's voice in a video and usually output the original sound directly, so the speech is easily degraded by environmental interference.
Disclosure of Invention
In order to solve the above problems, the present invention provides a video-based human voice separation method, a terminal device and a storage medium.
The specific scheme is as follows:
A video-based human voice separation method comprises the following steps:
S1: collecting fixed-length video clips of different speakers, wherein each video clip contains the face information and audio information of a single speaker;
S2: extracting the video clips of any two different speakers from all the video clips, randomly selecting a noise audio from an audio noise data set, combining the two audio tracks of the two extracted video clips with the selected noise audio, taking the resulting mixed audio together with the two groups of face information of the two extracted video clips as one piece of training data, and forming a training set from all the training data;
S3: constructing a human voice separation model, and training the human voice separation model through the training set to obtain a trained human voice separation model;
the human voice separation model performs feature extraction and feature processing on the input mixed audio and the two groups of face information and combines the processed features into a joint feature; after dimension conversion, the joint feature is converted through a fully connected layer into two speaker features corresponding to the two speakers; each speaker feature is multiplied with the input mixed audio feature to obtain the feature spectrogram of the corresponding speaker, and each feature spectrogram is restored to audio data;
in the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss;
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
Furthermore, a short-time Fourier transform algorithm is used in the feature extraction of the mixed audio to convert the audio into a spectrogram.
Further, the feature processing is performed with a dilated convolution network.
Further, the dimension conversion is performed with a Bidirectional LSTM network.
Further, the feature spectrogram is restored to audio data through an inverse Fourier transform.
A video-based human voice separation terminal device includes a processor, a memory, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method described above in the embodiments of the present invention when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above in the embodiments of the invention.
By adopting the above technical scheme, the invention can extract the clean speech of a designated speaker from a video.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Fig. 2 is a schematic diagram showing a network structure of the model in this embodiment.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The first embodiment is as follows:
The embodiment of the invention provides a video-based human voice separation method, which comprises the following steps:
s1: the method comprises the steps of collecting video clips with fixed lengths corresponding to different speakers, wherein each video clip comprises face information and audio information corresponding to a single speaker.
In this embodiment, a large number of video clips each containing the face information and audio information of a single speaker are cut from a video platform, where the face information and audio information in each clip correspond to the same speaker. Specifically, each clip is about 3 seconds long and contains 75 face images, and about 10,000 video clips are collected.
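As an illustrative sketch (not part of the original disclosure), each collected clip can be split into face-image frames and a mono audio track before training. The 25 fps frame rate (75 frames over about 3 seconds) and the 16 kHz sample rate below are assumptions; the patent fixes neither value.

```python
# Hypothetical clip preprocessing: extract 25 fps frames and 16 kHz mono
# audio with ffmpeg. Frame rate and sample rate are assumptions.
import subprocess

def preprocess_clip(video_path: str, frames_dir: str, wav_path: str) -> None:
    # Extract individual frames at 25 fps for later face detection.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=25",
         f"{frames_dir}/frame_%03d.png"],
        check=True,
    )
    # Demux the audio track to 16 kHz mono PCM for the STFT stage.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", wav_path],
        check=True,
    )
```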
S2: video clips of any two different speakers are extracted from all the video clips, a noise audio is randomly selected from an audio noise data set, two audio information corresponding to the two extracted video clips and the extracted noise audio are combined, the combined mixed audio and two groups of face information corresponding to the two extracted video clips are used as training data, and all the training data form a training set.
The audio noise data set is used to add noise to the original clean audio; an existing public audio data set, such as the AudioSet data set, may be used.
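A minimal sketch of building one training mixture, assuming equal-length waveforms at a common sample rate; the noise gain and the peak normalization are illustrative choices, not specified by the patent:

```python
# Combine the clean audio of two speakers with a randomly selected noise clip.
import numpy as np

def make_mixture(speech_a: np.ndarray, speech_b: np.ndarray,
                 noise_bank: list[np.ndarray],
                 rng: np.random.Generator) -> np.ndarray:
    idx = rng.integers(len(noise_bank))        # random noise selection
    noise = noise_bank[idx][: len(speech_a)]   # trim noise to clip length
    mix = speech_a + speech_b + 0.3 * noise    # 0.3 noise gain is an assumption
    return mix / np.max(np.abs(mix))           # normalize to avoid clipping
```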
S3: and constructing a voice separation model, and training the voice separation model through a training set to obtain the trained voice separation model.
Referring to fig. 2, the human voice separation model first performs feature extraction on the input mixed audio and the two groups of face information. A short-time Fourier transform (STFT) is used in the feature extraction of the mixed audio to convert the audio into a spectrogram; in this embodiment the extracted mixed audio feature is 298 × 257, and the face features of the two groups of face information are both 75 × 1024.
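The STFT parameters in the sketch below are assumptions chosen so that a 3-second, 16 kHz clip yields a spectrogram close to the 298 × 257 shape quoted above (a 25 ms window, 10 ms hop and 512-point FFT give 257 frequency bins; centered framing yields 301 frames, uncentered framing exactly 298); only the approximate output shape comes from the text.

```python
# Sketch of the mixed-audio feature extraction via STFT.
import torch

def audio_features(mix: torch.Tensor) -> torch.Tensor:
    # mix: (48000,) mono waveform (3 s at 16 kHz, an assumed format)
    spec = torch.stft(
        mix, n_fft=512, hop_length=160, win_length=400,
        window=torch.hann_window(400), return_complex=True,
    )                    # -> (257, 301) complex, centered framing
    return spec.abs().T  # -> (301, 257) magnitude features
```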
After feature extraction, each feature is further processed; in this embodiment, the feature processing is performed with a dilated convolution network. After feature processing, each set of face features is converted into a 298 × 256 structure, and the mixed audio feature is converted into a 257 × 8 structure.
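An illustrative dilated-convolution stack for the face stream is sketched below: 1-D convolutions over time with growing dilation rates, followed by temporal upsampling from 75 video frames to the audio frame rate. The channel widths, dilation rates and upsampling mode are assumptions; only the 75 × 1024 input and 298 × 256 output shapes come from the text.

```python
# Hypothetical dilated ("hole") convolution network for one face stream.
import torch
import torch.nn as nn

class FaceStream(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1024, 256, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
        )

    def forward(self, face: torch.Tensor, t_audio: int = 298) -> torch.Tensor:
        # face: (batch, 75, 1024) -> convolve over time -> (batch, 256, 75)
        x = self.convs(face.transpose(1, 2))
        # Upsample the 75 video frames to the audio frame count.
        x = nn.functional.interpolate(x, size=t_audio, mode="nearest")
        return x.transpose(1, 2)  # -> (batch, t_audio, 256)
```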
The three processed features are combined into one joint feature, and dimension conversion is performed on the joint feature with a Bidirectional LSTM network: the joint feature is first converted into a 298 × 400 structure and then into three successive 298 × 600 structures.
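A sketch of this fusion stage, assuming the three streams are concatenated along the feature axis: a bidirectional LSTM with hidden size 200 per direction yields the 298 × 400 structure, and three fully connected layers of width 600 yield the 298 × 600 structures; the input width and activation functions are assumptions.

```python
# Hypothetical dimension-conversion stage: BiLSTM followed by three FC layers.
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    def __init__(self, in_dim: int) -> None:
        super().__init__()
        self.blstm = nn.LSTM(in_dim, 200, batch_first=True,
                             bidirectional=True)  # 2 x 200 -> width 400
        self.fcs = nn.Sequential(
            nn.Linear(400, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
        )

    def forward(self, joint: torch.Tensor) -> torch.Tensor:
        x, _ = self.blstm(joint)  # (batch, 298, 400)
        return self.fcs(x)        # (batch, 298, 600)
```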
After dimension conversion, a fully connected layer converts the result into two speaker features corresponding to the two speakers, i.e., 2 masks with a 298 × 257 structure, one mask for each speaker.
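A sketch of this mask head, assuming the masks are bounded to [0, 1] with a sigmoid (the patent does not state the activation):

```python
# Hypothetical mask head: one FC layer mapping each 600-dim frame vector to
# two 257-dim mask rows, i.e. two T x 257 masks, one per speaker.
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(600, 2 * 257)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        masks = torch.sigmoid(self.fc(x))  # (batch, T, 2 * 257)
        return masks.view(b, t, 2, 257).permute(0, 2, 1, 3)
        # -> (batch, 2 speakers, T, 257)
```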
The two speaker features (masks) are multiplied with the input mixed audio feature to obtain the feature spectrograms corresponding to the two speakers (i.e., the result of filtering out the interfering audio for each speaker), and the feature spectrograms are restored to audio data through an inverse short-time Fourier transform (ISTFT); the two audio outputs of the model correspond to the audio of the two input speakers.
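A sketch of the reconstruction step, reusing the assumed STFT parameters from the analysis sketch above:

```python
# Apply each mask to the complex mixture spectrogram and invert with ISTFT.
import torch

def reconstruct(mix_spec: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # mix_spec: (257, T) complex spectrogram of the mixture
    # masks:    (2, T, 257) real-valued masks, one per speaker
    outputs = []
    for m in masks:
        masked = mix_spec * m.T  # element-wise filtering of the mixture
        wav = torch.istft(masked, n_fft=512, hop_length=160, win_length=400,
                          window=torch.hann_window(400), length=48000)
        outputs.append(wav)
    return torch.stack(outputs)  # (2, 48000), one waveform per speaker
```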
In the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss. In this embodiment, training is performed until the loss value settles into a stable interval.
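The patent does not specify the distance measure or whether the output/reference pairing is searched over both speaker orderings, so the sketch below assumes a fixed speaker order and a simple mean-squared error:

```python
# Hypothetical training loss between separated and reference audio.
import torch

def separation_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (2, n_samples) -- one row per speaker, fixed ordering
    return torch.mean((pred - target) ** 2)
```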
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
In the embodiment of the invention, the face images of a designated speaker in a video and an audio clip containing environmental noise are input into the deep learning model, and the model outputs the audio of the designated speaker without the environmental interference.
Example two:
The invention also provides a video-based human voice separation terminal device, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method of the first embodiment of the invention when executing the computer program.
Further, as an executable scheme, the video-based human voice separation terminal device may be a computing device such as a desktop computer, a notebook computer, a handheld computer or a cloud server. The video-based human voice separation terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is only an example of the video-based human voice separation terminal device and does not constitute a limitation on it; the device may include more or fewer components, combine some components, or use different components. For example, it may further include input and output devices, network access devices, a bus and the like, which is not limited by the embodiment of the present invention.
Further, as an executable solution, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the video-based human voice separation terminal device and connects the parts of the whole device through various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the video-based human voice separation terminal device by running the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the terminal device. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device or other non-volatile solid-state storage device.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method of the embodiments of the invention.
The integrated module/unit of the video-based human voice separation terminal device may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on this understanding, all or part of the flow of the method of the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when executed by a processor, the computer program implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video-based human voice separation method, characterized by comprising the following steps:
S1: collecting fixed-length video clips of different speakers, wherein each video clip contains the face information and audio information of a single speaker;
S2: extracting the video clips of any two different speakers from all the video clips, randomly selecting a noise audio from an audio noise data set, combining the two audio tracks of the two extracted video clips with the selected noise audio, taking the resulting mixed audio together with the two groups of face information of the two extracted video clips as one piece of training data, and forming a training set from all the training data;
S3: constructing a human voice separation model, and training the human voice separation model through the training set to obtain a trained human voice separation model;
the human voice separation model performs feature extraction and feature processing on the input mixed audio and the two groups of face information and combines the processed features into a joint feature; after dimension conversion, the joint feature is converted through a fully connected layer into two speaker features corresponding to the two speakers; each speaker feature is multiplied with the input mixed audio feature to obtain the feature spectrogram of the corresponding speaker, and each feature spectrogram is restored to audio data;
in the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss;
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
2. The video-based human voice separation method of claim 1, wherein: a short-time Fourier transform algorithm is used in the feature extraction of the mixed audio to convert the audio into a spectrogram.
3. The video-based human voice separation method of claim 1, wherein: the feature processing is performed with a dilated convolution network.
4. The video-based human voice separation method of claim 1, wherein: the dimension conversion is performed with a Bidirectional LSTM network.
5. The video-based human voice separation method of claim 1, wherein: the feature spectrogram is restored to audio data through an inverse Fourier transform.
6. A video-based human voice separation terminal device, characterized in that: it comprises a processor, a memory and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 5 when executing the computer program.
7. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN202210146711.4A 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium Pending CN114598914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146711.4A CN114598914A (en) 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114598914A true CN114598914A (en) 2022-06-07

Family

ID=81806493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146711.4A Pending CN114598914A (en) 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114598914A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200335121A1 (en) * 2017-11-22 2020-10-22 Google Llc Audio-visual speech separation
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN110246512A (en) * 2019-05-30 2019-09-17 平安科技(深圳)有限公司 Sound separation method, device and computer readable storage medium
WO2020237855A1 (en) * 2019-05-30 2020-12-03 平安科技(深圳)有限公司 Sound separation method and apparatus, and computer readable storage medium
CN113035227A (en) * 2021-03-12 2021-06-25 山东大学 Multi-modal voice separation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220607