CN114598914A - Human voice separation method based on video, terminal device and storage medium

Human voice separation method based on video, terminal device and storage medium

Info

Publication number
CN114598914A
CN114598914A
Authority
CN
China
Prior art keywords
audio
video
voice separation
feature
model
Prior art date
Legal status
Pending
Application number
CN202210146711.4A
Other languages
Chinese (zh)
Inventor
陈剑超
肖龙源
李稀敏
叶志坚
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202210146711.4A priority Critical patent/CN114598914A/en
Publication of CN114598914A publication Critical patent/CN114598914A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a video-based human voice separation method, a terminal device and a storage medium. The method comprises the following steps: combining the audio corresponding to the video clips of any two different speakers with random noise into a mixed audio, and taking the mixed audio and the two groups of face information corresponding to the two video clips as training data; constructing a human voice separation model which performs feature extraction and feature processing on the mixed audio and the face information, combines the results into a joint feature, converts the joint feature through dimension conversion and a fully connected layer into two speaker features, multiplies each speaker feature with the mixed audio feature to obtain a feature spectrogram, and restores the feature spectrogram to audio data; after the human voice separation model is trained on the training set, human voice separation is performed on video clips containing face information and audio information by the trained model. The invention can extract the clean speech of a designated speaker from a video.

Description

Human voice separation method based on video, terminal device and storage medium
Technical Field
The present invention relates to the field of human voice separation, and in particular, to a video-based human voice separation method, a terminal device, and a storage medium.
Background
With the continuous development of video media technology, more and more information and content are presented in video form; on Internet video platforms, for example, countless videos are uploaded every day. To obtain the information in a video, a user can listen to what the speakers say, but the speech may not be heard clearly because the speaker may be in a noisy environment or several speakers may talk at the same time, which degrades the listening experience.
At present, most video platforms do not process the speaker's voice in a video and usually output the original sound directly, so the speech is easily degraded by environmental interference.
Disclosure of Invention
In order to solve the above problems, the present invention provides a video-based human voice separation method, a terminal device and a storage medium.
The specific scheme is as follows:
A video-based human voice separation method comprises the following steps:
S1: collecting fixed-length video clips of different speakers, wherein each video clip contains the face information and audio information of a single speaker;
S2: extracting the video clips of any two different speakers from all the video clips, randomly selecting a noise audio from an audio noise data set, combining the two audio tracks of the two extracted video clips with the selected noise audio, taking the resulting mixed audio together with the two groups of face information of the two extracted video clips as one piece of training data, and forming a training set from all the training data;
S3: constructing a human voice separation model, and training the human voice separation model through the training set to obtain a trained human voice separation model;
the human voice separation model performs feature extraction and feature processing on the input mixed audio and the two groups of face information and combines the processed features into a joint feature; after dimension conversion, the joint feature is converted through a fully connected layer into two speaker features corresponding to the two speakers; each speaker feature is multiplied with the input mixed audio feature to obtain the feature spectrogram of the corresponding speaker, and each feature spectrogram is restored to audio data;
in the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss;
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
Furthermore, a short-time Fourier transform algorithm is used in the feature extraction of the mixed audio to convert the audio into a spectrogram.
Further, the feature processing is performed with a dilated convolution network.
Further, the dimension conversion is performed with a Bidirectional LSTM network.
Further, the feature spectrogram is restored to audio data through an inverse Fourier transform.
A video-based human voice separation terminal device includes a processor, a memory, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method described above in the embodiments of the present invention when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above in the embodiments of the invention.
By adopting the above technical scheme, the invention can extract the clean speech of a designated speaker from a video.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Fig. 2 is a schematic diagram showing a network structure of the model in this embodiment.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The first embodiment is as follows:
The embodiment of the invention provides a video-based human voice separation method, which comprises the following steps:
s1: the method comprises the steps of collecting video clips with fixed lengths corresponding to different speakers, wherein each video clip comprises face information and audio information corresponding to a single speaker.
In this embodiment, a large number of video clips each containing the face information and audio information of a single speaker are cut from a video platform, where the face information and audio information in each clip correspond to the same speaker. Specifically, each clip is about 3 seconds long and contains 75 face images, and about 10,000 video clips are collected.
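As an illustrative sketch (not part of the original disclosure), each collected clip can be split into face-image frames and a mono audio track before training. The 25 fps frame rate (75 frames over about 3 seconds) and the 16 kHz sample rate below are assumptions; the patent fixes neither value.

```python
# Hypothetical clip preprocessing: extract 25 fps frames and 16 kHz mono
# audio with ffmpeg. Frame rate and sample rate are assumptions.
import subprocess

def preprocess_clip(video_path: str, frames_dir: str, wav_path: str) -> None:
    # Extract individual frames at 25 fps for later face detection.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=25",
         f"{frames_dir}/frame_%03d.png"],
        check=True,
    )
    # Demux the audio track to 16 kHz mono PCM for the STFT stage.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", wav_path],
        check=True,
    )
```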
S2: video clips of any two different speakers are extracted from all the video clips, a noise audio is randomly selected from an audio noise data set, two audio information corresponding to the two extracted video clips and the extracted noise audio are combined, the combined mixed audio and two groups of face information corresponding to the two extracted video clips are used as training data, and all the training data form a training set.
The audio noise data set is used to add noise to the original clean audio; an existing public audio data set, such as the AudioSet data set, may be used.
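A minimal sketch of building one training mixture, assuming equal-length waveforms at a common sample rate; the noise gain and the peak normalization are illustrative choices, not specified by the patent:

```python
# Combine the clean audio of two speakers with a randomly selected noise clip.
import numpy as np

def make_mixture(speech_a: np.ndarray, speech_b: np.ndarray,
                 noise_bank: list[np.ndarray],
                 rng: np.random.Generator) -> np.ndarray:
    idx = rng.integers(len(noise_bank))        # random noise selection
    noise = noise_bank[idx][: len(speech_a)]   # trim noise to clip length
    mix = speech_a + speech_b + 0.3 * noise    # 0.3 noise gain is an assumption
    return mix / np.max(np.abs(mix))           # normalize to avoid clipping
```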
S3: and constructing a voice separation model, and training the voice separation model through a training set to obtain the trained voice separation model.
Referring to fig. 2, the human voice separation model first performs feature extraction on the input mixed audio and the two groups of face information. A short-time Fourier transform (STFT) is used in the feature extraction of the mixed audio to convert the audio into a spectrogram; in this embodiment the extracted mixed audio feature is 298 × 257, and the face features of the two groups of face information are both 75 × 1024.
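The STFT parameters in the sketch below are assumptions chosen so that a 3-second, 16 kHz clip yields a spectrogram close to the 298 × 257 shape quoted above (a 25 ms window, 10 ms hop and 512-point FFT give 257 frequency bins; centered framing yields 301 frames, uncentered framing exactly 298); only the approximate output shape comes from the text.

```python
# Sketch of the mixed-audio feature extraction via STFT.
import torch

def audio_features(mix: torch.Tensor) -> torch.Tensor:
    # mix: (48000,) mono waveform (3 s at 16 kHz, an assumed format)
    spec = torch.stft(
        mix, n_fft=512, hop_length=160, win_length=400,
        window=torch.hann_window(400), return_complex=True,
    )                    # -> (257, 301) complex, centered framing
    return spec.abs().T  # -> (301, 257) magnitude features
```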
After feature extraction, each feature is further processed; in this embodiment, the feature processing is performed with a dilated convolution network. After feature processing, each set of face features is converted into a 298 × 256 structure, and the mixed audio feature is converted into a 257 × 8 structure.
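An illustrative dilated-convolution stack for the face stream is sketched below: 1-D convolutions over time with growing dilation rates, followed by temporal upsampling from 75 video frames to the audio frame rate. The channel widths, dilation rates and upsampling mode are assumptions; only the 75 × 1024 input and 298 × 256 output shapes come from the text.

```python
# Hypothetical dilated ("hole") convolution network for one face stream.
import torch
import torch.nn as nn

class FaceStream(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1024, 256, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
        )

    def forward(self, face: torch.Tensor, t_audio: int = 298) -> torch.Tensor:
        # face: (batch, 75, 1024) -> convolve over time -> (batch, 256, 75)
        x = self.convs(face.transpose(1, 2))
        # Upsample the 75 video frames to the audio frame count.
        x = nn.functional.interpolate(x, size=t_audio, mode="nearest")
        return x.transpose(1, 2)  # -> (batch, t_audio, 256)
```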
The three processed features are combined into one joint feature, and dimension conversion is performed on the joint feature with a Bidirectional LSTM network: the joint feature is first converted into a 298 × 400 structure and then into three successive 298 × 600 structures.
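A sketch of this fusion stage, assuming the three streams are concatenated along the feature axis: a bidirectional LSTM with hidden size 200 per direction yields the 298 × 400 structure, and three fully connected layers of width 600 yield the 298 × 600 structures; the input width and activation functions are assumptions.

```python
# Hypothetical dimension-conversion stage: BiLSTM followed by three FC layers.
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    def __init__(self, in_dim: int) -> None:
        super().__init__()
        self.blstm = nn.LSTM(in_dim, 200, batch_first=True,
                             bidirectional=True)  # 2 x 200 -> width 400
        self.fcs = nn.Sequential(
            nn.Linear(400, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
        )

    def forward(self, joint: torch.Tensor) -> torch.Tensor:
        x, _ = self.blstm(joint)  # (batch, 298, 400)
        return self.fcs(x)        # (batch, 298, 600)
```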
After dimension conversion, a fully connected layer converts the result into two speaker features corresponding to the two speakers, i.e., 2 masks with a 298 × 257 structure, one mask for each speaker.
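A sketch of this mask head, assuming the masks are bounded to [0, 1] with a sigmoid (the patent does not state the activation):

```python
# Hypothetical mask head: one FC layer mapping each 600-dim frame vector to
# two 257-dim mask rows, i.e. two T x 257 masks, one per speaker.
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(600, 2 * 257)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        masks = torch.sigmoid(self.fc(x))  # (batch, T, 2 * 257)
        return masks.view(b, t, 2, 257).permute(0, 2, 1, 3)
        # -> (batch, 2 speakers, T, 257)
```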
The two speaker features (masks) are multiplied with the input mixed audio feature to obtain the feature spectrograms corresponding to the two speakers (i.e., the result of filtering out the interfering audio for each speaker), and the feature spectrograms are restored to audio data through an inverse short-time Fourier transform (ISTFT); the two audio outputs of the model correspond to the audio of the two input speakers.
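A sketch of the reconstruction step, reusing the assumed STFT parameters from the analysis sketch above:

```python
# Apply each mask to the complex mixture spectrogram and invert with ISTFT.
import torch

def reconstruct(mix_spec: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # mix_spec: (257, T) complex spectrogram of the mixture
    # masks:    (2, T, 257) real-valued masks, one per speaker
    outputs = []
    for m in masks:
        masked = mix_spec * m.T  # element-wise filtering of the mixture
        wav = torch.istft(masked, n_fft=512, hop_length=160, win_length=400,
                          window=torch.hann_window(400), length=48000)
        outputs.append(wav)
    return torch.stack(outputs)  # (2, 48000), one waveform per speaker
```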
In the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss. In this embodiment, training is performed until the loss value settles into a stable interval.
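The patent does not specify the distance measure or whether the output/reference pairing is searched over both speaker orderings, so the sketch below assumes a fixed speaker order and a simple mean-squared error:

```python
# Hypothetical training loss between separated and reference audio.
import torch

def separation_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (2, n_samples) -- one row per speaker, fixed ordering
    return torch.mean((pred - target) ** 2)
```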
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
In the embodiment of the invention, the face images of a designated speaker in a video and an audio clip containing environmental noise are input into the deep learning model, and the model outputs the audio of the designated speaker without the environmental interference.
Example two:
The invention also provides a video-based human voice separation terminal device, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method of the first embodiment of the invention when executing the computer program.
Further, as an executable scheme, the video-based human voice separation terminal device may be a computing device such as a desktop computer, a notebook computer, a handheld computer or a cloud server. The video-based human voice separation terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is only an example of the video-based human voice separation terminal device and does not constitute a limitation on it; the device may include more or fewer components, combine some components, or use different components. For example, it may further include input and output devices, network access devices, a bus and the like, which is not limited by the embodiment of the present invention.
Further, as an executable solution, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the video-based human voice separation terminal device and connects the parts of the whole device through various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the video-based human voice separation terminal device by running the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the terminal device. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device or other non-volatile solid-state storage device.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method of the embodiments of the invention.
The integrated module/unit of the video-based human voice separation terminal device may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on this understanding, all or part of the flow of the method of the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when executed by a processor, the computer program implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A video-based human voice separation method, characterized by comprising the following steps:
S1: collecting fixed-length video clips of different speakers, wherein each video clip contains the face information and audio information of a single speaker;
S2: extracting the video clips of any two different speakers from all the video clips, randomly selecting a noise audio from an audio noise data set, combining the two audio tracks of the two extracted video clips with the selected noise audio, taking the resulting mixed audio together with the two groups of face information of the two extracted video clips as one piece of training data, and forming a training set from all the training data;
S3: constructing a human voice separation model, and training the human voice separation model through the training set to obtain a trained human voice separation model;
the human voice separation model performs feature extraction and feature processing on the input mixed audio and the two groups of face information and combines the processed features into a joint feature; after dimension conversion, the joint feature is converted through a fully connected layer into two speaker features corresponding to the two speakers; each speaker feature is multiplied with the input mixed audio feature to obtain the feature spectrogram of the corresponding speaker, and each feature spectrogram is restored to audio data;
in the model training process, the difference between the two audio outputs of the model and the real audio of the two speakers in the input training data is used as the loss value, and the model is trained iteratively with the goal of minimizing this loss;
S4: performing human voice separation on video clips containing face information and audio information through the trained human voice separation model.
2. The video-based human voice separation method of claim 1, wherein: a short-time Fourier transform algorithm is used in the feature extraction of the mixed audio to convert the audio into a spectrogram.
3. The video-based human voice separation method of claim 1, wherein: the feature processing is performed with a dilated convolution network.
4. The video-based human voice separation method of claim 1, wherein: the dimension conversion is performed with a Bidirectional LSTM network.
5. The video-based human voice separation method of claim 1, wherein: the feature spectrogram is restored to audio data through an inverse Fourier transform.
6. A video-based human voice separation terminal device, characterized in that: it comprises a processor, a memory and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 5 when executing the computer program.
7. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN202210146711.4A 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium Pending CN114598914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146711.4A CN114598914A (en) 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114598914A true CN114598914A (en) 2022-06-07

Family

ID=81806493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146711.4A Pending CN114598914A (en) 2022-02-17 2022-02-17 Human voice separation method based on video, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114598914A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200335121A1 (en) * 2017-11-22 2020-10-22 Google Llc Audio-visual speech separation
CN108766459A (en) * 2018-06-13 2018-11-06 北京联合大学 Target speaker method of estimation and system in a kind of mixing of multi-person speech
CN110246512A (en) * 2019-05-30 2019-09-17 平安科技(深圳)有限公司 Sound separation method, device and computer readable storage medium
WO2020237855A1 (en) * 2019-05-30 2020-12-03 平安科技(深圳)有限公司 Sound separation method and apparatus, and computer readable storage medium
CN113035227A (en) * 2021-03-12 2021-06-25 山东大学 Multi-modal voice separation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220607