CN116343809B - Video voice enhancement method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116343809B
CN116343809B (application number CN202211447626.8A)
Authority
CN
China
Prior art keywords
video
audio
enhanced
enhancement
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211447626.8A
Other languages
Chinese (zh)
Other versions
CN116343809A (en)
Inventor
张恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuanjie Technology Co ltd
Original Assignee
Shanghai Xuanjie Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xuanjie Technology Co ltd filed Critical Shanghai Xuanjie Technology Co ltd
Priority to CN202211447626.8A
Publication of CN116343809A
Application granted
Publication of CN116343809B

Classifications

    • G PHYSICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/0208 Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/25 Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/168 Human faces: feature extraction; face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a video speech enhancement method and apparatus, an electronic device and a storage medium, relating to the technical field of image processing. The method mainly comprises: acquiring picture depth information, a semantically segmented image main body and an attention region in a video to be enhanced; acquiring a face region in the video to be enhanced; performing audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio; and fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video. Compared with the related art, when the audio in the video to be enhanced is enhanced, it is fused with the corresponding face region and attention region, so that audio enhancement and picture enhancement proceed synchronously: the sound in the video is enhanced, the display picture of the video to be enhanced is enhanced as well, and the overall video effect is improved.

Description

Video voice enhancement method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and apparatus for enhancing video and voice, an electronic device, and a storage medium.
Background
There is increasing interest in using neural networks for multimodal fusion of auditory and visual signals to address various speech-related problems. These include audiovisual speech recognition, predicting speech or text from a silent video (e.g., lip reading), and unsupervised learning of language from visual and speech signals. These methods utilize natural synchronization between simultaneously recorded visual and audible signals.
At present, a common audio-visual analysis approach is a multi-task model based on convolutional neural networks (Convolutional Neural Network, CNN): the model outputs a speech spectrogram from the noisy input and reconstructs the input mouth region, thereby completing audio separation within the video picture. Audio-visual methods in the related art are therefore used only for speech separation and do not address the processing of other content in the video data.
Disclosure of Invention
The application provides a method, a device, electronic equipment and a storage medium for video voice enhancement. The method mainly aims to solve the problem that the audio-visual processing method in the related art is only used for voice separation.
According to a first aspect of the present application, there is provided a method for video speech enhancement, comprising:
acquiring picture depth information, a semantically segmented image main body and an attention area in a video to be enhanced;
acquiring a face region in the video to be enhanced;
performing audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio;
and fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video.
Optionally, performing the audio enhancement processing according to the audio in the video to be enhanced, the face area and the attention area to obtain the enhanced audio includes:
decoding audio in the video to be enhanced;
aligning the decoded audio according to the face area and the attention area to obtain a first audio;
performing background noise elimination processing on the first audio to obtain a second audio;
and carrying out sound track enhancement processing on the second audio to obtain enhanced audio.
Optionally, the fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain the target video includes:
replacing the original audio in the video to be enhanced with the enhanced audio;
performing fusion processing on the semantically segmented image main body and the attention area to obtain a fusion area of the video to be enhanced;
performing blurring processing on the areas other than the fusion area to obtain a blurred area of the video to be enhanced, wherein the size of the blur kernel is scaled according to the picture depth value;
and performing encoding and fusion processing on the enhanced audio together with the fusion area and the blurred area of the video to be enhanced to obtain the target video.
Optionally, the obtaining the picture depth information, the semantically segmented image main body and the attention area in the video to be enhanced includes:
inputting the video to be enhanced into a multi-task scene analysis model to acquire the picture depth information of each pixel of a single picture of the video to be enhanced, the image main body of the single picture, and the attention area of the single picture.
Optionally, before acquiring the picture depth information, the semantically segmented image body and the attention area in the video to be enhanced, the method further comprises:
And constructing a video voice enhancement model, wherein the video voice enhancement model comprises a scene analysis module, a face detection module, an audio enhancement module and a fusion module.
Optionally, the building the video voice enhancement model includes:
configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
configuring the face detection module to use a target detection model for realizing face region detection in the video;
configuring the audio enhancement module to use a trained audio enhancement model, wherein the analysis result of the attention inference task in the multi-task scene analysis model, the face detection result of the target detection model and the audio in the video are taken as inputs of the audio enhancement module to realize audio enhancement;
and configuring the fusion module to use a trained fusion model, wherein the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module are taken as inputs of the fusion model to realize fusion of the enhanced audio and the video.
Optionally, training the audio enhancement model includes:
extracting the initial single-track audio in a sample video, wherein the sample video further comprises marked attention areas and custom attention areas;
performing enhancement processing on the training face frames in the sample video to obtain enhanced training face frames;
and inputting the initial single-track audio, the marked attention areas, the custom attention areas and the enhanced training face frames into the audio enhancement model for training to obtain a trained audio enhancement model.
Optionally, training the scene parsing model includes:
transmitting a sample video to a backbone network to obtain a feature vector of a single picture in the sample video;
respectively carrying out semantic segmentation training, depth estimation training and attention inference training according to the feature vectors;
and obtaining the trained multi-task scene analysis model according to the training results of the semantic segmentation training, the depth estimation training and the attention inference training.
Optionally, training the fusion model includes:
extracting an image main body and an attention area marked in a sample video;
Performing picture fusion training according to picture depth information, an image main body and an attention area in a sample video;
and carrying out video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
Optionally, the method further comprises:
and generating a small model based on the target detection model through pruning according to the training result of the target detection model.
According to a second aspect of the present application, there is provided an apparatus for video speech enhancement, comprising:
the first acquisition unit is used for acquiring picture depth information, a semantically segmented image main body and an attention area in the video to be enhanced;
the second acquisition unit is used for acquiring the face area in the video to be enhanced;
the first enhancement processing unit is used for carrying out audio enhancement processing according to the audio in the video to be enhanced, the face area and the attention area to obtain enhanced audio;
and the second enhancement processing unit is used for fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video.
Optionally, the first enhancement processing unit includes:
The decoding module is used for decoding the audio in the video to be enhanced;
the alignment module is used for aligning the decoded audio according to the face area and the attention area to obtain a first audio;
the noise elimination module is used for carrying out background noise elimination processing on the first audio to obtain a second audio;
and the enhancement processing module is used for carrying out sound track enhancement processing on the second audio to obtain enhanced audio.
Optionally, the second enhancement processing unit includes:
the replacing module is used for replacing the original audio in the video to be enhanced by using the enhanced audio;
the fusion processing module is used for carrying out fusion processing on the semantically segmented image main body and the attention area to obtain a fusion area of the video to be enhanced;
the blurring processing module is used for performing blurring processing on the areas other than the fusion area to obtain a blurred area of the video to be enhanced, wherein the size of the blur kernel is scaled according to the picture depth value;
and the enhancement processing module is used for performing encoding and fusion processing on the enhanced audio together with the fusion area and the blurred area of the video to be enhanced to obtain a target video.
Optionally, the first obtaining unit is further configured to:
inputting the video to be enhanced into a multi-task scene analysis model to obtain, output in parallel, the picture depth information of each pixel of a single picture of the video to be enhanced, the image main body of the single picture, and the attention area of the single picture.
Optionally, the apparatus further comprises:
the construction unit is used for constructing a video voice enhancement model before the first acquisition unit acquires the picture depth information, the semantically segmented image main body and the attention area in the video to be enhanced, wherein the video voice enhancement model comprises a scene analysis module, a face detection module, an audio enhancement module and a fusion module.
Optionally, the building unit is further configured to:
configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
configuring the face detection module to use a target detection model for realizing face region detection in the video;
configuring the audio enhancement module to use an audio enhancement model, wherein the analysis result of the attention inference task in the multi-task scene analysis model, the face detection result of the target detection model and the audio in the video to be enhanced are taken as inputs of the audio enhancement module to enhance the audio;
and configuring the fusion module to use a fusion model, wherein the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module are taken as inputs of the fusion model to realize fusion of the enhanced audio and the video.
Optionally, the audio enhancement model training unit comprises:
extracting initial single-track audio in a sample video, wherein the sample video also comprises marked attention areas and custom attention areas;
performing enhancement processing on the initial face frame in the sample video to obtain a target face frame;
and training the audio enhancement model based on the initial mono track audio, the marked attention area, the self-defined attention area and the target face frame to obtain a trained audio enhancement model.
Optionally, the scene parsing model training unit includes:
transmitting a sample video to a backbone network to obtain a feature vector of a single picture in the sample video;
and respectively carrying out semantic segmentation training, depth estimation training and attention inference training according to the feature vectors to obtain the trained multi-task scene analysis model.
Optionally, the fusion model training unit includes:
extracting an image main body and an attention area marked in a sample video;
performing picture fusion training according to picture depth information, an image main body and an attention area in a sample video;
and carrying out video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
Optionally, the target detection model training unit includes:
performing pruning on the training result of the target detection model to obtain the target detection model.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present application, there is provided a chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method of the first aspect.
According to a fifth aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the preceding first aspect.
According to a sixth aspect of the present application there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
According to the video speech enhancement method, apparatus, electronic device and storage medium, the picture depth information, the semantically segmented image main body and the attention region in the video to be enhanced are obtained; the face region in the video to be enhanced is obtained; audio enhancement processing is performed according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio; and the picture depth information, the semantically segmented image main body and the enhanced audio are fused to obtain a target video. Compared with the related art, when the audio in the video to be enhanced is enhanced, it is fused with the corresponding face region and attention region, so that audio enhancement and picture enhancement proceed synchronously: the sound in the video is enhanced, the display picture of the video to be enhanced is enhanced as well, and the overall video effect is improved.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
fig. 1 is a flowchart of a method for video and audio enhancement according to an embodiment of the present application;
FIG. 2 is a block diagram of a model of video speech enhancement provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an audio enhancement module according to an embodiment of the present application to perform audio enhancement;
fig. 4 is a schematic diagram of a face detection module for performing face detection according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for training a scene analysis model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for training an audio enhancement model according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for training a fusion model according to an embodiment of the present application;
fig. 8 is a schematic diagram of a face detection model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a video and voice enhancement device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a video and audio enhancement device according to an embodiment of the present application;
fig. 11 is a schematic block diagram of an example electronic device provided by an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Methods, apparatuses, electronic devices, and storage media for video-voice enhancement of embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for video and speech enhancement according to an embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
step 101, obtaining picture depth information, a semantically segmented image main body and an attention area in a video to be enhanced.
For a better understanding of the present embodiment, please refer to fig. 2, which is a schematic diagram of a video speech enhancement system according to an embodiment of the present application. The framework comprises a scene analysis module, a face detection module, an audio enhancement module and a fusion module. The scene analysis module is used for analyzing the display picture in the video to be enhanced, corresponding to step 101; the face detection module is used for detecting the face region in the video to be enhanced, corresponding to step 102; the audio enhancement module is used for enhancing the audio in the video to be enhanced, corresponding to step 103; and the fusion module fuses the processed display picture with the result of the audio enhancement, that is, the audio in the video to be enhanced is improved based on the attention in the display picture, corresponding to step 104.
The system architecture shown in fig. 2 may be deployed in any electronic device with sufficient data processing capability. The embodiment of the present application is mainly applied after the video recording function is started, and the video speech enhancement method described in this embodiment is then executed simultaneously with the recording; the manner of starting the video recording function is not limited in the embodiment of the present application.
After the video recording function is triggered, the method shown in fig. 1 is started. The recorded video to be enhanced passes through the scene analysis module, which automatically parses, from each picture of the video to be enhanced, the corresponding picture depth information (estimated depth), the semantically segmented image main body together with the semantic segmentation masks (estimated segmentation masks) covering the image main body and the background, and the attention area, where the attention area may be given as an attention mask.
Step 102, acquiring a face area in the video to be enhanced.
When step 101 is executed based on the system shown in fig. 2, the face detection module may be synchronously executed to detect the face region in the video to be enhanced.
Step 103, performing audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio.
The face region in the video to be enhanced detected by the face detection module, the attention region (or attention mask) obtained by the scene analysis module, and the audio in the video to be enhanced are taken as the inputs of the audio enhancement module in fig. 2; after performing the audio enhancement processing, the audio enhancement module outputs the enhanced audio.
In the embodiment of the present application, video speech enhancement is processed in real time while the video is being captured. For a better understanding of the audio enhancement, consider the following example. Assume that a person A and a person B are captured in the video to be enhanced. When person A is speaking, if the face regions detected by the face detection module include both person A and person B, and the attention region obtained by the scene analysis module is person A, then the audio enhancement module needs to enhance the audio of the speaker, person A. When the scene switches to person B speaking, if the detected face regions still include person A and person B, and the attention region obtained by the scene analysis module is person B, then the audio enhancement module needs to enhance the audio of the speaker, person B. The above example only explains the real-time processing of the audio enhancement module and is not intended to limit the specific application scenario or the detection results of the respective modules.
As a possible manner of the embodiment of the present application, a custom interface is further provided in the audio enhancement module. The custom interface is configured to receive a user-defined operation on the face region or on the attention region, so that the final enhancement effect can be conveniently adjusted.
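Purely as an illustration of how such a custom interface might look in code, the sketch below lets a user-supplied rectangle override the automatically inferred regions; the class name, method names and mask convention are assumptions for illustration, not part of the application.

```python
import numpy as np

class CustomAttentionInterface:
    """Hypothetical sketch of the custom interface: lets a user override the
    automatically detected face region or attention region for one frame."""

    def __init__(self, frame_height: int, frame_width: int):
        self.h, self.w = frame_height, frame_width
        self.user_attention = None   # optional user-drawn attention mask
        self.user_face_box = None    # optional user-drawn face box (x0, y0, x1, y1)

    def set_attention_rect(self, x0: int, y0: int, x1: int, y1: int) -> None:
        """Build a binary attention mask from a rectangle drawn by the user."""
        mask = np.zeros((self.h, self.w), dtype=np.uint8)
        mask[y0:y1, x0:x1] = 1
        self.user_attention = mask

    def set_face_box(self, x0: int, y0: int, x1: int, y1: int) -> None:
        self.user_face_box = (x0, y0, x1, y1)

    def resolve(self, auto_attention: np.ndarray, auto_face_box):
        """Prefer the user-defined regions when present, otherwise keep the
        automatically inferred ones."""
        attention = self.user_attention if self.user_attention is not None else auto_attention
        face_box = self.user_face_box if self.user_face_box is not None else auto_face_box
        return attention, face_box
```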
Step 104, fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video.
The picture depth information (estimated depth) and the semantically segmented image main body obtained in step 101 are fused with the enhanced audio obtained in step 103, so that the video picture is fused with the enhanced sound and the audio effect is improved on the basis of picture attention.
With the video speech enhancement method of this embodiment, the picture depth information, the semantically segmented image main body and the attention region in the video to be enhanced are obtained; the face region in the video to be enhanced is obtained; audio enhancement processing is performed according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio; and the picture depth information, the semantically segmented image main body and the enhanced audio are fused to obtain a target video. Compared with the related art, when the audio in the video to be enhanced is enhanced, it is fused with the corresponding face region and attention region, so that audio enhancement and picture enhancement proceed synchronously: the sound in the video is enhanced, the display picture of the video to be enhanced is enhanced as well, and the overall video effect is improved.
In a possible implementation manner of this embodiment, as shown in fig. 3, which is a schematic diagram of the audio enhancement module performing audio enhancement according to an embodiment of the present application, the audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region may be performed in, but is not limited to, the following manner: decoding the audio in the video to be enhanced; aligning the decoded audio with the face region and the attention region to obtain a first audio; performing background noise elimination on the first audio to obtain a second audio; and performing soundtrack enhancement on the second audio to obtain the enhanced audio. Taking the U-Net network in fig. 3 as an example, the detected face region and attention region, together with the original audio, are taken as the inputs of the audio enhancement module: the input original audio is decoded to obtain the single-track audio of the current video frame, the single-track audio is aligned with the corresponding face region and attention region, operations such as ASMR soundtrack enhancement and background noise elimination are performed, and finally the enhanced audio is output. The embodiment of the present application is not limited to ASMR soundtrack enhancement when performing the soundtrack enhancement.
After the background noise in the video to be enhanced is eliminated, the obtained second audio no longer contains noise; the enhancement processing then continues on the second audio to obtain the enhanced (third) audio, so that it is clearer.
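A minimal sketch of this decode / align / denoise / soundtrack-enhance chain is given below; the toy U-Net, its conditioning scheme, and the stand-in noise-removal and enhancement steps are assumptions made for illustration only and do not reproduce the trained audio enhancement model described here.

```python
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    """Stand-in for the audio enhancement U-Net: a small encoder/decoder over a
    mono waveform, conditioned on visual (face + attention) features."""
    def __init__(self, cond_dim: int = 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv1d(1, 16, 15, stride=2, padding=7), nn.ReLU(),
                                 nn.Conv1d(16, 32, 15, stride=2, padding=7), nn.ReLU())
        self.cond = nn.Linear(cond_dim, 32)  # inject face/attention features
        self.dec = nn.Sequential(nn.ConvTranspose1d(32, 16, 16, stride=2, padding=7), nn.ReLU(),
                                 nn.ConvTranspose1d(16, 1, 16, stride=2, padding=7))

    def forward(self, mono: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        z = self.enc(mono)                            # (B, 32, T/4)
        z = z + self.cond(visual_feat).unsqueeze(-1)  # broadcast visual conditioning over time
        return self.dec(z)                            # enhanced mono waveform, (B, 1, T)

def enhance_audio(mono: torch.Tensor, face_feat: torch.Tensor,
                  attn_feat: torch.Tensor, model: TinyUNet1D) -> torch.Tensor:
    # "Alignment" here simply means pairing the current frame's decoded mono audio
    # with the visual features of the same frame before enhancement.
    visual = torch.cat([face_feat, attn_feat], dim=-1)      # assume 64 + 64 = 128 dims
    first = model(mono, visual)                              # first (aligned) audio
    second = first - first.mean(dim=-1, keepdim=True)        # toy background-noise removal
    third = torch.tanh(1.2 * second)                         # toy soundtrack enhancement
    return third

# Usage sketch (shapes are assumptions): 2 s of 16 kHz mono, 64-dim visual features each.
# out = enhance_audio(torch.randn(1, 1, 32000), torch.randn(1, 64), torch.randn(1, 64), TinyUNet1D())
```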
In a possible implementation manner of the present embodiment, as shown in fig. 4, which is a schematic diagram of the face detection module performing face detection provided in the embodiment of the present application, obtaining the picture depth information, the semantically segmented image main body and the attention area in the video to be enhanced includes: inputting the video to be enhanced into a multi-task scene analysis model to obtain, output in parallel, the picture depth information of each pixel of a single picture of the video to be enhanced, the image main body of the single picture, and the attention area of the single picture.
As shown in fig. 4, the face detection module in the embodiment of the present application supports a multitasking parallel processing manner, and after extracting features according to a video to be enhanced, the extracted features are sent to three parallel task nodes respectively.
In one possible implementation manner of this embodiment, the picture depth information, the semantically segmented image main body and the enhanced audio may be fused to obtain the target video in, but not limited to, the following manner: replacing the original audio in the video to be enhanced with the enhanced audio; performing fusion processing on the semantically segmented image main body and the attention area to obtain a fusion area of the video to be enhanced; performing blurring processing on the areas other than the fusion area to obtain a blurred area of the video to be enhanced, wherein the blur kernel size is scaled according to the picture depth value; and performing encoding and fusion processing on the enhanced audio of the video to be enhanced together with the fusion area and the blurred area to obtain the target video. In a specific application, the enhanced audio obtained in this embodiment directly replaces the original audio in the video to be enhanced, which not only ensures the consistency of the audio before and after replacement, but also ensures the clarity of the audio in the video to be enhanced.
In the embodiment of the present application, the semantically segmented image main body and the attention area are fused, and the blurring processing is performed outside the fusion area, so that the fused video is displayed clearly and intuitively. The blur kernel size used in the blurring processing is scaled according to the estimated depth value; the specific scaling is dynamically adjusted according to the current picture of the video and is not particularly limited in the embodiment of the present application.
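To make the depth-scaled blur concrete, here is a small sketch using OpenCV; the rule mapping depth to kernel size, and the union of subject and attention masks as the fusion region, are illustrative assumptions rather than the scaling actually used by the fusion model.

```python
import cv2
import numpy as np

def fuse_and_blur(frame: np.ndarray, subject_mask: np.ndarray,
                  attention_mask: np.ndarray, depth: np.ndarray,
                  max_kernel: int = 31) -> np.ndarray:
    """frame: HxWx3 uint8; masks: HxW in {0, 1}; depth: HxW in [0, 255]."""
    # Fusion region = union of the semantically segmented subject and the attention region.
    fusion = np.clip(subject_mask + attention_mask, 0, 1).astype(np.uint8)

    # Illustrative rule: scale the blur kernel with the mean depth of the background,
    # so a farther background gets a larger kernel and a stronger blur.
    background = fusion == 0
    mean_depth = float(depth[background].mean()) if background.any() else 0.0
    k = int(round(mean_depth / 255.0 * max_kernel))
    k = max(3, k | 1)                       # kernel size must be odd and >= 3

    blurred = cv2.GaussianBlur(frame, (k, k), 0)

    # Keep the fusion region sharp, blur everything else.
    fusion3 = fusion[..., None]
    return (frame * fusion3 + blurred * (1 - fusion3)).astype(np.uint8)
```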
In practical applications, the enhanced audio and video are encoded and fused while shooting and recording; when the user triggers the stop of video recording, the enhanced audio and video are output synchronously as the final optimized video and stored locally (for example, in the memory of a mobile phone).
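As one possible way to write the result out once recording stops (not a step specified by the application), the fused video stream and the enhanced audio track could be muxed with ffmpeg, assuming ffmpeg is installed; the file paths are placeholders.

```python
import subprocess

def mux_enhanced_audio(video_path: str, enhanced_wav: str, out_path: str) -> None:
    """Copy the (already fused) video stream and replace its audio track
    with the enhanced audio."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", video_path,      # input 0: video with original audio
         "-i", enhanced_wav,    # input 1: enhanced audio
         "-map", "0:v:0",       # take the video stream from input 0
         "-map", "1:a:0",       # take the audio stream from input 1
         "-c:v", "copy",        # do not re-encode the video
         "-c:a", "aac",         # encode the enhanced audio
         "-shortest",           # stop at the shorter of the two streams
         out_path],
        check=True,
    )
```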
Before the method shown in fig. 1 is executed, the network structure of the video speech enhancement model needs to be built and trained. With continued reference to fig. 2, the building specifically includes:
1. configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
2. configuring the face detection module to use a target detection model for realizing face region detection in the video;
3. configuring the audio enhancement module to use a trained audio enhancement model, wherein the analysis result of the attention inference task in the multi-task scene analysis model, the face detection result of the target detection model and the audio in the video are taken as inputs of the audio enhancement module to realize audio enhancement;
4. configuring the fusion module to use a trained fusion model, wherein the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module are taken as inputs of the fusion model to realize fusion of the enhanced audio and the video.
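The skeleton below shows one way the four configured modules could be wired together per frame; every class, attribute and call signature here is a hypothetical placeholder for the trained models described above.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FrameAnalysis:
    depth: Any            # per-pixel depth of the frame
    subject_mask: Any     # semantic segmentation of the image main body
    attention_mask: Any   # inferred attention region

class VideoSpeechEnhancer:
    """Hypothetical wiring of the four modules from fig. 2."""
    def __init__(self, scene_parser, face_detector, audio_enhancer, fuser):
        self.scene_parser = scene_parser      # multi-task scene analysis model
        self.face_detector = face_detector    # target detection (face) model
        self.audio_enhancer = audio_enhancer  # trained audio enhancement model
        self.fuser = fuser                    # trained fusion model

    def process_frame(self, frame, audio_chunk):
        analysis: FrameAnalysis = self.scene_parser(frame)        # step 101
        face_boxes = self.face_detector(frame)                    # step 102
        enhanced_audio = self.audio_enhancer(                     # step 103
            audio_chunk, face_boxes, analysis.attention_mask)
        return self.fuser(                                        # step 104
            frame, analysis.depth, analysis.subject_mask, enhanced_audio)
```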
After the network structure of the video speech enhancement model is built, the models used in it are trained separately with sample videos, namely the multi-task scene analysis model, the target detection model, the audio enhancement model and the fusion model. After these four models are trained, the trained video speech enhancement model is obtained, and the video speech enhancement method shown in fig. 1 is performed based on the trained video speech enhancement model. The training processes for the multi-task scene analysis model, the target detection model, the audio enhancement model and the fusion model are as follows:
As shown in fig. 5, fig. 5 is a flowchart of a method for training a scene analysis model according to an embodiment of the present application, including:
step 501, transmitting a sample video to a backbone network to obtain feature vectors of a single picture in the sample video.
In practical applications, when extracting the feature vectors, the scene analysis model may, but is not limited to, use a preset convolutional neural network for feature extraction.
Step 502, performing semantic segmentation training, depth estimation training and attention inference training respectively according to the feature vectors.
Step 503, obtaining the trained multi-task scene analysis model according to the training results of the semantic segmentation training, the depth estimation training and the attention inference training.
Illustratively, after feature extraction is completed, the task heads of two or more downstream task nodes are connected in parallel. In this embodiment there are three downstream tasks: the semantic segmentation task, the depth estimation task, and the attention inference task. The semantic segmentation task is used for segmenting three types of semantic subjects, namely persons, animals and background. The depth estimation task is used to estimate or calculate the relative depth (typically in the range 0-255) of each pixel from a single picture. The attention inference task is used to infer which regions in the picture are the regions the user pays attention to. The inferred results are packed and compressed into the metadata of the image for use by the audio enhancement module and the final fusion module.
The embodiment of the present application adopts a multi-head network for multi-task training. During training, the different tasks can be trained separately; for the related training procedures, reference may be made to the description in the related art.
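For illustration, a minimal multi-head network of this shape is sketched below: a shared convolutional backbone feeding three parallel heads, with one loss per task summed into a joint training step. The layer sizes, the losses and the 0-255 depth scaling are assumptions, not the application's actual architecture.

```python
import torch
import torch.nn as nn

class SceneParser(nn.Module):
    """Shared backbone with three parallel task heads: semantic segmentation
    (person / animal / background), per-pixel relative depth (0-255), and an
    attention map."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(64, 3, 1)     # 3 semantic classes
        self.depth_head = nn.Conv2d(64, 1, 1)   # relative depth
        self.attn_head = nn.Conv2d(64, 1, 1)    # attention logits

    def forward(self, frame: torch.Tensor):
        feat = self.backbone(frame)                             # (B, 64, H, W)
        seg = self.seg_head(feat)                               # (B, 3, H, W) logits
        depth = torch.sigmoid(self.depth_head(feat)) * 255.0    # mapped to 0-255
        attn = torch.sigmoid(self.attn_head(feat))              # attention in [0, 1]
        return seg, depth, attn

def training_step(model, frame, seg_gt, depth_gt, attn_gt, optimizer):
    """One joint multi-task step; seg_gt is (B, H, W) long, the others are float maps."""
    seg, depth, attn = model(frame)
    loss = (nn.functional.cross_entropy(seg, seg_gt)
            + nn.functional.l1_loss(depth.squeeze(1), depth_gt)
            + nn.functional.binary_cross_entropy(attn.squeeze(1), attn_gt))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```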
As shown in fig. 6, fig. 6 is a flowchart of a method for training an audio enhancement model according to an embodiment of the present application, including:
In step 601, the initial mono audio in a sample video is extracted, wherein the sample video further includes marked attention areas and custom attention areas.
As a possible way of implementing the embodiment of the present application, when performing audio enhancement model training, training may be continued using the training result described in fig. 5 as an input of this step.
As another possible manner of the embodiment of the present application, the sample video used for audio enhancement model training may have its face regions, attention regions and training audio labelled respectively before being input into the audio enhancement model.
The embodiment of the application does not limit the mode of the sample video input by the audio enhancement model.
Step 602, performing enhancement processing on the initial face frame in the sample video to obtain a target face frame.
Illustratively, the enhancement processing of the initial face frame includes, but is not limited to, flattening, feature combination, feature accumulation and other enhancement operations to obtain the target face frame.
Step 603, training the audio enhancement model based on the initial mono audio, the attention area with the identifier, the custom attention area, and the target face frame, so as to obtain a trained audio enhancement model.
Taking a person in the sample video as an example, the goal of the training is to enhance the audio of the speaking subject according to the attention region and the target face frame, thereby improving the audio effect on the basis of the attention in the video picture.
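A small sketch of such a training step is shown below; the face-box "enhancement" (flatten / combine / accumulate), the way masks and boxes are concatenated into a conditioning vector, the model's call signature, and the use of an L1 loss against an assumed clean reference track are all illustrative assumptions, since the application does not specify these details.

```python
import torch
import torch.nn as nn

def enhance_face_boxes(face_boxes: torch.Tensor) -> torch.Tensor:
    """Illustrative 'enhancement' of the training face boxes: flatten the boxes,
    then accumulate simple derived features (width, height, area)."""
    # face_boxes: (N, 4) as (x0, y0, x1, y1), normalised to [0, 1]
    w = (face_boxes[:, 2] - face_boxes[:, 0]).unsqueeze(1)
    h = (face_boxes[:, 3] - face_boxes[:, 1]).unsqueeze(1)
    area = w * h
    combined = torch.cat([face_boxes, w, h, area], dim=1)   # feature combination
    return combined.flatten()                                # flattening

def train_step(audio_model: nn.Module, mono: torch.Tensor, attn_mask: torch.Tensor,
               custom_mask: torch.Tensor, face_boxes: torch.Tensor,
               target: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One illustrative optimisation step; `target` is an assumed clean reference
    for the speaking subject."""
    cond = torch.cat([attn_mask.flatten(), custom_mask.flatten(),
                      enhance_face_boxes(face_boxes)])
    pred = audio_model(mono, cond)                 # the model signature is assumed
    loss = nn.functional.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```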
As shown in fig. 7, fig. 7 is a flowchart of a method for training a fusion model according to an embodiment of the present application, including:
step 701, extracting an image subject and an attention area identified in a sample video.
As a possible way of implementing the embodiment of the present application, when performing the fusion model training, the training result described in fig. 5 may be used as the input of this step to continue the training.
As another possible manner of the embodiment of the present application, the sample video used for fusion model training may have its image main body, image background and attention area labelled respectively before being input into the fusion model.
Step 702, performing picture fusion training according to picture depth information, an image main body and an attention area in the sample video.
When the picture fusion training is performed, the training is carried out according to the fusion of the image main body and the attention area, so that the localization of the image main body can subsequently be enhanced when the recorded picture switches.
And 703, performing video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
The trained picture fusion is then combined with the audio, so that the audio effect is improved on the basis of the attention in the video picture.
Further, in a possible implementation manner of this embodiment, a target detection model of the SSD type (Single Shot MultiBox Detector, SSD) is used when training the face detection model. As shown in fig. 8, the model has eight localization layers corresponding to four scales (tiny, small, medium and large), and the backbone network has 25 convolutional layers. Because the hardware support required by this model is large, in order to deploy the model on a hardware device (such as a mobile phone), a pruning process is performed on the training result of the target detection model SSD to generate the final target detection model; compared with the model before pruning, the pruned target detection model can still realize the target detection function.
It should be noted that whether the pruning operation is executed depends on the data processing capability of the hardware device on which the model is deployed: if the data processing capability of the hardware device is sufficient to support the unpruned target detection model, the pruning operation does not need to be executed, and the target detection model is obtained directly from the training result of the target detection model.
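As an illustration of the pruning step, the sketch below applies structured channel pruning to every convolution of a trained detector with torch.nn.utils.prune; the 30% ratio and the L1 criterion are arbitrary example values, not those used for the SSD model here.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_detector(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Channel-prune every convolution of a trained detector so it fits on
    weaker hardware; the ratio and criterion are illustrative only."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # Zero out `amount` of the output channels with the smallest L1 norm
            # (the weights stay in place but are masked to zero).
            prune.ln_structured(module, name="weight", amount=amount, n=1, dim=0)
            prune.remove(module, "weight")   # make the pruning permanent
    return model

# Usage sketch: prune only when the deployment hardware cannot run the full model.
# if not hardware_supports_full_model:   # hypothetical capability check
#     ssd = prune_detector(ssd, amount=0.3)
```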
It should be noted that the number of iterations, the number of samples, the choice of loss function, etc. in each model training are not restricted in the embodiment of the present application; reference may be made to the related description in the related art, and the training process is not specifically limited here.
Corresponding to the video speech enhancement method, the present application also provides a video speech enhancement apparatus. Since the apparatus embodiment corresponds to the above-mentioned method embodiment, details not disclosed in the apparatus embodiment may refer to the method embodiment and are not repeated here.
Fig. 9 is a schematic structural diagram of a video and voice enhancement device according to an embodiment of the present application, where, as shown in fig. 9, the device includes:
a first obtaining unit 901, configured to obtain picture depth information, a semantic main body after semantic segmentation, and an attention area in a video to be enhanced;
A second obtaining unit 902, configured to obtain a face area in the video to be enhanced;
a first enhancement processing unit 903, configured to perform audio enhancement processing according to the audio in the video to be enhanced, the face region, and the attention region, so as to obtain enhanced audio;
and the second enhancement processing unit 904 is configured to fuse the picture depth information, the semantically segmented semantic main body and the enhanced audio to obtain a target video.
According to the video speech enhancement apparatus, the picture depth information, the semantically segmented semantic main body and the attention region in the video to be enhanced are obtained; the face region in the video to be enhanced is obtained; audio enhancement processing is performed according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio; and the picture depth information, the semantically segmented semantic main body and the enhanced audio are fused to obtain a target video. Compared with the related art, when the audio in the video to be enhanced is enhanced, it is fused with the corresponding face region and attention region, so that audio enhancement and picture enhancement proceed synchronously: the sound in the video is enhanced, the display picture of the video to be enhanced is enhanced as well, and the overall video effect is improved.
Further, in a possible implementation manner of this embodiment, as shown in fig. 10, the first enhancement processing unit 903 includes:
a decoding module 9031, configured to decode audio in the video to be enhanced;
an alignment module 9032, configured to align the decoded audio according to the face region and the attention region, to obtain a first audio;
a noise cancellation module 9033, configured to perform background noise cancellation processing on the first audio to obtain a second audio;
and the enhancement processing module 9034 is configured to perform soundtrack enhancement processing on the second audio to obtain the enhanced audio.
Further, in a possible implementation manner of this embodiment, as shown in fig. 10, the second enhancing unit 904 includes:
a replacing module 9041, configured to replace the original audio in the video to be enhanced with the enhanced audio;
the fusion processing module 9042 is configured to perform fusion processing on the semantic main body after semantic segmentation and the attention area to obtain a fusion area of the video to be enhanced;
the blurring processing module 9043 is configured to perform blurring processing on the areas other than the fusion area to obtain a blurred area of the video to be enhanced, wherein the size of the blur kernel is scaled according to the picture depth value;
and the enhancement processing module 9044 is used for performing encoding and fusion processing on the enhanced audio together with the fusion area and the blurred area of the video to be enhanced to obtain a target video.
Further, in a possible implementation manner of this embodiment, as shown in fig. 10, the first obtaining unit 901 is further configured to:
inputting the video to be enhanced into a multi-task scene analysis model to acquire the picture depth information of each pixel of a single picture of the video to be enhanced, the semantic main body of the single picture, and the attention area of the single picture.
Further, in a possible implementation manner of this embodiment, as shown in fig. 10, the apparatus further includes:
the construction unit 905 is configured to construct a video speech enhancement model before the first obtaining unit obtains the picture depth information, the semantically segmented semantic body and the attention area in the video to be enhanced, where the video speech enhancement model includes a scene analysis module, a face detection module, an audio enhancement module and a fusion module.
Further, in a possible implementation manner of this embodiment, the building unit is further configured to:
configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
configuring the face detection module to use a target detection model for realizing face region detection in the video;
configuring the audio enhancement module to use a trained audio enhancement model, wherein the analysis result of the attention inference task in the multi-task scene analysis model, the face detection result of the target detection model and the audio in the video are taken as inputs of the audio enhancement module to realize audio enhancement;
and configuring the fusion module to use a trained fusion model, wherein the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module are taken as inputs of the fusion model to realize fusion of the enhanced audio and the video.
Further, in one possible implementation manner of the present embodiment, as shown in fig. 10, the audio enhancement model training unit 906 includes:
extracting initial single track audio in a sample video, wherein the sample video also comprises an attention area with a mark and a self-defined attention area;
Performing enhancement processing on the initial face frame in the sample video to obtain a target face frame;
and training the audio enhancement model based on the initial single-track audio, the marked attention area, the custom attention area and the target face frame to obtain a trained audio enhancement model.
Further, in one possible implementation manner of the present embodiment, as shown in fig. 10, the scene analysis model training unit 907 includes:
transmitting a sample video to a backbone network to obtain a feature vector of a single picture in the sample video;
and respectively carrying out semantic segmentation training, depth estimation training and attention inference training according to the feature vectors to obtain the trained multi-task scene analysis model.
Further, in one possible implementation manner of this embodiment, as shown in fig. 10, the fusion model training unit 908 includes:
extracting an image main body and an attention area marked in a sample video;
performing picture fusion training according to picture depth information, an image main body and an attention area in a sample video;
and carrying out video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
Further, in a possible implementation manner of this embodiment, as shown in fig. 10, pruning is performed on the training result of the target detection model to obtain the deployed target detection model.
The foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment; the principle is the same and is not repeated here.
According to embodiments of the present application, there is also provided an electronic device, a chip, a readable storage medium and a computer program product.
The application also provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method of video speech enhancement described in the above embodiments.
Fig. 11 illustrates a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 11, the apparatus 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1102 or a computer program loaded from a storage unit 1108 into a RAM (Random Access Memory ) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An I/O (Input/Output) interface 1105 is also connected to bus 1104.
Various components in device 1100 are connected to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a CPU (Central Processing Unit ), a GPU (Graphic Processing Units, graphics processing unit), various dedicated AI (Artificial Intelligence ) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor ), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the various methods and processes described above, such as the method of video speech enhancement. For example, in some embodiments, the method of video speech enhancement may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the aforementioned method of video speech enhancement by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present application may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (23)

1. A method of video speech enhancement, comprising:
acquiring picture depth information, a semantically segmented image main body and an attention area in a video to be enhanced;
acquiring a face region in the video to be enhanced;
performing audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region to obtain enhanced audio;
and fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video.
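As a non-limiting illustration of the flow recited in claim 1, the following minimal Python sketch wires the four stages together; the callables scene_parser, face_detector, audio_enhancer and fuser, and the SceneAnalysis container, are hypothetical stand-ins rather than the claimed models.

    # Illustrative sketch only; the model callables are assumed placeholders.
    from dataclasses import dataclass
    from typing import Any, Callable, Sequence, Tuple

    @dataclass
    class SceneAnalysis:
        depth_map: Any                                # per-pixel picture depth information
        subject_mask: Any                             # semantically segmented image main body
        attention_region: Tuple[int, int, int, int]   # attention area (x, y, w, h)

    def enhance_video_speech(
        video: Any,
        scene_parser: Callable[[Any], SceneAnalysis],
        face_detector: Callable[[Any], Sequence[Tuple[int, int, int, int]]],
        audio_enhancer: Callable[..., Any],
        fuser: Callable[..., Any],
    ) -> Any:
        """Parse the scene, detect faces, enhance the audio, then fuse into the target video."""
        analysis = scene_parser(video)           # picture depth, image main body, attention area
        face_regions = face_detector(video)      # face areas in the video to be enhanced
        enhanced_audio = audio_enhancer(video.audio, face_regions, analysis.attention_region)
        return fuser(analysis.depth_map, analysis.subject_mask, enhanced_audio, video)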
2. The method according to claim 1, wherein performing audio enhancement processing according to the audio in the video to be enhanced, the face region and the attention region to obtain the enhanced audio comprises:
decoding audio in the video to be enhanced;
aligning the decoded audio according to the face area and the attention area to obtain a first audio;
performing background noise elimination processing on the first audio to obtain a second audio;
and carrying out sound track enhancement processing on the second audio to obtain a third audio.
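A minimal sketch of the audio chain in claim 2, assuming hypothetical helpers decode_audio, align_to_regions, denoise and enhance_tracks standing in for the four recited steps:

    # Illustrative only; the four helpers are assumed stand-ins for the recited steps.
    def audio_enhancement(video, face_regions, attention_region,
                          decode_audio, align_to_regions, denoise, enhance_tracks):
        decoded = decode_audio(video)                                             # decode the audio in the video
        first_audio = align_to_regions(decoded, face_regions, attention_region)   # align to face and attention areas
        second_audio = denoise(first_audio)                                       # background noise elimination
        third_audio = enhance_tracks(second_audio)                                # sound track enhancement
        return third_audio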
3. The method of claim 2, wherein fusing the picture depth information, the semantically segmented image body, and the enhanced audio to obtain a target video comprises:
determining the original audio in the video to be enhanced based on the third audio;
performing fusion processing on the semantically segmented image main body and the attention area to obtain a fusion area of the video to be enhanced;
performing blurring processing on other areas except the fusion area to obtain a blurred area of the video to be enhanced, wherein the size of the blurred area is obtained by scaling according to a picture depth value;
and respectively carrying out coding fusion processing on the third audio of the video to be enhanced and the fusion area and the blurred area to obtain the target video.
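One plausible reading of the depth-scaled blurring in claim 3, sketched per frame; numpy and scipy are assumed dependencies, and the way the blur strength is derived from the picture depth value is an assumption, not the claimed formula.

    # Illustrative only; the depth-to-sigma mapping is an assumed interpretation.
    import numpy as np
    from scipy.ndimage import gaussian_filter  # assumed dependency

    def fuse_frame(frame, subject_mask, attention_mask, depth_map, base_sigma=1.5):
        """Keep the fusion area sharp; blur the remaining area with depth-scaled strength."""
        fusion_mask = np.logical_or(subject_mask, attention_mask)          # fusion area of the frame
        outside = ~fusion_mask
        background_depth = float(depth_map[outside].mean()) if outside.any() else 0.0
        sigma = base_sigma * (1.0 + background_depth)                      # blur scaled by picture depth value
        blurred = np.stack(
            [gaussian_filter(frame[..., c].astype(np.float32), sigma=sigma)
             for c in range(frame.shape[-1])],
            axis=-1,
        )
        return np.where(fusion_mask[..., None], frame, blurred.astype(frame.dtype))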
4. The method according to claim 1, wherein the obtaining the picture depth information, the semantically segmented image body and the attention area in the video to be enhanced comprises:
inputting the video to be enhanced into a multi-task scene analysis model to acquire the picture depth information of each pixel of a single picture of the video to be enhanced, the image main body of the single picture of the video to be enhanced and the attention area of the single picture of the video to be enhanced.
5. The method according to any one of claims 1-4, wherein prior to obtaining picture depth information, semantically segmented image bodies and attention areas in the video to be enhanced, the method further comprises:
constructing a video voice enhancement model, wherein the video voice enhancement model comprises a scene analysis module, a face detection module, an audio enhancement module and a fusion module.
6. The method of claim 5, wherein building a video speech enhancement model comprises:
configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
the face detection module is configured to use a target detection model for realizing face region detection in the video;
the audio enhancement module is configured to use an audio enhancement model, and is used for taking an analysis result of an attention inference task in the multi-task scene analysis model, a face detection result of the target detection model and audio in the video to be enhanced as input of the audio enhancement module so as to enhance the audio;
and configuring the fusion module to use a fusion model, which takes the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module as input, so as to realize fusion of the enhanced audio and the video.
7. The method of claim 6, wherein training the audio enhancement model comprises:
extracting initial single-track audio in a sample video, wherein the sample video further comprises a marked attention area and a self-defined attention area;
performing enhancement processing on the initial face frame in the sample video to obtain a target face frame;
and training the audio enhancement model based on the initial single-track audio, the marked attention area, the self-defined attention area and the target face frame to obtain a trained audio enhancement model.
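A sketch of how one training sample for the audio enhancement model of claim 7 might be assembled; extract_mono_audio, enhance_face_box and the attribute names on sample_video are hypothetical.

    # Illustrative only; helper functions and attribute names are assumptions.
    def build_audio_training_sample(sample_video, extract_mono_audio, enhance_face_box):
        mono_audio = extract_mono_audio(sample_video)                       # initial single-track audio
        marked_attention = sample_video.marked_attention_area               # annotated attention area
        custom_attention = sample_video.custom_attention_area               # self-defined attention area
        target_face_box = enhance_face_box(sample_video.initial_face_box)   # enhanced (target) face frame
        return mono_audio, marked_attention, custom_attention, target_face_box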
8. The method of claim 6, wherein training the scene parsing model comprises:
transmitting a sample video to a backbone network to obtain a feature vector of a single picture in the sample video;
and respectively carrying out semantic segmentation training, depth estimation training and attention inference training according to the feature vectors to obtain the trained multi-task scene analysis model.
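A rough PyTorch-style sketch of the shared-backbone, three-head training described in claim 8; the backbone, head shapes and loss weighting are assumptions rather than the claimed architecture.

    # Illustrative only; architecture details and losses are assumed, not the patented design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskSceneParser(nn.Module):
        def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.backbone = backbone                               # shared feature extractor
            self.seg_head = nn.Conv2d(feat_dim, num_classes, 1)    # semantic segmentation head
            self.depth_head = nn.Conv2d(feat_dim, 1, 1)            # depth estimation head
            self.attn_head = nn.Conv2d(feat_dim, 1, 1)             # attention inference head

        def forward(self, frames: torch.Tensor):
            feats = self.backbone(frames)                          # feature vectors of single pictures
            return self.seg_head(feats), self.depth_head(feats), self.attn_head(feats)

    def training_step(model, frames, seg_gt, depth_gt, attn_gt):
        seg, depth, attn = model(frames)
        loss = (F.cross_entropy(seg, seg_gt)
                + F.l1_loss(depth, depth_gt)
                + F.binary_cross_entropy_with_logits(attn, attn_gt))
        return loss   # joint semantic segmentation, depth estimation and attention inference training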
9. The method of claim 6, wherein training the fusion model comprises:
extracting an image main body and an attention area marked in a sample video;
performing picture fusion training according to picture depth information, an image main body and an attention area in a sample video;
and carrying out video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
10. The method of claim 6, wherein the method further comprises:
pruning the training result of the target detection model to obtain the target detection model.
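A minimal sketch of pruning a trained detector as in claim 10, using PyTorch's pruning utilities; the pruning criterion, ratio and targeted layer types are assumptions.

    # Illustrative only; criterion, ratio and layer selection are assumed choices.
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    def prune_detector(model: nn.Module, amount: float = 0.3) -> nn.Module:
        """L1 magnitude pruning over convolutional weights of a trained target detection model."""
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                prune.l1_unstructured(module, name="weight", amount=amount)  # zero the smallest weights
                prune.remove(module, "weight")                               # make the pruning permanent
        return model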
11. An apparatus for video speech enhancement, comprising:
the first acquisition unit is used for acquiring picture depth information, a semantically segmented image main body and an attention area in the video to be enhanced;
the second acquisition unit is used for acquiring the face area in the video to be enhanced;
the first enhancement processing unit is used for carrying out audio enhancement processing according to the audio in the video to be enhanced, the face area and the attention area to obtain enhanced audio;
and the second enhancement processing unit is used for fusing the picture depth information, the semantically segmented image main body and the enhanced audio to obtain a target video.
12. The apparatus of claim 11, wherein the first enhancement processing unit comprises:
the decoding module is used for decoding the audio in the video to be enhanced;
the alignment module is used for aligning the decoded audio according to the face area and the attention area to obtain a first audio;
the noise elimination module is used for carrying out background noise elimination processing on the first audio to obtain a second audio;
and the enhancement processing module is used for carrying out sound track enhancement processing on the second audio to obtain third audio.
13. The apparatus of claim 12, wherein the second enhancement processing unit comprises:
the determining module is used for determining the original audio in the video to be enhanced based on the enhanced audio;
the fusion processing module is used for carrying out fusion processing on the semantically segmented image main body and the attention area to obtain a fusion area of the video to be enhanced;
the blurring processing module is used for performing blurring processing on other areas except the fusion area to obtain a blurred area of the video to be enhanced, wherein the size of the blurred area is obtained by scaling according to the picture depth value;
and the enhancement processing module is used for respectively carrying out coding fusion processing on the third audio of the video to be enhanced and the fusion area and the blurred area of the video to be enhanced to obtain a target video.
14. The apparatus of claim 11, wherein the first acquisition unit is further configured to:
inputting the video to be enhanced into a multi-task scene analysis model to acquire the picture depth information of each pixel of a single picture of the video to be enhanced, the image main body of the single picture of the video to be enhanced and the attention area of the single picture of the video to be enhanced.
15. The apparatus according to any one of claims 11-14, wherein the apparatus further comprises:
the construction unit is used for constructing a video voice enhancement model before acquiring picture depth information, a semantically segmented image main body and an attention area in the video to be enhanced, wherein the video voice enhancement model comprises a scene analysis module, a face detection module, an audio enhancement module and a fusion module.
16. The apparatus of claim 15, wherein the construction unit is further configured to:
configuring the scene analysis module to use a multi-task scene analysis model for realizing scene analysis of the video, wherein the multiple tasks comprise a semantic segmentation task, a depth estimation task and an attention inference task;
the face detection module is configured to use a target detection model for realizing face region detection in the video;
the audio enhancement module is configured to use a trained audio enhancement model, which takes the analysis result of the attention inference task in the multi-task scene analysis model, the face detection result of the target detection model and the audio in the video to be enhanced as input, so as to realize audio enhancement;
and the fusion module is configured to use a trained fusion model, which takes the analysis results of the semantic segmentation task and the depth estimation task in the multi-task scene analysis model and the enhancement result of the audio enhancement module as input, so as to realize fusion of the enhanced audio and the video.
17. The apparatus of claim 16, wherein the audio enhancement model training unit comprises:
extracting initial single-track audio in a sample video, wherein the sample video further comprises a marked attention area and a self-defined attention area;
performing enhancement processing on the initial face frame in the sample video to obtain a target face frame;
and training the audio enhancement model by the initial single-track audio, the marked attention area, the self-defined attention area and the target face frame to obtain a trained audio enhancement model.
18. The apparatus of claim 16, wherein the scene parsing model training unit comprises:
transmitting a sample video to a backbone network to obtain a feature vector of a single picture in the sample video;
and respectively carrying out semantic segmentation training, depth estimation training and attention inference training according to the feature vectors to obtain the trained multi-task scene analysis model.
19. The apparatus of claim 16, wherein the fusion model training unit comprises:
extracting an image main body and an attention area marked in a sample video;
performing picture fusion training according to picture depth information, an image main body and an attention area in a sample video;
and carrying out video fusion training according to the fusion training result and the audio in the sample video to obtain a trained fusion model.
20. The apparatus according to claim 16, wherein the target detection model training unit is configured to prune the training result of the target detection model to obtain the target detection model.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method of any of claims 1-10.
23. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
CN202211447626.8A 2022-11-18 2022-11-18 Video voice enhancement method and device, electronic equipment and storage medium Active CN116343809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211447626.8A CN116343809B (en) 2022-11-18 2022-11-18 Video voice enhancement method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211447626.8A CN116343809B (en) 2022-11-18 2022-11-18 Video voice enhancement method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116343809A CN116343809A (en) 2023-06-27
CN116343809B true CN116343809B (en) 2024-04-02

Family

ID=86891797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211447626.8A Active CN116343809B (en) 2022-11-18 2022-11-18 Video voice enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343809B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110709924A (en) * 2017-11-22 2020-01-17 谷歌有限责任公司 Audio-visual speech separation
CN110808048A (en) * 2019-11-13 2020-02-18 联想(北京)有限公司 Voice processing method, device, system and storage medium
WO2020172828A1 (en) * 2019-02-27 2020-09-03 华为技术有限公司 Sound source separating method, apparatus and device
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN114549946A (en) * 2022-02-21 2022-05-27 中山大学 Cross-modal attention mechanism-based multi-modal personality identification method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10284956B2 (en) * 2015-06-27 2019-05-07 Intel Corporation Technologies for localized audio enhancement of a three-dimensional video


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Daniel Michelsanti et al. An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, Vol. 29, pp. 1368-1396. *
Jung-Wook Hwang et al. Efficient Audio-Visual Speech Enhancement Using Deep U-Net With Early Fusion of Audio and Video Information and RNN Attention Blocks. IEEE Access, 2021, pp. 137584-137598. *

Also Published As

Publication number Publication date
CN116343809A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
US11508366B2 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
US20210158533A1 (en) Image processing method and apparatus, and storage medium
CN112889108B (en) Speech classification using audiovisual data
CN107004287B (en) Avatar video apparatus and method
US11436863B2 (en) Method and apparatus for outputting data
US20210319809A1 (en) Method, system, medium, and smart device for cutting video using video content
US20230069197A1 (en) Method, apparatus, device and storage medium for training video recognition model
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN113159091B (en) Data processing method, device, electronic equipment and storage medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN114187624B (en) Image generation method, device, electronic equipment and storage medium
CN113901909B (en) Video-based target detection method and device, electronic equipment and storage medium
CN113365146B (en) Method, apparatus, device, medium and article of manufacture for processing video
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
US20230036338A1 (en) Method and apparatus for generating image restoration model, medium and program product
CN114663556A (en) Data interaction method, device, equipment, storage medium and program product
Novopoltsev et al. Fine-tuning of sign language recognition models: a technical report
CN113379877B (en) Face video generation method and device, electronic equipment and storage medium
US10936823B2 (en) Method and system for displaying automated agent comprehension
CN113223125B (en) Face driving method, device, equipment and medium for virtual image
CN112634413B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN116343809B (en) Video voice enhancement method and device, electronic equipment and storage medium
CN111260756B (en) Method and device for transmitting information
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant