CN115376188B - Video call processing method, system, electronic equipment and storage medium - Google Patents

Video call processing method, system, electronic equipment and storage medium

Info

Publication number
CN115376188B
CN115376188B (application number CN202210987630.7A)
Authority
CN
China
Prior art keywords
video
model
super
resolution
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210987630.7A
Other languages
Chinese (zh)
Other versions
CN115376188A (en)
Inventor
肖冠正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202210987630.7A
Publication of CN115376188A
Application granted
Publication of CN115376188B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 - Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a video call processing method, system, electronic device and storage medium. The method includes: obtaining a call video to be processed; performing face detection processing on the call video to be processed and determining a cropped video; performing data preprocessing on the cropped video to determine a preprocessing array; and inputting the preprocessing array into a pre-trained face super-resolution model for super-resolution processing to determine a super-resolution call video. Through the face super-resolution model, the application can restore low-resolution video to a high-definition call video, improving the clarity of video calls, and can be widely applied in the technical field of computer vision.

Description

Video call processing method, system, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computer vision, and in particular to a video call processing method, system, electronic device and storage medium.
Background
Currently, with the popularity of the internet, person-to-person communication has evolved from voice calls to video calls. Video calls place extremely high demands on network quality, because both parties must receive the video stream transmitted by the other party while transmitting their own video pictures. In the related art, the network state is detected: high-definition video is transmitted when network quality is good and low-definition video when it is poor, and the receiving end enlarges the received picture using a traditional linear interpolation method. This approach incurs heavy traffic and bandwidth consumption when transmitting high-definition video, and produces blurred edges, mosaic artifacts and similar phenomena when transmitting low-definition video.
Disclosure of Invention
In view of the above, embodiments of the present application provide a video call processing method, system, electronic device and storage medium, to solve at least one of the technical problems in the prior art.
In one aspect, the present application provides a video call processing method, including:
acquiring a call video to be processed;
performing face detection processing on the call video to be processed, and determining a cropped video;
performing data preprocessing on the cropped video to determine a preprocessing array;
and inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video.
Optionally, the performing face detection processing on the call video to be processed to determine a cropped video includes:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face region;
and cropping the face region to determine a cropped video.
Optionally, the performing data preprocessing on the cropped video to determine a preprocessing array includes:
performing frame-by-frame decoding on the cropped video to determine decoded data;
and performing data conversion on the decoded data to determine a preprocessing array.
Optionally, the face super-resolution model includes a generator model and a discriminator model, the discriminator model including a global image discriminator, an eye region discriminator, and a mouth region discriminator.
Optionally, the generator model includes a standard convolution layer, a depthwise separable convolution layer, a residual addition layer, and a sub-pixel convolution layer.
Optionally, before the preprocessing array is input into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video, the method further includes pre-training the face super-resolution model, and specifically includes:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the discrimination result.
Optionally, after the updating of the parameters of the face super-resolution model according to the discrimination result, the method further includes:
pruning the updated face super-resolution model to determine a pruned model;
performing secondary training on the pruned model to determine a trained model;
and quantizing the trained model to determine the face super-resolution model.
In another aspect, an embodiment of the present application further provides a video call processing system, including:
the first module is used for acquiring a call video to be processed;
the second module is used for performing face detection processing on the call video to be processed and determining a cropped video;
the third module is used for performing data preprocessing on the cropped video and determining a preprocessing array;
and the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video.
In another aspect, an embodiment of the present application further discloses an electronic device, which includes a processor, a memory, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory, where the program, when executed by the processor, implements the method described above.
In another aspect, an embodiment of the present application further discloses a storage medium, where the storage medium is a computer readable storage medium, and the storage medium stores one or more programs, and the one or more programs may be executed by one or more processors to implement the foregoing method.
In another aspect, embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
Compared with the prior art, the technical solution provided by the application has the following technical effects. The application obtains the call video to be processed, performs face detection processing on it, and determines a cropped video; this reduces interference from the video call background and lowers the network bandwidth requirement while retaining face information as far as possible. In addition, the application performs data preprocessing on the cropped video to determine a preprocessing array, inputs the preprocessing array into a pre-trained face super-resolution model for super-resolution processing, and determines a super-resolution call video. Through the face super-resolution model, low-resolution video can be restored to a high-definition call video, improving the clarity of the video call.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a video call processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of the generator model of a video call processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the related art, high-definition transmission of video calls is generally handled by detecting the network state: high-definition video is transmitted when network quality is good, low-definition video when it is poor, and after the receiving end receives low-definition video it enlarges the picture using a traditional linear interpolation method. This approach incurs heavy traffic and bandwidth consumption when transmitting high-definition video, and produces blurred edges, mosaic artifacts and similar phenomena when transmitting low-definition video. Moreover, the linear interpolation used for picture enlargement is not specifically optimized for portrait scenes and cannot finely restore the texture details of the eyes and mouth. Because the whole video picture is scaled when transmitting low-definition video, when the face occupies a small proportion of the original video the information of the transmitted face region is further compressed by this overall scaling. This increases the difficulty of portrait restoration, the restored facial features easily blur into an indistinct mass, and the receiver finds it hard to see the other party's face clearly.
In view of this, the embodiment of the present application provides a video call processing method, which can be applied to a terminal, a server, or software running in a terminal or server. The terminal may be, but is not limited to, a tablet computer, a notebook computer, a desktop computer, and the like. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big-data and artificial-intelligence platforms. Referring to fig. 1, the method mainly includes the following steps:
s101, acquiring a call video to be processed;
s102, carrying out face detection processing on the call video to be processed, and determining a cut video;
s103, carrying out data preprocessing on the cut video, and determining a preprocessing array;
and S104, inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video.
In the embodiment of the application, the call video to be processed is obtained first: after a video call connection is established between a video sending terminal and a video receiving terminal, the video sending terminal obtains a video stream through its camera. The video sending terminal and the video receiving terminal may be devices with cameras, such as notebook computers and mobile phones. The embodiment then performs face detection on the call video to be processed, detects the position and size of the user's face in the picture through a face detection algorithm, and crops the portrait region out of the call video to obtain the cropped video. Next, the cropped video undergoes data preprocessing to obtain a preprocessing array, which is input into a pre-trained face super-resolution model for super-resolution processing to obtain a super-resolution call video with a high-definition portrait picture. By cropping out the portrait part of the video picture, the video can be encoded and transmitted to the video receiving end at low resolution, reducing the network bandwidth required for call video transmission. In addition, by replacing the traditional linear interpolation method with the face super-resolution model to restore low-resolution portrait video to high resolution, the embodiment avoids the unclear picture edges and severe mosaic artifacts caused by linear-interpolation upscaling; it can restore a low-resolution portrait to a natural, realistic, high-definition portrait that retains a large amount of detail, effectively guaranteeing the image quality of the video call.
Further as a preferred embodiment, the performing face detection processing on the call video to be processed to determine a cropped video includes:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face region;
and cropping the face region to determine a cropped video.
In the embodiment of the application, the face region in the call video to be processed is detected by a face detection algorithm; target detection algorithms such as YOLO or SSD may be used. It should be noted that the face detection algorithm is trained with annotation and anchor boxes of aspect ratio 16:9, so that the bounding box contains the whole face. If no face exists in the picture, the current global picture is used instead. Finally, cropping is performed according to the detected face region to obtain the cropped video, as in the sketch below.
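By way of illustration only, the following sketch shows how a detected face box might be expanded to a 16:9 crop and downscaled to the transmission resolution. The detector itself (YOLO or SSD above) is abstracted away; the (x, y, w, h) box format, the helper name crop_face_region and the 320 × 180 target are illustrative assumptions consistent with this embodiment, not prescribed by the patent.

```python
import cv2
import numpy as np

def crop_face_region(frame: np.ndarray, box) -> np.ndarray:
    """Crop a 16:9 region containing the whole face; fall back to the full frame."""
    if box is None:
        # No face in the picture: use the current global picture instead.
        return cv2.resize(frame, (320, 180), interpolation=cv2.INTER_AREA)
    fh, fw = frame.shape[:2]
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    # Grow the detected box to aspect ratio 16:9, clamped to the frame size.
    crop_w = min(max(w, h * 16.0 / 9.0), fw)
    crop_h = min(crop_w * 9.0 / 16.0, fh)
    x0 = int(min(max(cx - crop_w / 2.0, 0), fw - crop_w))
    y0 = int(min(max(cy - crop_h / 2.0, 0), fh - crop_h))
    crop = frame[y0:y0 + int(crop_h), x0:x0 + int(crop_w)]
    # Downscale to the low transmission resolution before encoding.
    return cv2.resize(crop, (320, 180), interpolation=cv2.INTER_AREA)
```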
Further as a preferred embodiment, the performing data preprocessing on the cropped video to determine a preprocessing array includes:
performing frame-by-frame decoding processing on the cut video to determine decoding data;
and performing data conversion processing on the decoded data to determine a preprocessing array.
In the embodiment of the application, after the video receiving end acquires the video stream transmitted by the video sending end, it decodes the stream frame by frame to obtain decoded data. The decoded data then undergoes data conversion into a four-dimensional array of size 1 × 3 × 320 × 180, where 1 is the batch size, 3 is the number of RGB image channels, 320 is the image width and 180 is the image height. The array element type is uint8; each element holds the pixel value of the corresponding RGB channel, in the range 0-255. A sketch of this preprocessing follows.
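A minimal sketch of this conversion, assuming frames arrive as OpenCV BGR images; the function name is illustrative. Note that the code uses the conventional NCHW layout (1, 3, 180, 320), which holds the same data the patent writes width-first as 1 × 3 × 320 × 180.

```python
import cv2
import numpy as np

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert one decoded frame into the uint8 preprocessing array."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # OpenCV decodes to BGR
    rgb = cv2.resize(rgb, (320, 180))                 # width x height = 320 x 180
    arr = rgb.transpose(2, 0, 1)[np.newaxis, ...]     # HWC -> NCHW, add batch dim
    # Elements are raw RGB pixel values in 0-255, dtype uint8.
    assert arr.dtype == np.uint8 and arr.shape == (1, 3, 180, 320)
    return arr
```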
Further as a preferred embodiment, the face super-resolution model includes a generator model and a discriminator model including a global image discriminator, an eye region discriminator, and a mouth region discriminator.
In the embodiment of the application, the face super-resolution model is obtained through model training with a generative adversarial network (GAN) method. The training process uses a global image discriminator model, an eye region discriminator model, a mouth region discriminator model, and a generator model. The three discriminator models can adopt classical classification network structures such as VGG16, VGG19 or ResNet34, as in the sketch below.
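As an illustration, the three discriminators could be built from one of the classification backbones named above with the classifier head replaced by a single real/fake logit. This sketch uses torchvision's ResNet34 (torchvision 0.13+ API) purely as an example; the patent does not fix the backbone.

```python
import torch.nn as nn
from torchvision.models import resnet34

def make_discriminator() -> nn.Module:
    # ResNet34 backbone with the 1000-class head swapped for one logit.
    net = resnet34(weights=None)
    net.fc = nn.Linear(net.fc.in_features, 1)
    return net

d_global = make_discriminator()  # judges the whole portrait picture
d_eye = make_discriminator()     # judges the cropped eye region
d_mouth = make_discriminator()   # judges the cropped mouth region
```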
Further as a preferred embodiment, the generator model comprises a normal convolution layer, a depth separable convolution layer, a residual addition layer, and a sub-pixel convolution layer.
Referring to fig. 2, the generator model is built on the VDSR residual-learning network and made lightweight by replacing ordinary convolutions with depthwise separable convolutions while reducing the number of channels and convolution layers. The input size of the generator model of the embodiment of the present application is 1 × 3 × 320 × 180 (batch size × channels × width × height), and the output size is 1 × 3 × 1280 × 720 (batch size × channels × width × height). A sketch with illustrative channel counts follows.
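The following sketch wires together the layer types listed above: an ordinary convolution, depthwise separable convolutions, a residual addition, and sub-pixel (PixelShuffle) upsampling. The channel count (32) and block count (6) are illustrative assumptions; only the layer types and the 4× input/output sizes come from the patent. Tensors use NCHW layout.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class Generator(nn.Module):
    def __init__(self, ch: int = 32, n_blocks: int = 6, scale: int = 4):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)  # ordinary convolution layer
        self.body = nn.Sequential(*[DepthwiseSeparableConv(ch) for _ in range(n_blocks)])
        self.tail = nn.Sequential(
            nn.Conv2d(ch, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                 # sub-pixel convolution layer
        )

    def forward(self, x):
        feat = self.head(x)
        feat = feat + self.body(feat)               # residual addition layer
        return self.tail(feat)

# 320 x 180 input (NCHW: 1 x 3 x 180 x 320) -> 1280 x 720 output.
y = Generator()(torch.rand(1, 3, 180, 320))
assert y.shape == (1, 3, 720, 1280)
```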
Further as a preferred embodiment, before the preprocessing array is input into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video, the method further includes pre-training the face super-resolution model, and specifically includes:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the identification result.
In the embodiment of the application, the training data set is high-definition portrait video captured by a camera. After the portrait video is read frame by frame, a portrait region image with aspect ratio 16:9 is cropped out using a face detection algorithm; the portrait image is then scaled to 1280 × 720 resolution and saved as a PNG picture to serve as a label sample. For each label sample, bilinear, trilinear or area interpolation is randomly applied to reduce the image to 320 × 180. Operations such as Gaussian blur, JPEG compression, noise addition, brightness adjustment and white balance are then applied at random to simulate low-resolution images of real scenes and generate the training data set. In addition, to improve the robustness of the model, a small amount of real-scene data without portraits is introduced during training, improving the model's robustness to scene images that contain no portrait. Based on the generative adversarial network, the training data set is input into the generator model to obtain generated data; the generated data is input into the discriminator models to obtain discrimination results; and the parameters of the face super-resolution model are updated according to the discrimination results. It should be noted that the loss function of the generator model is designed as follows:
L_total = α·L_charbonnier + β·L_percep + γ·L_GAN + δ·L_comp_GAN + ε·L_comp_gram
where α, β, γ, δ and ε are hyperparameters, a suitable value scheme being 1.0, 0.05, 0.1 and 100. L_charbonnier is the Charbonnier loss of the global picture, expressed as L_charbonnier = sqrt((I_pred - I_gt)^2 + eps^2), where I_pred is the image output by the generator, I_gt is the label sample picture, and eps is a constant with a suitable value of 1e-4. L_percep is the perceptual loss of the global picture; the perceptual network uses a VGG19 network pre-trained on the ImageNet dataset for perceptual loss evaluation. L_GAN is the least-squares adversarial loss of the global picture. Adding the global perceptual loss and the global adversarial loss enhances the overall detail-restoration effect of the image. L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix feature L1 loss of the eye and mouth regions. The eye and mouth regions are separated from the image according to face key-point information detected by a face key-point detection model. Introducing the Gram matrix into the loss calculation enhances the restoration of eye and mouth texture details. Illustrative implementations of these loss terms are sketched below.
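A hedged sketch of the individual loss terms; the VGG19 perceptual term and the landmark-based region cropping are omitted for brevity, and all function names are illustrative.

```python
import torch

def charbonnier_loss(pred, gt, eps: float = 1e-4) -> torch.Tensor:
    # L_charbonnier = mean(sqrt((I_pred - I_gt)^2 + eps^2))
    return torch.sqrt((pred - gt) ** 2 + eps ** 2).mean()

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: N x C x H x W -> N x C x C channel-correlation (Gram) matrix.
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def gram_l1_loss(pred_feat, gt_feat) -> torch.Tensor:
    # L_comp_gram: L1 distance between Gram matrices of eye/mouth features.
    return (gram_matrix(pred_feat) - gram_matrix(gt_feat)).abs().mean()

def lsgan_g_loss(fake_logits) -> torch.Tensor:
    # Least-squares adversarial loss for the generator (real label = 1).
    return ((fake_logits - 1.0) ** 2).mean()
```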
Further as a preferred embodiment, after the updating of the parameters of the face super-resolution model according to the identification result, the method further includes:
pruning the updated face super-resolution model to determine a pruned model;
performing secondary training on the pruned model to determine a trained model;
and quantizing the trained model to determine the face super-resolution model.
In the embodiment of the application, the face super-resolution model is trained to a converged state; training is then stopped and the weights are saved. The model is pruned by removing low-contribution nodes from the model computation graph, constructing a pruned network model. Pruning reduces model accuracy, so the model is trained a second time with a smaller learning rate until it converges again. The model accuracy before and after pruning is compared, and the pruning result is kept only if the accuracy difference stays within a certain threshold, yielding the pruned model. Finally, int8 quantization is applied to the pruned model to obtain the face super-resolution model. It should be noted that quantization may also reduce the accuracy of the face super-resolution model, so further training can be performed according to the performance requirement until it is met. By quantizing the model to int8 after secondary training, the embodiment reduces the computing power and memory occupied by generator-model inference, greatly lowering the model's demands on device performance so that it can be deployed on terminal devices for real-time video super-resolution processing. A sketch of this compression pipeline follows.
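A rough sketch of the prune/retrain/quantize pipeline, using PyTorch's built-in magnitude pruning and eager-mode static quantization as stand-ins; the patent does not name a specific toolchain, and the tiny stand-in model, pruning amount and learning rate are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in generator; in practice this would be the trained model above.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

# 1) Pruning: zero out the 30% smallest-magnitude weights of each conv layer
#    (magnitude pruning as a stand-in for "removing low-contribution nodes").
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        prune.l1_unstructured(m, name="weight", amount=0.3)
        prune.remove(m, "weight")  # bake the pruning mask into the weights

# 2) Secondary training at a small learning rate until convergence.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# ... resume the GAN training loop here ...

# 3) Post-training int8 quantization (eager-mode static quantization; a
#    deployable model additionally needs QuantStub/DeQuantStub boundaries).
model.eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)
prepared(torch.rand(1, 3, 180, 320))  # calibration pass with sample data
int8_model = torch.quantization.convert(prepared)
```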
In one embodiment of the application, a first device initiates a video call request to a second device. The second device accepts the request, the two devices establish a video call connection, each opens its camera, and each prepares to send its video stream to the other party while receiving the other party's stream. The video sending end reads camera pictures frame by frame and detects the region where the face is located using a face detection algorithm. According to the face region information, a picture region with aspect ratio 16:9 containing the whole face is cropped from the picture; if no face exists in the picture, the current global picture is used instead. The cropped portrait region is scaled to 320 × 180 resolution and sent to the video receiving end. After receiving the video stream, the video receiving end decodes it frame by frame and performs the preprocessing operation, converting each image into a four-dimensional array of 1 × 3 × 320 × 180 (batch size × channels × width × height) with element type uint8, where each element is the pixel value of the corresponding RGB channel in the range 0-255. The preprocessed four-dimensional array is input into the portrait super-resolution model for super-resolution processing; the model outputs a four-dimensional array of size 1 × 3 × 1280 × 720 (batch size × channels × width × height), and removing the batch-size dimension yields a high-definition portrait image at 1280 × 720 resolution. The video receiving end renders and displays the generated high-definition portrait images frame by frame.
In another embodiment of the present application, the first device initiates a video call request to the second device, and the two parties agree on the resolution of the video to be sent according to the current network state and device capabilities. The second device accepts the request, the two devices establish a video call connection, each opens its camera, and each prepares to send its video stream while receiving the other party's stream. The video sending end reads camera pictures frame by frame and detects the face region using a face detection algorithm. Centered on the face region, a picture region with aspect ratio 16:9 containing the whole face is cropped from the picture according to the face region information; if no face exists, the current global picture is used instead. The cropped portrait region is scaled to the agreed video resolution. After receiving the video stream, the video receiving end decodes it frame by frame and performs the preprocessing operation, converting each image into a four-dimensional array of 1 × 3 × h × w (batch size × channels × video height × video width) with element type uint8, where each element is the pixel value of the corresponding RGB channel in the range 0-255. A super-resolution model with a suitable magnification is selected according to the current device performance, for example a 2× model (input width and height enlarged 2 times) or a 4× model (input width and height enlarged 4 times). The preprocessed four-dimensional array is input into the portrait super-resolution model for super-resolution processing, and removing the batch-size dimension from the model's four-dimensional output yields the super-resolved high-definition image. The receiving-end steps above are tied together in the sketch below.
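An illustrative receiving-end loop tying the steps above together. It reuses the preprocess helper and a trained, compressed model from the earlier sketches; cv2.VideoCapture stands in for the real call video stream, and the 0-1 pixel scaling is an assumption about how the generator was trained (the patent specifies only that the input array is uint8).

```python
import cv2
import numpy as np
import torch

cap = cv2.VideoCapture("incoming_call_stream.mp4")  # hypothetical stream source
model.eval()  # trained, pruned, quantized generator from the sketches above

with torch.no_grad():
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x = preprocess(frame)                     # uint8 array, (1, 3, 180, 320)
        x = torch.from_numpy(x).float() / 255.0   # scale pixels to [0, 1]
        y = model(x).clamp(0.0, 1.0)              # (1, 3, 720, 1280)
        hd = (y[0].permute(1, 2, 0).numpy() * 255.0).astype(np.uint8)
        cv2.imshow("super-resolved call", cv2.cvtColor(hd, cv2.COLOR_RGB2BGR))
        if cv2.waitKey(1) == 27:                  # Esc key stops playback
            break
cap.release()
cv2.destroyAllWindows()
```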
In another aspect, an embodiment of the present application further provides a video call processing system, including:
the first module is used for acquiring a call video to be processed;
the second module is used for performing face detection processing on the call video to be processed and determining a cropped video;
the third module is used for performing data preprocessing on the cropped video and determining a preprocessing array;
and the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video.
Referring to fig. 3, an embodiment of the present application further provides an electronic device including a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory; when executed by the processor, the program implements the method described above.
Corresponding to the method of fig. 1, an embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing one or more programs; the one or more programs are executable by one or more processors to implement the method described above.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In summary, the embodiments of the present application have the following advantages:
1. The embodiment of the application uses a face detection algorithm to detect and crop the face region out of the video call picture, exploiting the fact that the face is the core subject of the picture in a video call scene. By removing the background, which is a non-essential part of the scene, the requirement on network bandwidth is reduced while the face information, which is the essential part, is retained as far as possible.
2. The convolutional-neural-network-based portrait super-resolution model introduces a global mean absolute error loss, a global perceptual loss and a global adversarial loss, together with adversarial losses for the eye and mouth regions, so that fine restoration of the textures of key face regions such as the mouth and eyes can be achieved when low-definition portraits are restored to high definition.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (8)

1. A method for processing a video call, the method comprising:
acquiring a call video to be processed;
performing face detection processing on the call video to be processed, and determining a cropped video;
performing data preprocessing on the cropped video to determine a preprocessing array;
inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video;
the face super-resolution model comprises a generator model and a discriminator model, wherein the discriminator model comprises a global image discriminator, an eye region discriminator and a mouth region discriminator;
the generator model comprises a standard convolution layer, a depthwise separable convolution layer, a residual addition layer and a sub-pixel convolution layer;
the loss function of the generator model is designed as follows:
L_total = α·L_charbonnier + β·L_percep + γ·L_GAN + δ·L_comp_GAN + ε·L_comp_gram
where α, β, γ, δ and ε are hyperparameters; L_charbonnier is the Charbonnier loss of the global picture, L_charbonnier = sqrt((I_pred - I_gt)^2 + eps^2), where I_pred is the image output by the generator, I_gt is the label sample picture and eps is a constant; L_percep is the perceptual loss of the global picture; L_GAN is the least-squares adversarial loss of the global picture; and L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix feature L1 loss of the eye and mouth regions, wherein the global perceptual loss and the global adversarial loss are used to enhance the overall detail restoration of the image, and the Gram-matrix feature L1 loss is used to enhance the restoration of eye and mouth texture details.
2. The method of claim 1, wherein the performing face detection processing on the call video to be processed to determine a cropped video comprises:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face region;
and cropping the face region to determine a cropped video.
3. The method of claim 1, wherein the performing data preprocessing on the cropped video to determine a preprocessing array comprises:
performing frame-by-frame decoding on the cropped video to determine decoded data;
and performing data conversion on the decoded data to determine a preprocessing array.
4. The method according to claim 1, wherein before the inputting the preprocessing array into a pre-trained face super-resolution model for super-resolution processing to determine a super-resolution call video, the method further comprises pre-training the face super-resolution model, specifically comprising:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the discrimination result.
5. The method according to claim 4, wherein after the updating of the parameters of the face super-resolution model according to the discrimination result, the method further comprises:
pruning the updated face super-resolution model to determine a pruned model;
performing secondary training on the pruned model to determine a trained model;
and quantizing the trained model to determine the face super-resolution model.
6. A video call processing system, the system comprising:
the first module is used for acquiring a call video to be processed;
the second module is used for performing face detection processing on the call video to be processed and determining a cropped video;
the third module is used for performing data preprocessing on the cropped video and determining a preprocessing array;
the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video;
the face super-resolution model comprises a generator model and a discriminator model, wherein the discriminator model comprises a global image discriminator, an eye region discriminator and a mouth region discriminator;
the generator model comprises a standard convolution layer, a depthwise separable convolution layer, a residual addition layer and a sub-pixel convolution layer;
the loss function of the generator model is designed as follows:
L_total = α·L_charbonnier + β·L_percep + γ·L_GAN + δ·L_comp_GAN + ε·L_comp_gram
where α, β, γ, δ and ε are hyperparameters; L_charbonnier is the Charbonnier loss of the global picture, L_charbonnier = sqrt((I_pred - I_gt)^2 + eps^2), where I_pred is the image output by the generator, I_gt is the label sample picture and eps is a constant; L_percep is the perceptual loss of the global picture; L_GAN is the least-squares adversarial loss of the global picture; and L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix feature L1 loss of the eye and mouth regions, wherein the global perceptual loss and the global adversarial loss are used to enhance the overall detail restoration of the image, and the Gram-matrix feature L1 loss is used to enhance the restoration of eye and mouth texture details.
7. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory, the program, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 5.
8. A storage medium, being a computer-readable storage medium, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the method according to any one of claims 1 to 5.
CN202210987630.7A 2022-08-17 2022-08-17 Video call processing method, system, electronic equipment and storage medium Active CN115376188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210987630.7A CN115376188B (en) 2022-08-17 2022-08-17 Video call processing method, system, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115376188A CN115376188A (en) 2022-11-22
CN115376188B (en) 2023-10-24

Family

ID=84065885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210987630.7A Active CN115376188B (en) 2022-08-17 2022-08-17 Video call processing method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115376188B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711364A (en) * 2018-12-29 2019-05-03 成都视观天下科技有限公司 A kind of facial image super-resolution reconstruction method, device and computer equipment
CN111709878A (en) * 2020-06-17 2020-09-25 北京百度网讯科技有限公司 Face super-resolution implementation method and device, electronic equipment and storage medium
WO2021134872A1 (en) * 2019-12-30 2021-07-08 深圳市爱协生科技有限公司 Mosaic facial image super-resolution reconstruction method based on generative adversarial network
CN113139915A (en) * 2021-04-13 2021-07-20 Oppo广东移动通信有限公司 Portrait restoration model training method and device and electronic equipment
CN113554058A (en) * 2021-06-23 2021-10-26 广东奥普特科技股份有限公司 Method, system, device and storage medium for enhancing resolution of visual target image
CN113869282A (en) * 2021-10-22 2021-12-31 马上消费金融股份有限公司 Face recognition method, hyper-resolution model training method and related equipment
CN114119874A (en) * 2021-11-25 2022-03-01 华东师范大学 Single image reconstruction high-definition 3D face texture method based on GAN
WO2022110638A1 (en) * 2020-11-30 2022-06-02 深圳市慧鲤科技有限公司 Human image restoration method and apparatus, electronic device, storage medium and program product


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Adversarial-learning-based image-to-image transformation: A survey";Yuan Chen等;《Neurocomputing》;第411卷;第468-486页 *
基于级联生成对抗网络的人脸图像修复;陈俊周;王娟;龚勋;;电子科技大学学报(第06期);全文 *

Also Published As

Publication number Publication date
CN115376188A (en) 2022-11-22

Similar Documents

Publication Publication Date Title
US11410275B2 (en) Video coding for machine (VCM) based system and method for video super resolution (SR)
US20220021887A1 (en) Apparatus for Bandwidth Efficient Video Communication Using Machine Learning Identified Objects Of Interest
TWI805085B (en) Handling method of chroma subsampled formats in machine-learning-based video coding
CN113538287A (en) Video enhancement network training method, video enhancement method and related device
US11854164B2 (en) Method for denoising omnidirectional videos and rectified videos
Afonso et al. Spatial resolution adaptation framework for video compression
TWI807491B (en) Method for chroma subsampled formats handling in machine-learning-based picture coding
Li et al. End-to-end optimized 360° image compression
Guleryuz et al. Sandwiched image compression: Increasing the resolution and dynamic range of standard codecs
Bonnineau et al. Multitask learning for VVC quality enhancement and super-resolution
CN115442609A (en) Characteristic data encoding and decoding method and device
CN117441333A (en) Configurable location for inputting auxiliary information of image data processing neural network
CN115376188B (en) Video call processing method, system, electronic equipment and storage medium
Xia et al. Visual sensitivity-based low-bit-rate image compression algorithm
CN116847087A (en) Video processing method and device, storage medium and electronic equipment
WO2023010981A1 (en) Encoding and decoding methods and apparatus
Li et al. Pseudocylindrical convolutions for learned omnidirectional image compression
Yang et al. Graph-convolution network for image compression
CN117321989A (en) Independent localization of auxiliary information in neural network-based image processing
Jia et al. Deep convolutional network based image quality enhancement for low bit rate image compression
US11854165B2 (en) Debanding using a novel banding metric
KR102604657B1 (en) Method and Apparatus for Improving Video Compression Performance for Video Codecs
US11948275B2 (en) Video bandwidth optimization within a video communications platform
CN114697709B (en) Video transmission method and device
Watanabe et al. Traffic reduction in video call and chat using dnn-based image reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant