CN115376188B - Video call processing method, system, electronic equipment and storage medium - Google Patents
- Publication number: CN115376188B
- Application number: CN202210987630.7A
- Authority: CN (China)
- Prior art keywords: video, model, super-resolution, loss
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/161: Human faces, e.g. facial parts, sketches or expressions; detection, localisation, normalisation
- G06N3/08: Neural networks; learning methods
- G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/26: Segmentation of patterns in the image field
- G06V10/82: Image or video recognition or understanding using neural networks
- G06V20/40: Scenes; scene-specific elements in video content
- Y02T10/40: Engine management systems
Abstract
The application discloses a video call processing method, a system, electronic equipment and a storage medium, wherein the method comprises the steps of obtaining a call video to be processed; performing face detection processing on the call video to be processed, and determining a cut video; performing data preprocessing on the cut video to determine a preprocessing array; inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video; the application can restore the low-resolution video into the high-definition call video through the face super-resolution model, improves the definition of the video call, and can be widely applied to the technical field of computer vision.
Description
Technical Field
The application relates to the technical field of computer vision, in particular to a video call processing method, a video call processing system, electronic equipment and a storage medium.
Background
Currently, with the popularity of the internet, person-to-person communication has evolved from voice calls to video calls. A video call places extremely high demands on network quality because both parties must receive the video stream transmitted by the other party while transmitting their own video pictures. In the related art, the network state is detected: high-definition video is transmitted when network quality is good, low-definition video is transmitted when it is poor, and the receiving end receives and enlarges the picture using the traditional linear interpolation method. This approach consumes considerable traffic and bandwidth when transmitting high-definition video, while edge blurring, mosaic artifacts and similar phenomena appear when transmitting low-definition video.
Disclosure of Invention
In view of the above, embodiments of the present application provide a video call processing method, a system, an electronic device, and a storage medium, so as to solve one of the technical problems in the prior art.
In one aspect, the present application provides a video call processing method, including:
acquiring a call video to be processed;
performing face detection processing on the call video to be processed, and determining a cut video;
performing data preprocessing on the cut video to determine a preprocessing array;
and inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video.
Optionally, the performing face detection processing on the call video to be processed to determine a clipping video includes:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face area;
and cutting the face area to determine a cutting video.
Optionally, the performing data preprocessing on the cropped video to determine a preprocessing array includes:
performing frame-by-frame decoding processing on the cut video to determine decoding data;
and performing data conversion processing on the decoded data to determine a preprocessing array.
Optionally, the face super-resolution model includes a generator model and a discriminator model, the discriminator model including a global image discriminator, an eye region discriminator, and a mouth region discriminator.
Optionally, the generator model includes a normal convolution layer, a depth separable convolution layer, a residual addition layer, and a sub-pixel convolution layer.
Optionally, before the preprocessing array is input into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video, the method further includes pre-training the face super-resolution model, and specifically includes:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the discrimination result.
Optionally, after the updating of the parameters of the face super-resolution model according to the discrimination result, the method further includes:
pruning is carried out on the updated face super-resolution model, and a pruning model is determined;
performing secondary training treatment on the pruning model to determine a training model;
and carrying out quantization processing on the training model, and determining the face super-resolution model.
In another aspect, an embodiment of the present application further provides a system, including:
the first module is used for acquiring a call video to be processed;
the second module is used for carrying out face detection processing on the call video to be processed and determining a cut video;
the third module is used for carrying out data preprocessing on the cut video and determining a preprocessing array;
and the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determining a super-resolution call video.
In another aspect, an embodiment of the present application further discloses an electronic device, where the electronic device includes a processor, a memory, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, where the program, when executed by the processor, implements a method as described above.
In another aspect, an embodiment of the present application further discloses a storage medium, where the storage medium is a computer readable storage medium, and the storage medium stores one or more programs, and the one or more programs may be executed by one or more processors to implement the foregoing method.
In another aspect, embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
Compared with the prior art, the technical scheme provided by the application has the following technical effects: according to the application, the call video to be processed is obtained; the video to be processed is subjected to face detection processing, and the cut video is determined, so that the interference of video call background can be reduced, and the requirement on network bandwidth is reduced under the condition that face information is reserved as much as possible; in addition, the application determines a preprocessing array by carrying out data preprocessing on the cut video; inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video; the low-resolution video can be restored into the high-definition call video through the face super-resolution model, so that the definition of the video call is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a video call processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a generation model of a video call processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the related art, high-definition video calling is typically handled by detecting the network state: high-definition video is transmitted when network quality is good, low-definition video is transmitted when it is poor, and the receiving end enlarges the low-definition picture with the traditional linear interpolation method after receiving it. This approach consumes considerable traffic and bandwidth when transmitting high-definition video, while edge blurring, mosaic artifacts and similar phenomena appear when transmitting low-definition video. Moreover, linear interpolation is not optimized for portrait scenes and cannot finely restore the texture details of the eyes and mouth. Because the low-definition video is produced by scaling the whole picture, when the face occupies a small proportion of the original video, the information in the face region is compressed even further, increasing the difficulty of portrait restoration; the restored facial features then easily blur together, making it hard for the receiver to see the other party's face clearly.
In view of this, the embodiment of the present application provides a video call processing method, which can be applied to a terminal, a server, software running in a terminal or server, and the like. The terminal may be, but is not limited to, a tablet computer, a notebook computer, a desktop computer, etc. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms. Referring to fig. 1, the method mainly includes the steps of:
s101, acquiring a call video to be processed;
s102, carrying out face detection processing on the call video to be processed, and determining a cut video;
s103, carrying out data preprocessing on the cut video, and determining a preprocessing array;
and S104, inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video.
In the embodiment of the application, the call video to be processed is first obtained: after a video call connection is established between a video sending terminal and a video receiving terminal, the sending terminal captures the video stream through its camera, where both terminals may be devices with cameras, such as notebook computers and mobile phones. The embodiment then performs face detection on the call video to be processed, detects the position and size of the user's face in the picture through a face detection algorithm, and crops the portrait region out of the call video to obtain the cut video. Next, data preprocessing is performed on the cut video to obtain a preprocessing array, and the preprocessing array is input into a pre-trained face super-resolution model for super-resolution processing, yielding a super-resolution call video with a high-definition portrait picture. By cropping out the portrait part of the video picture, the video can be encoded and transmitted to the video receiving end at low resolution, reducing the network bandwidth required for transmitting the call video. In addition, by replacing the traditional linear interpolation method with the face super-resolution model, the embodiment restores low-resolution portrait video to high-resolution portrait video, solves the problems of unclear picture edges and severe mosaic artifacts caused by enlarging video with linear interpolation, can restore a low-resolution portrait to a natural, realistic, high-definition portrait that retains a large amount of detail, and effectively guarantees the image quality of the video call.
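The receive-side flow described above (crop, preprocess, super-resolve) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the box format are hypothetical, and a nearest-neighbour upscale stands in for the trained face super-resolution model.

```python
import numpy as np

def crop_face(frame, box):
    """Crop the detected face region; box = (top, left, height, width) is a
    hypothetical format for illustration."""
    t, l, h, w = box
    return frame[t:t + h, l:l + w]

def preprocess(crop):
    """HWC uint8 frame -> 1 x 3 x H x W uint8 array, as the text describes."""
    chw = np.transpose(crop, (2, 0, 1))   # HWC -> CHW
    return chw[np.newaxis, ...]           # add the batch dimension

def superresolve(batch, scale=4):
    """Stand-in for the face super-resolution model: nearest-neighbour 4x."""
    return batch.repeat(scale, axis=2).repeat(scale, axis=3)

# Simulated camera frame and a 320 x 180 crop (height 320, width 180,
# following the shape listed later in the text).
frame = np.random.randint(0, 256, (400, 300, 3), dtype=np.uint8)
lr = preprocess(crop_face(frame, (0, 0, 320, 180)))
hr = superresolve(lr)
```

With a real model, `superresolve` would be replaced by generator inference; the surrounding shapes and dtypes are unchanged.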
Further as a preferred embodiment, the performing face detection processing on the call video to be processed to determine a clip video includes:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face area;
and cutting the face area to determine a cutting video.
In the embodiment of the application, the face region in the call video to be processed is detected by a face detection algorithm, which may use target detection algorithms such as YOLO or SSD. It should be noted that the face detection algorithm is trained with anchor boxes annotated and predicted at a 16:9 aspect ratio, so that the bounding box contains the whole face. If no face is present in the picture, the current global picture is used instead. Finally, cropping is performed according to the detected face region to obtain the cut video.
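Expanding a detected face box to the required aspect ratio can be sketched as below. This is an assumption-laden illustration: the patent does not specify the expansion rule, the (top, left, height, width) box format is invented here, and the 16:9 ratio is treated as height:width to match the 320 x 180 crop shape given later (the translation leaves the orientation ambiguous).

```python
def expand_to_16_9(face_box, frame_h, frame_w):
    """Expand a face box to a 16:9 (height:width, see note above) region
    centred on the face, clamped to the frame boundaries."""
    top, left, h, w = face_box
    cy, cx = top + h / 2, left + w / 2
    # Grow the deficient side until height/width == 16/9 and the face fits.
    if h / w < 16 / 9:
        h = w * 16 / 9
    else:
        w = h * 9 / 16
    top = min(max(0, cy - h / 2), frame_h - h)
    left = min(max(0, cx - w / 2), frame_w - w)
    return int(top), int(left), int(round(h)), int(round(w))

# An 80 x 60 face box inside a 720 x 1280 frame.
box = expand_to_16_9((100, 100, 80, 60), 720, 1280)
```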
Further as a preferred embodiment, the performing data preprocessing on the cropped video to determine a preprocessing array includes:
performing frame-by-frame decoding processing on the cut video to determine decoding data;
and performing data conversion processing on the decoded data to determine a preprocessing array.
In the embodiment of the application, after the video receiving end receives the video stream transmitted by the video sending end, it decodes the stream frame by frame to obtain decoded data. Data conversion is then performed on the decoded data, converting each frame into a four-dimensional array of shape 1 x 3 x 320 x 180, where 1 is the batch size, 3 is the number of RGB image channels, 320 is the image height and 180 is the image width; the array element type is uint8, and each element is the pixel value of the corresponding RGB channel, with a value range of 0-255.
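The conversion step can be shown concretely: a decoded H x W x 3 frame becomes the 1 x 3 x H x W array whose elements are the raw 0-255 channel values. This sketch assumes the decoder yields an RGB numpy array; the function name is illustrative.

```python
import numpy as np

def to_nchw(frame):
    """Decoded RGB frame (H x W x 3, uint8) -> 1 x 3 x H x W uint8 array,
    preserving the raw 0-255 channel values as described in the text."""
    assert frame.dtype == np.uint8 and frame.ndim == 3 and frame.shape[2] == 3
    return np.ascontiguousarray(frame.transpose(2, 0, 1))[np.newaxis]

frame = np.zeros((320, 180, 3), dtype=np.uint8)
frame[10, 20] = (255, 128, 64)   # one RGB pixel to trace through the layout
arr = to_nchw(frame)
```

The traced pixel shows the channel correspondence: `arr[0, :, 10, 20]` holds the R, G and B values of the pixel at row 10, column 20.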
Further as a preferred embodiment, the face super-resolution model includes a generator model and a discriminator model including a global image discriminator, an eye region discriminator, and a mouth region discriminator.
In the embodiment of the application, the face super-resolution model is obtained through model training with a generative adversarial network (GAN) training method. The training process requires a global image discriminator model, an eye region discriminator model, a mouth region discriminator model, and a generator model. The global image discriminator, eye region discriminator and mouth region discriminator can adopt classical classification network structures such as VGG16, VGG19 or ResNet34.
Further as a preferred embodiment, the generator model comprises a normal convolution layer, a depth separable convolution layer, a residual addition layer, and a sub-pixel convolution layer.
Referring to fig. 2, the generator model can be built as a lightweight model by replacing ordinary convolutions with depth separable convolutions, while reducing the number of channels and the number of convolution layers, on the basis of the VDSR residual learning network. The input size of the generator model in the embodiment of the present application is 1 x 3 x 320 x 180 (batch size x channels x height x width), and the output size is 1 x 3 x 1280 x 720 (batch size x channels x height x width).
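Two building blocks named above can be illustrated numerically: the parameter saving of a depth separable convolution over an ordinary convolution (kernel size 3 and 64 channels are assumed figures, not taken from the patent), and the pixel-shuffle rearrangement performed by a sub-pixel convolution layer, which turns a 320 x 180 feature map into the 1280 x 720 output.

```python
import numpy as np

def conv_params(k, c_in, c_out):
    """Weight count of an ordinary k x k convolution."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Weight count of a depthwise k x k conv plus a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

def pixel_shuffle(x, r):
    """Sub-pixel output step: (N, C*r^2, H, W) -> (N, C, H*r, W*r)."""
    n, c, h, w = x.shape
    x = x.reshape(n, c // (r * r), r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(n, c // (r * r), h * r, w * r)

saving = separable_params(3, 64, 64) / conv_params(3, 64, 64)
up = pixel_shuffle(np.zeros((1, 3 * 16, 320, 180), dtype=np.float32), 4)
```

For 3 x 3 kernels and 64 channels the separable variant needs roughly an eighth of the weights, which is what makes the lightweight generator feasible on terminal devices.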
Further as a preferred embodiment, before the preprocessing array is input into a pre-trained face super-resolution model to perform super-resolution processing and determine a super-resolution call video, the method further includes pre-training the face super-resolution model, and specifically includes:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the identification result.
In the embodiment of the application, the training data set is built from high-definition portrait video captured by a camera: after the portrait video is read frame by frame, a portrait region image with a 16:9 aspect ratio is cropped out using a face detection algorithm, then the portrait image is scaled to 1280 x 720 resolution and saved as a PNG picture to serve as a label sample. For each label sample, bilinear, trilinear or area interpolation is chosen at random to reduce the image to 320 x 180. Operations such as Gaussian blur, JPEG compression, noise addition, brightness adjustment and white balance are then applied at random to simulate low-resolution images of real scenes and generate the training data set. In addition, to improve the robustness of the model, a small amount of real-scene data without portraits is introduced during training, improving robustness to scene images that contain no portrait. Based on the generative adversarial network, the training data set is input into the generator model to obtain generated data, the generated data is input into the discriminator models to obtain discrimination results, and the parameters of the face super-resolution model are updated according to the discrimination results. It should be noted that the loss function of the generator model is designed as follows:
L_total = alpha * L_charbonnier + beta * L_percep + gamma * L_GAN + delta * L_comp_GAN + epsilon * L_comp_gram;
where alpha, beta, gamma, delta and epsilon are hyperparameters, with 1.0, 0.05, 0.1 and 100 given as suitable values. L_charbonnier is the Charbonnier loss of the global picture, whose expression is L_charbonnier = sqrt((I_pred - I_gt)^2 + c^2), where I_pred is the image output by the generator, I_gt is the label sample picture, and c is a constant with a suitable value of 1e-4. L_percep is the perceptual loss of the global picture; the perceptual network uses a VGG19 network pre-trained on the ImageNet data set for perceptual loss evaluation. L_GAN is the least-squares adversarial loss of the global picture. Adding the global perceptual loss and global adversarial loss enhances the overall detail restoration of the image. L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix-feature L1 loss of the eye and mouth regions. The eye and mouth regions are separated from the image according to the face key-point information detected by a face key-point detection model. Introducing the Gram matrix into the loss calculation enhances the restoration of eye and mouth texture details.
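The per-term pieces of this loss can be sketched in numpy. Note the hedges: the text lists only four suitable weight values for five hyperparameters, so the default weights below are illustrative; the Charbonnier and Gram expressions follow their standard definitions.

```python
import numpy as np

def charbonnier(pred, gt, c=1e-4):
    """Global-picture Charbonnier loss: mean of sqrt((I_pred - I_gt)^2 + c^2)."""
    return float(np.mean(np.sqrt((pred - gt) ** 2 + c ** 2)))

def gram(feat):
    """Gram matrix of a C x H x W feature map (used for the eye/mouth texture loss)."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def total_loss(l_charb, l_percep, l_gan, l_comp_gan, l_comp_gram,
               alpha=1.0, beta=0.05, gamma=0.05, delta=0.1, eps=100.0):
    # Weight assignment is illustrative; the text does not map all five values.
    return (alpha * l_charb + beta * l_percep + gamma * l_gan
            + delta * l_comp_gan + eps * l_comp_gram)

g = gram(np.arange(18, dtype=float).reshape(2, 3, 3))
# Identical images: only the constant c survives inside the square root.
l = total_loss(charbonnier(np.zeros((4, 4)), np.zeros((4, 4))), 0, 0, 0, 0)
```

The adversarial and perceptual terms would come from the discriminator and VGG19 networks at training time; scalars stand in for them here.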
Further as a preferred embodiment, after the updating of the parameters of the face super-resolution model according to the discrimination result, the method further includes:
pruning is carried out on the updated face super-resolution model, and a pruning model is determined;
performing secondary training treatment on the pruning model to determine a training model;
and carrying out quantization processing on the training model, and determining the face super-resolution model.
In the embodiment of the application, the face super-resolution model is trained to a converged state, then training is stopped and the weights are saved; the model is pruned by removing nodes with small contribution from the model's computation graph, and a pruned network model is constructed. The accuracy of the model drops after pruning, so it is trained a second time with a smaller learning rate until it converges again. The model accuracy before and after pruning is compared, and the pruning result is kept if the accuracy difference is within a certain threshold, yielding the pruning model. Finally, int8 quantization is performed on the pruning model to obtain the face super-resolution model. It should be noted that the accuracy of the face super-resolution model may drop after quantization, and training can be continued according to the performance requirement until it is met. By applying int8 quantization after the secondary training, the embodiment reduces the computation and memory required for generator inference, greatly lowering the model's demand on device performance, so that the generator can be deployed on terminal devices for real-time video super-resolution processing.
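The compression steps can be sketched with generic stand-ins: the patent says only that "nodes with small contribution" are removed, so simple magnitude pruning and symmetric per-tensor int8 quantization are assumed here; real deployments would use a framework's pruning and quantization toolchain.

```python
import numpy as np

def prune_by_magnitude(weights, keep_ratio=0.7):
    """Zero out the smallest-magnitude entries (a stand-in for removing
    low-contribution nodes from the computation graph)."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * (1 - keep_ratio))
    if k == 0:
        return weights.copy()
    thresh = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric int8 quantization: float weights -> int8 values plus a scale."""
    m = float(np.max(np.abs(weights)))
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.linspace(0.01, 1.0, 100)          # toy weight tensor
pruned = prune_by_magnitude(w, keep_ratio=0.7)
q, scale = quantize_int8(pruned)         # dequantize with q * scale
```

The secondary training step between these two operations recovers the accuracy lost to pruning before quantization is applied.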
In one embodiment of the application, a first device initiates a video call request to a second device. The second device accepts the request, the two devices establish a video call connection, each opens its camera, and each prepares to send its video stream to the other party and to receive the stream transmitted by the other party. The video sending end reads the camera picture frame by frame and detects the region where the face is located using a face detection algorithm. According to the face region information, a region with a 16:9 aspect ratio containing the whole face is cropped from the picture; if no face is present in the picture, the current global picture is used instead. The cropped portrait region is scaled to 320 x 180 resolution and then sent to the video receiving end. After receiving the video stream, the receiving end decodes it frame by frame and performs the preprocessing operation, converting each image into a four-dimensional array of 1 x 3 x 320 x 180 (batch size x channels x height x width) with element type uint8, where each element is the pixel value of the corresponding RGB channel in the range 0-255. The four-dimensional array obtained after preprocessing is input into the portrait super-resolution model for super-resolution processing, which outputs a four-dimensional array of size 1 x 3 x 1280 x 720 (batch size x channels x height x width); removing the batch-size dimension yields a high-definition portrait image at 1280 x 720 resolution. The video receiving end renders and displays the generated high-definition portrait images frame by frame.
In another embodiment of the present application, the first device initiates a video call request to the second device, and the two parties agree on the resolution of the video to be sent according to the current network state and device capabilities. The second device accepts the request, the two devices establish a video call connection, each opens its camera, and each prepares to send its video stream to the other party and to receive the stream transmitted by the other party. The video sending end reads the camera picture frame by frame and detects the region where the face is located using a face detection algorithm. According to the face region information, a region with a 16:9 aspect ratio containing the whole face is cropped from the picture, centred on the face region; if no face is present in the picture, the current global picture is used instead. The cropped portrait region is scaled to the agreed video resolution. After receiving the video stream, the receiving end decodes it frame by frame and performs the preprocessing operation, converting each image into a four-dimensional array of 1 x 3 x h x w (batch size x channels x video height x video width) with element type uint8, where each element is the pixel value of the corresponding RGB channel in the range 0-255.
A super-resolution model with a suitable magnification is then selected according to the current device performance, for example a 2x model (the width and height of the input image are enlarged 2 times) or a 4x model (the width and height are enlarged 4 times). The preprocessed four-dimensional array is fed into the portrait super-resolution model for super-resolution processing, and the batch-size dimension of the model's four-dimensional output is removed to obtain the super-resolved high-definition image.
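The magnification choice above can be sketched as a small selection function. The tier names and the dictionary of model identifiers below are hypothetical; the embodiment only states that a 2x or 4x model is picked according to device performance and that both width and height are multiplied by the scale.

```python
# Hypothetical model registry; the embodiment only distinguishes 2x and 4x.
SR_MODELS = {2: "portrait_sr_x2", 4: "portrait_sr_x4"}

def select_sr_model(in_w, in_h, device_tier):
    """Pick the 4x model on high-performance devices and the 2x model
    otherwise; report the super-resolved output width and height."""
    scale = 4 if device_tier == "high" else 2
    return SR_MODELS[scale], in_w * scale, in_h * scale
```

For example, a 320 x 180 input on a high-performance device would be super-resolved to 1280 x 720, matching the first embodiment.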
In another aspect, an embodiment of the present application further provides a system, including:
the first module is used for acquiring a call video to be processed;
the second module is used for carrying out face detection processing on the call video to be processed and determining a cut video;
the third module is used for carrying out data preprocessing on the cut video and determining a preprocessing array;
and the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determining a super-resolution call video.
Referring to fig. 3, an embodiment of the present application further provides an electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory; when executed by the processor, the program implements the method described above.
Corresponding to the method of fig. 1, an embodiment of the present application further provides a storage medium, which is a computer-readable storage medium. The storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the method described above.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In summary, the embodiment of the application has the following advantages:
1. The embodiment of the application uses a face detection algorithm to detect and cut out the face region from the video call picture, exploiting the fact that in a video call scene the face is the core subject of the picture. By removing the background, which is not a key part of the scene, the requirement on network bandwidth is reduced while the face information, which is the key part, is preserved as far as possible.
2. The convolutional-neural-network-based portrait super-resolution model introduces a global mean absolute error loss, a global perceptual loss and a global adversarial loss, together with an adversarial loss on the eye and mouth regions, so that when a low-definition portrait is restored to a high-definition one, the textures of key facial regions such as the mouth and eyes are finely restored.
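Two of the loss ingredients named above, the Charbonnier pixel loss and the Gram-matrix-feature L1 loss on eye/mouth crops, can be sketched as follows. These are NumPy stand-ins operating on raw arrays; in a real model the Gram matrices would be taken over deep feature maps of the crops, which is an assumption this passage does not fix.

```python
import numpy as np

def charbonnier_loss(pred, gt, eps=1e-3):
    """Global Charbonnier loss, a smooth L1 variant:
    mean(sqrt((pred - gt)^2 + eps^2))."""
    return float(np.mean(np.sqrt((pred - gt) ** 2 + eps ** 2)))

def gram_l1_loss(feat_pred, feat_gt):
    """L1 distance between Gram matrices of (C, H, W) feature maps;
    applied to eye and mouth crops to match texture statistics."""
    def gram(f):
        c = f.shape[0]
        flat = f.reshape(c, -1)
        return flat @ flat.T / flat.shape[1]
    return float(np.mean(np.abs(gram(feat_pred) - gram(feat_gt))))
```

Because the Gram matrix discards spatial layout and keeps only channel co-activation statistics, penalising its L1 distance pushes the generator toward matching texture rather than exact pixel placement, which suits fine eye and mouth detail.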
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of that technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.
Claims (8)
1. A method for processing a video call, the method comprising:
acquiring a call video to be processed;
performing face detection processing on the call video to be processed, and determining a cut video;
performing data preprocessing on the cut video to determine a preprocessing array;
inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing, and determining a super-resolution call video;
the face super-resolution model comprises a generator model and a discriminator model, wherein the discriminator model comprises a global image discriminator, an eye region discriminator and a mouth region discriminator;
the generator model comprises an ordinary convolution layer, a depth-separable convolution layer, a residual addition layer and a sub-pixel convolution layer;
the loss function of the generator model is designed as follows:
L_total = α·L_charbonnier + β·L_percep + γ·L_GAN + δ·L_comp_GAN + ε·L_comp_gram;
wherein α, β, γ, δ and ε are hyperparameter weights; L_charbonnier is the Charbonnier loss of the global picture, computed between the image I_pred output by the generator and the label sample picture I_gt with a small constant E; L_percep is the perceptual loss of the global picture; L_GAN is the least-squares adversarial loss of the global picture; and L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix-feature L1 loss of the eye and mouth regions. The global perceptual loss and the global adversarial loss are used to enhance the overall detail restoration effect of the image, and the Gram-matrix-feature L1 loss is used to enhance the restoration of texture detail in the eye and mouth regions.
2. The method of claim 1, wherein the performing face detection processing on the call video to be processed to determine a cut video comprises:
performing face detection processing on the call video to be processed according to a face detection algorithm, and determining a face area;
and cutting the face area to determine a cutting video.
3. The method of claim 1, wherein the performing data preprocessing on the cropped video to determine a preprocessing array comprises:
performing frame-by-frame decoding processing on the cut video to determine decoding data;
and performing data conversion processing on the decoded data to determine a preprocessing array.
4. The method according to claim 1, wherein before the inputting the preprocessing array into a pre-trained face super-resolution model for super-resolution processing to determine a super-resolution call video, the method further comprises pre-training the face super-resolution model, specifically comprising:
acquiring a training data set;
inputting the training data set into the generator model to determine generated data;
inputting the generated data into the discriminator model to determine a discrimination result;
and updating parameters of the face super-resolution model according to the discrimination result.
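The adversarial part of this pre-training, under the least-squares objective stated in claim 1, reduces to the two score-to-loss functions below. This is a sketch only: network architectures, optimisers and the weighting of the three discriminators (global, eye, mouth) are left out.

```python
import numpy as np

def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: pull scores for real samples
    toward 1 and scores for generated samples toward 0 (used by the
    global, eye and mouth discriminators alike)."""
    return float(np.mean((real_scores - 1.0) ** 2)
                 + np.mean(fake_scores ** 2))

def lsgan_g_loss(fake_scores):
    """Least-squares generator loss: push the discriminator's score
    for generated images toward 1."""
    return float(np.mean((fake_scores - 1.0) ** 2))
```

In each training iteration the generator's super-resolved output is scored by the discriminators; the discriminator parameters step down `lsgan_d_loss`, and the generator parameters step down the weighted total loss that includes `lsgan_g_loss` terms for the global picture and the eye/mouth crops.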
5. The method according to claim 4, wherein after the updating of the parameters of the face super-resolution model according to the discrimination result, the method further comprises:
pruning is carried out on the updated face super-resolution model, and a pruning model is determined;
performing secondary training treatment on the pruning model to determine a training model;
and carrying out quantization processing on the training model, and determining the face super-resolution model.
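The compression stage of claim 5 can be illustrated with the two standard operations below, magnitude pruning and symmetric int8 quantisation. Both are assumptions: the patent does not name a specific pruning criterion or quantisation scheme.

```python
import numpy as np

def prune_weights(w, sparsity=0.5):
    """Magnitude pruning: zero the smallest-|w| fraction of entries,
    leaving the rest untouched for the secondary training pass."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantisation; returns the int8 codes
    and the dequantised float approximation for error checking."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, q.astype(np.float32) * scale
```

Pruning followed by retraining recovers accuracy lost to the zeroed weights, and quantisation then shrinks the retrained model for on-device inference, which matches the three-step order of claim 5.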
6. A video call processing system, the system comprising:
the first module is used for acquiring a call video to be processed;
the second module is used for carrying out face detection processing on the call video to be processed and determining a cut video;
the third module is used for carrying out data preprocessing on the cut video and determining a preprocessing array;
the fourth module is used for inputting the preprocessing array into a pre-trained face super-resolution model to perform super-resolution processing and determining a super-resolution call video;
the face super-resolution model comprises a generator model and a discriminator model, wherein the discriminator model comprises a global image discriminator, an eye region discriminator and a mouth region discriminator;
the generator model comprises an ordinary convolution layer, a depth-separable convolution layer, a residual addition layer and a sub-pixel convolution layer;
the loss function of the generator model is designed as follows:
L_total = α·L_charbonnier + β·L_percep + γ·L_GAN + δ·L_comp_GAN + ε·L_comp_gram;
wherein α, β, γ, δ and ε are hyperparameter weights; L_charbonnier is the Charbonnier loss of the global picture, computed between the image I_pred output by the generator and the label sample picture I_gt with a small constant E; L_percep is the perceptual loss of the global picture; L_GAN is the least-squares adversarial loss of the global picture; and L_comp_GAN and L_comp_gram are, respectively, the least-squares adversarial loss and the Gram-matrix-feature L1 loss of the eye and mouth regions. The global perceptual loss and the global adversarial loss are used to enhance the overall detail restoration effect of the image, and the Gram-matrix-feature L1 loss is used to enhance the restoration of texture detail in the eye and mouth regions.
7. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory, the program when executed by the processor implementing the steps of the method according to any of claims 1 to 5.
8. A storage medium, being a computer-readable storage medium, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210987630.7A CN115376188B (en) | 2022-08-17 | 2022-08-17 | Video call processing method, system, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115376188A CN115376188A (en) | 2022-11-22 |
CN115376188B true CN115376188B (en) | 2023-10-24 |
Family
ID=84065885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210987630.7A Active CN115376188B (en) | 2022-08-17 | 2022-08-17 | Video call processing method, system, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115376188B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109711364A (en) * | 2018-12-29 | 2019-05-03 | 成都视观天下科技有限公司 | A kind of facial image super-resolution reconstruction method, device and computer equipment |
CN111709878A (en) * | 2020-06-17 | 2020-09-25 | 北京百度网讯科技有限公司 | Face super-resolution implementation method and device, electronic equipment and storage medium |
WO2021134872A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市爱协生科技有限公司 | Mosaic facial image super-resolution reconstruction method based on generative adversarial network |
CN113139915A (en) * | 2021-04-13 | 2021-07-20 | Oppo广东移动通信有限公司 | Portrait restoration model training method and device and electronic equipment |
CN113554058A (en) * | 2021-06-23 | 2021-10-26 | 广东奥普特科技股份有限公司 | Method, system, device and storage medium for enhancing resolution of visual target image |
CN113869282A (en) * | 2021-10-22 | 2021-12-31 | 马上消费金融股份有限公司 | Face recognition method, hyper-resolution model training method and related equipment |
CN114119874A (en) * | 2021-11-25 | 2022-03-01 | 华东师范大学 | Single image reconstruction high-definition 3D face texture method based on GAN |
WO2022110638A1 (en) * | 2020-11-30 | 2022-06-02 | 深圳市慧鲤科技有限公司 | Human image restoration method and apparatus, electronic device, storage medium and program product |
Non-Patent Citations (2)
Title |
---|
"Adversarial-learning-based image-to-image transformation: A survey";Yuan Chen等;《Neurocomputing》;第411卷;第468-486页 * |
"Face image inpainting based on cascaded generative adversarial networks"; Chen Junzhou, Wang Juan, Gong Xun; Journal of University of Electronic Science and Technology of China (Issue 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN115376188A (en) | 2022-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11410275B2 (en) | Video coding for machine (VCM) based system and method for video super resolution (SR) | |
US20220021887A1 (en) | Apparatus for Bandwidth Efficient Video Communication Using Machine Learning Identified Objects Of Interest | |
TWI805085B (en) | Handling method of chroma subsampled formats in machine-learning-based video coding | |
CN113538287A (en) | Video enhancement network training method, video enhancement method and related device | |
US11854164B2 (en) | Method for denoising omnidirectional videos and rectified videos | |
Afonso et al. | Spatial resolution adaptation framework for video compression | |
TWI807491B (en) | Method for chroma subsampled formats handling in machine-learning-based picture coding | |
Li et al. | End-to-end optimized 360° image compression | |
Guleryuz et al. | Sandwiched image compression: Increasing the resolution and dynamic range of standard codecs | |
Bonnineau et al. | Multitask learning for VVC quality enhancement and super-resolution | |
CN115442609A (en) | Characteristic data encoding and decoding method and device | |
CN117441333A (en) | Configurable location for inputting auxiliary information of image data processing neural network | |
CN115376188B (en) | Video call processing method, system, electronic equipment and storage medium | |
Xia et al. | Visual sensitivity-based low-bit-rate image compression algorithm | |
CN116847087A (en) | Video processing method and device, storage medium and electronic equipment | |
WO2023010981A1 (en) | Encoding and decoding methods and apparatus | |
Li et al. | Pseudocylindrical convolutions for learned omnidirectional image compression | |
Yang et al. | Graph-convolution network for image compression | |
CN117321989A (en) | Independent localization of auxiliary information in neural network-based image processing | |
Jia et al. | Deep convolutional network based image quality enhancement for low bit rate image compression | |
US11854165B2 (en) | Debanding using a novel banding metric | |
KR102604657B1 (en) | Method and Apparatus for Improving Video Compression Performance for Video Codecs | |
US11948275B2 (en) | Video bandwidth optimization within a video communications platform | |
CN114697709B (en) | Video transmission method and device | |
Watanabe et al. | Traffic reduction in video call and chat using dnn-based image reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||