CN114640882A - Video processing method and device, electronic equipment and computer readable storage medium - Google Patents

Video processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN114640882A
CN114640882A (application CN202011478292.1A)
Authority
CN
China
Prior art keywords
image
frame
video
enhanced
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011478292.1A
Other languages
Chinese (zh)
Inventor
周易
易阳
李昊沅
李峰
余晓铭
涂娟辉
左小祥
周泉
李新智
张永欣
万旭杰
杨超
刘磊
李博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011478292.1A priority Critical patent/CN114640882A/en
Publication of CN114640882A publication Critical patent/CN114640882A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242Synchronization processes, e.g. processing of PCR [Program Clock References]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video processing method, a video processing apparatus, an electronic device, and a computer-readable storage medium, and relates to the application of cloud technology in the field of multimedia. The method includes: receiving a video code stream, where the video code stream includes multiple frames of images and enhanced image information synchronized with each frame of image; decoding the video code stream to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image; fusing each frame of image with the enhanced image information synchronized with it to obtain a composite image corresponding to each frame of image; and sequentially displaying the composite image corresponding to each frame of image in a human-computer interaction interface. With the method and apparatus, the original images of a video and other image information can be displayed accurately and synchronously, thereby enriching the ways in which the video can be displayed.

Description

Video processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of cloud technologies and computer multimedia technologies, and in particular, to a video processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of communication technology and the advancement of communication infrastructure construction, video transmission and playback by means of cloud technology are widely used.
Taking live broadcasting as an example, during a live broadcast the terminal device associated with the anchor continuously captures the anchor and transmits the captured video data to the live streaming platform in the cloud, and the live streaming platform then distributes the received video data to different viewer terminals. To keep the video content from being too monotonous, the live streaming platform also needs to transmit additional image information to be displayed at the viewer side when it transmits the video data; however, owing to the limitations of network transmission conditions, it is difficult to guarantee that the video and the additional image information are displayed synchronously.
Disclosure of Invention
The embodiment of the application provides a video processing method and device, electronic equipment and a computer readable storage medium, which can accurately and synchronously display original images and other image information of videos so as to enrich the display modes of the videos.
The technical scheme of the embodiment of the application is realized as follows:
an embodiment of the present application provides a video processing method, including:
receiving a video code stream, wherein the video code stream comprises multiple frames of images and enhanced image information synchronized with each frame of image;
decoding the video code stream to obtain the multi-frame images and enhanced image information synchronous with each frame of image;
performing fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image;
and sequentially displaying the composite images corresponding to the images of each frame in a human-computer interaction interface.
In the foregoing solution, the decoding the video code stream to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image includes: reading each network abstraction layer unit included in the video code stream, and determining the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is an image slice, performing a slice decoding operation to obtain the multiple frames of images; and when the type of the read network abstraction layer unit is supplemental enhancement information, performing a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
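For illustration only, the sketch below shows one possible way a receiver could split an H.264 Annex B byte stream into network abstraction layer units and branch on the nal_unit_type field (1/5 for picture slices, 6 for SEI); the helper names are assumptions, and emulation-prevention bytes are deliberately ignored to keep the sketch short.

```python
import re

# Minimal sketch: split an H.264 Annex B stream into NAL units and dispatch on
# nal_unit_type. Emulation-prevention bytes (0x03) are ignored here for brevity;
# a real decoder must remove them from the RBSP before parsing it.

SLICE_NON_IDR, SLICE_IDR, SEI = 1, 5, 6  # H.264 nal_unit_type values

def split_nal_units(stream: bytes):
    """Yield NAL units delimited by 0x000001 / 0x00000001 start codes."""
    starts = [m.end() for m in re.finditer(b"\x00\x00\x01", stream)]
    for begin, end in zip(starts, starts[1:] + [len(stream)]):
        nal = stream[begin:end].rstrip(b"\x00")  # drop zeros belonging to the next start code
        if nal:
            yield nal

def decode_stream(stream: bytes, decode_slice, decode_sei):
    """Route each NAL unit to the slice decoder or the SEI decoder."""
    for nal in split_nal_units(stream):
        nal_unit_type = nal[0] & 0x1F            # low 5 bits of the NAL header byte
        if nal_unit_type in (SLICE_NON_IDR, SLICE_IDR):
            decode_slice(nal[1:])                # -> reconstructed video frame
        elif nal_unit_type == SEI:
            decode_sei(nal[1:])                  # -> enhanced image information
        # SPS, PPS and other types would be handled here as well
```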
An embodiment of the present application provides a video processing apparatus, including:
the receiving module is used for receiving a video code stream, wherein the video code stream comprises multiple frames of images and enhanced image information synchronized with each frame of image;
the decoding processing module is used for decoding the video code stream to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image;
the fusion processing module is used for performing fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image;
and the display module is used for sequentially displaying the composite image corresponding to each frame of image in a human-computer interaction interface.
In the foregoing solution, when the enhanced image information is an image mask, the fusion processing module is further configured to perform the following processing for each frame of image: masking the image based on an image mask synchronized with the image to obtain a composite image which corresponds to the image and from which the background has been removed; wherein the image mask is generated by performing object recognition on the image.
In the above scheme, the receiving module is further configured to obtain a background image; the fusion processing module is further configured to merge the background-removed composite image with the background image; and the display module is also used for displaying the merged image obtained by merging in the human-computer interaction interface.
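A minimal sketch of the masking and background-merging described in the two preceding paragraphs, assuming the decoded frame, the mask (already enlarged to frame size, with values in [0, 1]) and the background image are NumPy arrays; the array names and layouts are assumptions for illustration:

```python
import numpy as np

def compose_with_background(frame: np.ndarray,      # HxWx3, uint8 decoded video frame
                            mask: np.ndarray,       # HxW, float in [0, 1], 1 = foreground
                            background: np.ndarray  # HxWx3, uint8 replacement background
                            ) -> np.ndarray:
    """Use the image mask as an alpha channel: keep the object, replace the rest."""
    alpha = mask[..., None].astype(np.float32)       # HxWx1 so it broadcasts over RGB
    merged = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return merged.clip(0, 255).astype(np.uint8)
```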
In the above scheme, the size of the image mask synchronized with each frame of image is smaller than the size of the image; the device further comprises an enlargement processing module, which is used for enlarging the enhanced image information synchronized with each frame of image; and the device further comprises a noise reduction processing module, which is used for performing noise reduction processing on the enhanced image information obtained after the enlargement processing.
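One possible post-processing sketch for the small, compression-noisy mask; the choice of bilinear enlargement followed by Gaussian smoothing is an assumption, and any comparable noise-reduction filter could be substituted:

```python
import cv2
import numpy as np

def postprocess_mask(small_mask: np.ndarray, frame_w: int, frame_h: int) -> np.ndarray:
    """Enlarge a compressed low-resolution mask to frame size and reduce coding noise."""
    mask = cv2.resize(small_mask.astype(np.float32), (frame_w, frame_h),
                      interpolation=cv2.INTER_LINEAR)      # enlargement
    mask = cv2.GaussianBlur(mask, (5, 5), 0)                # noise reduction / edge smoothing
    return np.clip(mask, 0.0, 1.0)
```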
In the above scheme, when the enhanced image information is the position of a key point of an object in the image, the fusion processing module is further configured to obtain a special effect matched with the key point, and add the special effect at the position of the key point of the object in each frame of image, so as to obtain, for each frame of image, a composite image to which the special effect has been added.
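As an illustration of adding a special effect at a keypoint position, the sketch below overlays a small RGBA effect image centered on each decoded keypoint; the keypoint dictionary layout and the effect image are assumptions used only to show the overlay step:

```python
import numpy as np

def add_keypoint_effect(frame: np.ndarray, keypoints: dict, effect: np.ndarray) -> np.ndarray:
    """Overlay a small RGBA effect image centered on each keypoint, e.g. {'nose': (x, y)}."""
    out = frame.copy()
    eh, ew = effect.shape[:2]
    for (x, y) in keypoints.values():
        x0, y0 = int(x - ew // 2), int(y - eh // 2)
        x1, y1 = x0 + ew, y0 + eh
        if x0 < 0 or y0 < 0 or x1 > out.shape[1] or y1 > out.shape[0]:
            continue                                   # skip effects that fall off the frame
        alpha = effect[..., 3:4].astype(np.float32) / 255.0
        region = out[y0:y1, x0:x1].astype(np.float32)
        out[y0:y1, x0:x1] = (alpha * effect[..., :3] + (1 - alpha) * region).astype(np.uint8)
    return out
```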
In the foregoing solution, when the enhanced image information includes a posture of an object in the image, the fusion processing module is further configured to perform at least one of the following processing for each frame of image: determining the state corresponding to each object in the image according to the type of the posture, counting the states of all the objects in the image, and adding the statistical result into the image to obtain a composite image which corresponds to the image and to which the statistical result has been added; and acquiring a special effect adapted to the posture of the object in the image, and adding the special effect into the image to obtain a composite image which corresponds to the image and to which the special effect has been added.
In the above scheme, the decoding processing module is further configured to read each network abstraction layer unit included in the video code stream and determine the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is an image slice, perform a slice decoding operation to obtain the multiple frames of images; and when the type of the read network abstraction layer unit is supplemental enhancement information, perform a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
An embodiment of the present application provides another video processing method, including:
acquiring a multi-frame image;
carrying out computer vision processing on each acquired frame image to obtain enhanced image information corresponding to each frame image;
generating a video code stream comprising each frame of image and enhanced image information corresponding to each frame of image;
sending the video code stream;
the video code stream is used for being decoded by terminal equipment so as to display a composite image corresponding to each frame of image, and the composite image is obtained by carrying out fusion processing on each frame of image and enhanced image information synchronous with each frame of image.
In the foregoing solution, the generating a video code stream including each frame of image and enhanced image information corresponding to each frame of image includes: respectively encoding the image and the enhanced image information corresponding to the image to obtain an image code and an enhanced image information code corresponding to the image code; encapsulating the image code into a network abstraction layer unit whose type is image slice, and encapsulating the enhanced image information code into a network abstraction layer unit whose type is supplemental enhancement information; and assembling the network abstraction layer unit whose type is image slice and the network abstraction layer unit whose type is supplemental enhancement information into a video code stream.
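A minimal sketch of this encapsulation step, assuming the enhanced image information has already been serialized to bytes and using the H.264 user_data_unregistered SEI payload (payloadType 5) as the carrier; the identifier value, the omission of emulation-prevention bytes, and the SEI-before-slice ordering are illustrative assumptions rather than requirements of the embodiment:

```python
import uuid

APP_UUID = uuid.UUID("00000000-0000-0000-0000-000000000000").bytes  # hypothetical identifier

def _sei_ff_encode(value: int) -> bytes:
    """SEI payloadType / payloadSize coding: 0xFF for every full 255, then the remainder."""
    return b"\xff" * (value // 255) + bytes([value % 255])

def build_sei_nal(enhanced_info: bytes) -> bytes:
    """Wrap serialized enhanced image information in a user_data_unregistered SEI NAL unit."""
    payload = APP_UUID + enhanced_info
    sei_message = _sei_ff_encode(5) + _sei_ff_encode(len(payload)) + payload
    rbsp = sei_message + b"\x80"                    # rbsp_trailing_bits
    nal_header = bytes([0x06])                      # nal_ref_idc = 0, nal_unit_type = 6 (SEI)
    # NOTE: emulation-prevention bytes (0x03) are omitted in this sketch.
    return b"\x00\x00\x00\x01" + nal_header + rbsp

def assemble_stream(slice_nals, sei_payloads) -> bytes:
    """Interleave one SEI NAL unit with the slice NAL unit(s) of each frame."""
    stream = b""
    for slice_nal, info in zip(slice_nals, sei_payloads):
        stream += build_sei_nal(info) + slice_nal   # slice_nal already carries its start code
    return stream
```

At the receiving end, a NAL unit whose type is 6 and whose payload carries the same identifier is recognized as carrying the enhanced image information for the accompanying frame.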
An embodiment of the present application provides another video processing apparatus, including:
the acquisition module is used for acquiring multi-frame images;
the computer vision processing module is used for carrying out computer vision processing on each frame of acquired image to obtain enhanced image information corresponding to each frame of image;
the generating module is used for generating a video code stream comprising each frame of image and enhanced image information corresponding to each frame of image;
the sending module is used for sending the video code stream;
the video code stream is used for being decoded by terminal equipment so as to display a composite image corresponding to each frame of image, and the composite image is obtained by carrying out fusion processing on each frame of image and enhanced image information synchronous with each frame of image.
In the foregoing solution, when the enhanced image information is an image mask, the computer vision processing module is further configured to perform the following processing for each frame of image: calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, and generating an image mask corresponding to the background; the image segmentation model is obtained by training an object labeled in a sample image based on the sample image.
In the above solution, the computer vision processing module is further configured to use the image as an input of the image segmentation model to determine an object in the image, use the region other than the object as a background, and generate an image mask corresponding to the background, where the size of the image mask is consistent with the size of the image; or to reduce the size of the image, use the reduced image as an input of the image segmentation model to determine an object in the reduced image, use the region outside the object as a background, and generate an image mask corresponding to the background, where the size of the image mask is smaller than the size of the image.
In the foregoing solution, the computer vision processing module is further configured to perform the following processing for each frame of image: calling a key point detection model to perform key point detection on the image so as to obtain the position of the key point of the object in the image; determining the position of the key point of the object as enhanced image information corresponding to the image; the key point detection model is obtained by training based on a sample image and the position of a key point of an object marked in the sample image.
In the foregoing solution, the computer vision processing module is further configured to perform the following processing for each frame of image: calling a posture detection model to perform posture detection so as to obtain posture information of the object in the image; and determining the posture information of the object as the enhanced image information corresponding to the image.
In the foregoing solution, the generating module is further configured to respectively encode the image and the enhanced image information corresponding to the image, so as to obtain an image code and an enhanced image information code corresponding to the image code; encapsulate the image code into a network abstraction layer unit whose type is image slice, and encapsulate the enhanced image information code into a network abstraction layer unit whose type is supplemental enhancement information; and assemble the network abstraction layer unit whose type is image slice and the network abstraction layer unit whose type is supplemental enhancement information into a video code stream.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video processing method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for processing video provided by the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
the enhanced image information synchronous with each frame of image is integrated into the video code stream, so that after the receiving end receives the video code stream, each frame of image and the enhanced image information synchronous with each frame of image can be subjected to fusion processing to obtain a composite image corresponding to each frame of image, the original image and other image information of the video are accurately and synchronously displayed, and the display mode of the video is enriched.
Drawings
Fig. 1 is a schematic diagram of a picture-in-picture provided by an embodiment of the present application;
fig. 2 is a schematic block diagram of a video processing system 100 according to an embodiment of the present disclosure;
fig. 3A is a schematic structural diagram of a terminal device 300 provided in an embodiment of the present application;
fig. 3B is a schematic structural diagram of a terminal device 400 provided in the embodiment of the present application;
fig. 4 is a schematic flowchart of a video processing method provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a hierarchical structure of an h.264 code stream provided in the embodiment of the present application;
fig. 6 is a schematic flowchart of a video processing method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a decoding process for NAL units according to an embodiment of the present application;
fig. 8 is a schematic application scenario diagram of a video processing method provided in an embodiment of the present application;
fig. 9 is a schematic application scenario diagram of a video processing method provided in an embodiment of the present application;
fig. 10 is a schematic flow chart of an SEI information production phase provided by an embodiment of the present application;
fig. 11 is a schematic flowchart of an SEI information consumption stage according to an embodiment of the present application;
fig. 12 is a schematic flowchart of post-processing performed on a mask map obtained after decoding at a receiving end according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a saturation filter provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of a picture-in-picture interface provided by an embodiment of the present application;
fig. 15 is a schematic page diagram corresponding to different stages of a video processing method according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Picture In Picture (PiP), a technique for adding one image to another image. For example, referring to fig. 1, which is a schematic diagram of a PiP provided in an embodiment of the present application, a small picture 102 is added to a large picture 101; that is, the small picture 102 is played over the large picture 101 in a floating window, and the small picture 102 supports arbitrary movement and scaling.
2) Supplemental Enhancement Information (SEI), a term in the video transmission field. SEI provides a way to add additional information into a video code stream and is one of the features of video compression standards such as H.264 and H.265. SEI has the following basic characteristics: 1. it is not a mandatory part of the decoding process; 2. it may assist the decoding process; 3. it is integrated in the video code stream.
For example, referring to table 1, which is a schematic table of H.264/AVC fields provided in an embodiment of the present application, the network abstraction layer unit type of SEI in the H.264/AVC standard is 6.
TABLE 1 Schematic of H.264/AVC fields

Network abstraction layer unit type | Network abstraction layer unit content
1 | Slice of a non-IDR picture, without data partitioning
5 | Slice of an IDR picture
6 | Supplemental Enhancement Information (SEI)
7 | Sequence Parameter Set (SPS)
8 | Picture Parameter Set (PPS)
11 | End-of-stream symbol
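To make the table concrete, the sketch below reads serialized enhanced image information back out of an SEI raw byte sequence payload, assuming the user_data_unregistered payload (type 5) with a 16-byte identifier prefix described later in the encoding stage; emulation-prevention handling is omitted as a simplification:

```python
def extract_user_data(sei_rbsp: bytes, app_uuid: bytes):
    """Parse one SEI message and return its user data if it carries our 16-byte identifier."""
    def read_ff(i):
        # payloadType / payloadSize coding: sum 0xFF bytes, then add the final byte
        value = 0
        while sei_rbsp[i] == 0xFF:
            value += 255
            i += 1
        return value + sei_rbsp[i], i + 1

    i = 0
    payload_type, i = read_ff(i)
    payload_size, i = read_ff(i)
    payload = sei_rbsp[i:i + payload_size]
    if payload_type == 5 and payload[:16] == app_uuid:   # 5 = user_data_unregistered
        return payload[16:]                              # serialized enhanced image information
    return None
```

The returned bytes would then be deserialized into the image mask, keypoint positions, or posture information synchronized with the frame carried by the neighbouring slice NAL units.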
3) Portrait segmentation, a term in the computer vision field: the input is an image containing a person, and the output is a portrait mask. Generally, the mask that is output directly needs to be post-processed before it can be used as the Alpha (alpha) channel of the image. In addition, to save computation, the image is usually reduced first and the portrait mask is computed on the reduced image; in the post-processing stage the portrait mask is then enlarged back to the size of the original image.
4) Face keypoint detection, a term in the computer vision field, also known as face keypoint localization or face alignment, refers to locating the key regions of a given face image, including the eyebrows, eyes, nose, mouth, and face contour. Face keypoint detection methods fall roughly into three categories: methods based on the Active Shape Model (ASM); methods based on cascaded shape regression (CPR); and methods based on deep learning.
5) Human posture estimation (human pose estimation), a term in the computer vision field, includes single-person posture estimation, multi-person posture estimation, human posture tracking, and three-dimensional human posture estimation. In practice, human posture estimation is often converted into a human keypoint prediction problem: the position coordinates of each keypoint of the human body are predicted first, and the spatial relations between the keypoints are then determined according to prior knowledge, yielding the predicted human skeleton.
With the development of communication technology and the advancement of communication infrastructure, real-time video communication technology is widely used. In some real-time video scenarios, it is desirable to transmit not only the video but also additional information (also referred to as enhanced image information), where the additional information may include a portrait mask (hereinafter also referred to as a mask map) corresponding to each frame of image, face keypoints corresponding to each frame of image, or human posture information corresponding to each frame of image.
However, the scheme provided by the related art has the following problems in simultaneously transmitting additional information:
1) it is difficult to achieve strict synchronization
When additional information is transmitted, it is desired that the transmitted additional information correspond strictly to the video frames; for example, taking the additional information being a portrait mask as an example, the portrait mask of the i-th frame should correspond strictly to the image of the i-th frame. If dedicated transmission channels are established for the video frames and the portrait masks respectively, so that the video and the portrait masks are transmitted separately, it is difficult, owing to the limitations of network transmission conditions, for the receiving end to achieve strict correspondence between the portrait masks and the video frames.
2) Bandwidth increase
Due to the need to transmit additional information (e.g. portrait mask), the bandwidth occupied by the transmission will increase dramatically, for example, when the same size portrait mask as the video frame is transmitted, the amount of data to be transmitted will increase by about 30%, i.e. equivalent to adding one channel to a Red-Green-Blue (RGB) image.
3) Loss is difficult to recover
In order to reduce the bandwidth pressure, the transmitted extra information (such as the portrait mask) needs to be compressed, and the compression introduces noise into the original signal (i.e. the portrait mask), which affects the display effect of the subsequent image if the portrait mask containing the noise is directly used to mask the video frame image.
In view of the foregoing technical problems, embodiments of the present application provide a video processing method, an apparatus, an electronic device, and a computer-readable storage medium, which can accurately and synchronously display an original image of a video and other image information to enrich a display manner of the video. An exemplary application of the video processing method provided by the embodiment of the present application is described below, and the video processing method provided by the embodiment of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal device alone, or may be implemented by a server and a terminal device cooperatively.
The following describes an example of implementing the video processing method provided by the embodiment of the present application by using a server and a terminal device in a cooperative manner. Referring to fig. 2, fig. 2 is a schematic structural diagram of a video processing system 100 provided in the embodiment of the present application, in order to implement accurate synchronous presentation between an original image of a video and other image information. Among them, the video processing system 100 includes: the server 200, the terminal device 300, and the terminal device 400, which will be described separately below.
The server 200 is a background server of the client 310 and the client 410, and is configured to receive the video code stream sent by the terminal device 300 associated with the sender, and push the received video code stream to the terminal device 400 associated with the receiver. For example, when clients 310 and 410 are live clients, server 200 may be a background server of a live platform; when clients 310 and 410 are video conference clients, server 200 may be a backend server that serves video conferences; when clients 310 and 410 are instant messaging clients, server 200 may be a background server for the instant messaging clients.
The terminal device 300 is a terminal device associated with the sender (i.e., used by the sender); for example, for live streaming, the terminal device 300 may be the terminal device associated with the anchor, and for a video conference, the terminal device 300 may be the terminal device associated with the moderator of the video conference. The terminal device 300 has a client 310 running thereon, and the client 310 may be any of various types of clients, such as a live streaming client, a video conference client, or an instant messaging client. The terminal device 300 is configured to obtain multiple frames of images (for example, images of the anchor or the moderator captured by invoking the camera of the terminal device 300) and to invoke the computing capability of the terminal device 300 to perform computer vision processing on each frame of the obtained images, so as to obtain the (synchronized) enhanced image information corresponding to each frame of image. The terminal device 300 then encodes the obtained multiple frames of images and the enhanced image information corresponding to each frame of image to generate a video code stream including each frame of image and the enhanced image information corresponding to each frame of image (the process of generating the video code stream will be described in detail below). Finally, the terminal device 300 transmits the generated video code stream to the server 200.
Terminal device 400 is a terminal device associated with the recipient, e.g., for live broadcast, terminal device 400 may be a terminal device associated with the viewer; for a video conference, terminal device 400 may be a terminal device associated with a participant of the video conference. The terminal device 400 runs thereon a client 410, and the client 410 may be various types of clients, such as a live client, a video conference client, an instant messaging client, and the like. The terminal device 400 is configured to receive a video code stream issued by the server 200, and perform decoding processing on the received video code stream to obtain multiple frames of images and enhanced image information synchronized with each frame of image; then, the terminal device 400 performs fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image; subsequently, the terminal device 400 invokes the human-computer interaction interface of the client 410 to sequentially display the composite image corresponding to each frame of image.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, a network service, cloud communication, middleware services, domain name services, security services, a CDN, and a big data and artificial intelligence platform. The terminal device 300 and the terminal device 400 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal device 300 and the server 200, and the server 200 and the terminal device 400 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiment of the present application.
In other embodiments, the video bitstream may also be generated by the server 200. For example, the terminal device 300 only sends the collected multiple frames of images to the server 200, so that the server 200 performs computer vision processing on each frame of received image to obtain enhanced image information corresponding to each frame of image, then the server 200 generates a video code stream including each frame of image and the enhanced image information corresponding to each frame of image, and then the server 200 pushes the generated video code stream to the terminal device 400 associated with the receiving party.
It should be noted that, the above embodiment is described by taking two users (for example, a main broadcaster and a viewer, or a conference host and a participant object) as an example, in practical applications, the number of users is not limited, and may be three users (for example, a main broadcaster and two viewers), or more users (for example, a main broadcaster and multiple viewers), and the embodiment of the present application is not limited specifically here. In addition, the client 310 and the client 410 may be the same type of client, for example, the client 310 and the client 410 may be video conference clients with the same function, that is, the client 310 and the client 410 both have functions of initiating a video conference and participating in the video conference; of course, the client 310 and the client 410 may also be different types of clients, for example, the client 310 is a live client of a main broadcast end, and can collect the main broadcast and push the collected video data to a background server of a live broadcast platform; the client 410 is a live client of the viewer, and has only functions of playing live content and sending a barrage.
The structure of the terminal device 300 in fig. 2 is explained below. Referring to fig. 3A, fig. 3A is a schematic structural diagram of a terminal device 300 according to an embodiment of the present application, where the terminal device 300 shown in fig. 3A includes: at least one processor 360, memory 350, at least one network interface 320, and a user interface 330. The various components in the terminal device 300 are coupled together by a bus system 340. It will be appreciated that the bus system 340 is used to enable connected communication between these components. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 340 in fig. 3A.
The Processor 360 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 330 includes one or more output devices 331, including one or more speakers and/or one or more visual display screens, that enable presentation of media content. The user interface 330 also includes one or more input devices 332, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 360.
The memory 350 may include either volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 350 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 350 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below, to support various operations.
The operating system 351, including system programs for handling various basic system services and performing hardware related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used for implementing various basic services and for handling hardware based tasks.
A network communication module 352 for communicating with other computing devices via one or more (wired or wireless) network interfaces 320, exemplary network interfaces 320 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
A presentation module 353 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 331 (e.g., a display screen, speakers, etc.) associated with the user interface 330.
An input processing module 354 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 3A shows a video processing apparatus 355 stored in a memory 350, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: the obtaining module 3551, the computer vision processing module 3552, the generating module 3553, and the transmitting module 3554 are logical and thus may be arbitrarily combined or further separated depending on the functionality implemented. The functions of the respective modules will be explained below.
In other embodiments, the apparatus provided in the embodiments of the present Application may be implemented in hardware, and for example, the apparatus provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the video processing method provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The following continues the description of the structure of the terminal device 400 in fig. 2. Referring to fig. 3B, fig. 3B is a schematic structural diagram of a terminal device 400 according to an embodiment of the present application. As shown in fig. 3B, the terminal device 400 includes: a memory 450 for storing executable instructions; the processor 460 is configured to implement the video processing method provided by the embodiment of the present application when processing the executable instructions stored in the memory 450. Further, the video processing apparatus 455, which may be software in the form of programs and plug-ins, etc., stored in the memory 450 includes the following software modules: a receiving module 4551, a decoding processing module 4552, a fusion processing module 4553, a display module 4554, an enlargement processing module 4555, and a noise reduction processing module 4556, which are logical and thus may be arbitrarily combined or further divided according to the functions implemented. In addition, the terminal device 400 further includes a network interface 420, a user interface 430 (including an output device 431 and an input device 432), and a bus system 440, and the memory 450 further stores an operating system 451, a network communication module 452, a presentation module 453, and an input processing module 454, the functions of the above components are similar to those of the corresponding components in fig. 3A, and reference may be made to the description of fig. 3A, which is not repeated herein in this embodiment of the present application.
The video processing method provided by the embodiment of the present application will be described below with reference to an exemplary application and implementation of the terminal device provided by the embodiment of the present application. It is to be understood that the steps performed by the terminal device may specifically be performed by a client running on the terminal device.
The video processing method provided by the embodiment of the application mainly comprises a generation stage of the video code stream and a consumption stage of the video code stream, and the generation stage of the video code stream is firstly explained below.
Referring to fig. 4, fig. 4 is a schematic flowchart of a video processing method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
It should be noted that the execution subject of steps S101 to S104 shown in fig. 4 may be the terminal device 300 associated with the sender in fig. 2, and for convenience of description, the terminal device 300 associated with the sender is hereinafter referred to as a first terminal device, and the terminal device 400 associated with the receiver is hereinafter referred to as a second terminal device.
In step S101, a plurality of frame images are acquired.
In some embodiments, the target object may be acquired by invoking a camera of the first terminal device to obtain a multi-frame image. For example, taking a video conference as an example, the first terminal device may be a terminal device associated with a conference initiator, and acquires the conference initiator by calling a camera of the first terminal device to obtain a multi-frame image including the conference initiator; for example, for a live scene, the first terminal device may be a terminal device associated with a main broadcast, and the main broadcast is acquired by invoking a camera of the first terminal device to obtain a multi-frame image including the main broadcast.
In step S102, computer vision processing is performed on each acquired frame of image, and enhanced image information corresponding to each frame of image is obtained.
In some embodiments, the first terminal device may perform the computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image by: the following processing is performed for each frame image: calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, generating an image mask corresponding to the background, and taking the generated image mask as enhanced image information synchronized with the image; the image segmentation model is obtained by training an object labeled in the sample image based on the sample image.
For example, the image segmentation model may be trained using a supervised training method, wherein the image segmentation model may be various types of neural network models, including a deep convolutional neural network model, a fully-connected neural network model, and the like, and the Loss Function may be any form of Loss Function for characterizing the difference between the predicted object position and the annotated object position, such as a Mean square Error Loss Function (MSE), a Hinge Loss Function (HLF), a Cross Entropy Loss Function (Cross Entropy), and the like.
In an example, taking a live streaming scenario as an example, after the camera of the first terminal device is invoked to capture the anchor and a multi-frame image including the anchor is obtained, the following processing is performed on any one frame of the captured images: the frame is input into a pre-trained portrait segmentation model, so that the portrait segmentation model performs pixel-level segmentation on the frame (that is, it judges, pixel by pixel, whether the current pixel belongs to the portrait; if so, the pixel is assigned to the portrait region). Based on the segmentation result, the portrait region corresponding to the position of the anchor in the frame is determined, the region outside that portrait region is taken as the background region, and a portrait mask corresponding to the background region is generated. The portrait mask can be used to remove the background region of the image and keep only the portrait: performing a masking operation on the image based on the portrait mask means recalculating the value of each pixel in the image according to the mask. For example, the portrait mask may take the value 1 inside the portrait region and 0 outside it, so that after the masking operation the pixel values outside the portrait region are reset to 0 and only the portrait is kept. A mask can also describe the degree to which neighboring pixels influence a new pixel value, in which case the original pixels are weighted and averaged according to the weighting factors in the mask; masking operations are commonly used in image smoothing, edge detection, feature analysis, and other fields. Finally, the generated portrait mask is used as the enhanced image information corresponding to the frame. The portrait segmentation model is obtained by training based on sample images and the portraits labeled in the sample images. In addition, it should be noted that the portrait mask output by the portrait segmentation model may be a two-dimensional array whose values range over [0, 1], that is, the portrait mask consists of decimals between 0 and 1.
In other embodiments, in order to reduce the bandwidth cost of transmitting the image masks, the acquired multi-frame images may first be reduced in size; for example, assuming that the resolution of the acquired image is 1080p (i.e., 1920 × 1080), the image may be reduced to 100 × 100 and the reduced image then input to the image segmentation model, so as to obtain an image mask whose size is smaller than the original size of the image, for example an image mask of only 100 × 100. In this way the size of the transmitted image mask is much smaller than the original size of the image, which can greatly reduce the bandwidth pressure caused by transmitting the image masks.
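A sketch of producing the reduced-size mask described above; the segmentation model interface (a callable returning a per-pixel foreground probability map) and the 100 × 100 working size are assumptions taken from the example in the preceding paragraph:

```python
import cv2
import numpy as np

MASK_SIZE = (100, 100)  # working resolution for the mask, per the example above

def make_small_mask(frame_bgr: np.ndarray, segmentation_model) -> np.ndarray:
    """Run portrait segmentation on a downscaled copy of the frame.

    `segmentation_model` is assumed to map an HxWx3 image to an HxW array of
    foreground probabilities in [0, 1]; the transmitted mask keeps this small size.
    """
    small = cv2.resize(frame_bgr, MASK_SIZE, interpolation=cv2.INTER_AREA)
    mask = segmentation_model(small)                 # values in [0, 1], 1 = portrait
    return np.clip(mask.astype(np.float32), 0.0, 1.0)
```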
In other embodiments, the first terminal device may further perform the computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image by: the following processing is performed for each frame image: calling a key point detection model to perform key point detection on the image so as to obtain the position of a key point of an object in the image; determining the position of the obtained key point of the object as enhanced image information corresponding to the image; the key point detection model is obtained by training based on the sample image and the position of the key point of the object included in the sample image.
For example, taking a video conference as an example, after a camera carried by a first terminal device is called to collect a conference initiator and a multi-frame image including the conference initiator is obtained, the following processing is performed on any one of the collected multi-frame images: inputting any frame of image into a human face key point detection model trained in advance, so that the human face key point detection model determines the positions of key points of a conference initiator in the image, such as the position of a nose of the conference initiator, the position of eyebrows of the conference initiator, the position of a mouth of the conference initiator and the like, and then determining the determined positions of the key points of the conference initiator as enhanced image information synchronous with any frame of image; the face key point detection model is obtained by training based on the sample image and the positions of the key points of the face labeled in the sample image. For example, a supervised training method may be used to train a face key point detection model, where the face key point detection model may be various types of neural network models, including a deep convolutional neural network model, a fully-connected neural network model, and the like, and the loss function may be any form of loss function, such as MSE, HLF, and the like, for characterizing a difference between a position of a face key point of a predicted object and a position of a face key point of an annotated object.
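For illustration, a sketch of turning the detector output into the enhanced image information attached to one frame; the detector interface and the JSON serialization are assumptions, not part of the disclosed method:

```python
import json
import numpy as np

def keypoints_as_enhanced_info(frame: np.ndarray, keypoint_model) -> bytes:
    """Serialize detected face keypoints, e.g. {'nose': (x, y), 'mouth': (x, y), ...}.

    `keypoint_model` is assumed to return a name -> (x, y) mapping in pixel coordinates.
    """
    keypoints = keypoint_model(frame)
    info = {name: [float(x), float(y)] for name, (x, y) in keypoints.items()}
    return json.dumps({"type": "face_keypoints", "points": info}).encode("utf-8")
```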
In some embodiments, the first terminal device may also perform the computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image by: the following processing is performed for each frame image: calling a posture detection model to perform posture detection on the object in the image so as to obtain posture information of the object in the image; the obtained pose information of the object is determined as enhanced image information corresponding to the image.
For example, taking a live streaming scenario as an example, after the camera of the first terminal device is invoked to capture the anchor and a multi-frame image including the anchor is obtained, the following processing is performed on any one frame of the captured images: the frame is input into a pre-trained human body posture detection model for posture detection, so as to obtain the posture information of the anchor in the frame; that is, the human body posture detection model extracts the anchor's skeleton keypoints from the frame and connects them to obtain the anchor's posture information, and the determined posture information of the anchor is then used as the enhanced image information synchronized with the frame.
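A sketch of the keypoints-to-skeleton step mentioned above, where the list of limb connections encodes the prior knowledge; the keypoint names and connection list are illustrative assumptions:

```python
# Prior knowledge: which predicted body keypoints are connected by a limb (assumed subset).
SKELETON_EDGES = [
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "right_shoulder"), ("left_hip", "right_hip"),
]

def keypoints_to_skeleton(body_keypoints: dict) -> list:
    """Return the (point_a, point_b) coordinate pairs that form the predicted skeleton."""
    return [(body_keypoints[a], body_keypoints[b])
            for a, b in SKELETON_EDGES
            if a in body_keypoints and b in body_keypoints]
```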
In step S103, a video code stream including each frame image and enhanced image information corresponding to each frame image is generated.
To facilitate understanding of the video processing method provided in the embodiments of the present application, before describing a process of generating a video bitstream including each frame of image and enhanced image information corresponding to each frame of image, a structure of the video bitstream will be described first.
For example, taking H.264 as an example, the functions of H.264 are divided into two layers: a Video Coding Layer (VCL), i.e., the output of the coding process, which represents a sequence of compression-coded video data, and a Network Abstraction Layer (NAL). Before the VCL data are transmitted or stored, these encoded VCL data are mapped or encapsulated into NAL units. Each NAL unit consists of a set of NAL header information corresponding to the video data and a Raw Byte Sequence Payload (RBSP). The basic structure of an RBSP is the original encoded data followed by trailing bits added for byte alignment. That is, the H.264 code stream is composed of one NAL unit after another, where each NAL unit includes NAL header information and an RBSP.
The overall structure of the h.264 code stream is explained below. For example, referring to fig. 5, fig. 5 is a schematic diagram of a hierarchical structure of an h.264 code stream provided in the embodiment of the present application. As shown in fig. 5, the overall structure of the h.264 code stream can be divided into six layers, which will be described below.
Layer one is the packaging format of the H.264 code stream, which includes the byte-stream (Annex B) format and the Real-time Transport Protocol (RTP) format. The Annex B format is the default output format of most encoders and is also called a raw (bare) stream because it is not encapsulated by a transport protocol, while the RTP format is a data format for network transmission.
Layer two is the NAL unit; each NAL unit includes a NAL unit header and a NAL unit body corresponding to the video coding data, where NAL units come in many types, including: delimiter, padding, Supplemental Enhancement Information (SEI), Sequence Parameter Set (SPS), Picture Parameter Set (PPS), and so on.
Layer three is the slice (Slice). The slice header contains information such as the slice type, the macroblock types in the slice, the number of slices of the frame, which picture the slice belongs to, and the settings and parameters of the corresponding frame; the slice data consists of macroblocks, which is where the image pixel data is stored.
It should be noted that the concept of a slice is different from that of a frame: a frame describes one image, and one frame corresponds to one image, whereas the slice is a new concept proposed in H.264, formed by partitioning an encoded image in an efficient manner. One image has at least one slice, and possibly more; slices are carried in NAL units and transmitted over the network, and NAL units can also carry other information describing the video.
Layer four is the slice data, i.e., the macroblocks, which are the primary carriers of video information and contain the luminance and chrominance information of each pixel. The main task of video decoding is to provide an efficient way to obtain the pixel arrays in the macroblocks from the video code stream. It should also be noted that the slice type corresponds to the macroblock type: for example, an I slice contains only I macroblocks, and an I macroblock performs intra prediction using decoded pixels in the current slice as a reference; a P slice may contain both P macroblocks, which are inter predicted using previously encoded pictures as reference pictures, and I macroblocks.
Layer five is the Pulse Code Modulation (PCM) type, which indicates how the original pixel values stored in a macroblock are encoded; for example, when the macroblock type (mb_type) is the I_PCM mode, the original pixel values of the macroblock are stored directly.
Layer six is the residual (Residual), which is used to store the residual data.
In some embodiments, after obtaining the enhanced image information corresponding to each frame of image, the first terminal device may generate the video code stream including each frame of image and the enhanced image information corresponding to each frame of image in the following manner: encoding each frame of image and the enhanced image information corresponding to each frame of image, respectively, to obtain multiple frames of image codes and an enhanced-image-information code corresponding to each image code; encapsulating the image codes, one to one, into network abstraction layer units whose type is image slice, and encapsulating the enhanced-image-information codes, one to one, into network abstraction layer units whose type is supplemental enhancement information; and assembling the network abstraction layer units of the type image slice and the network abstraction layer units of the type supplemental enhancement information into the video code stream.
By way of example, taking the enhanced image information as an image mask: after obtaining the image mask corresponding to each frame of image, the first terminal device encodes the multiple frames of images and the image mask synchronized with each frame of image to obtain multiple frames of encoded images and the encoded image mask corresponding to each encoded image, and then encapsulates the encoded images into NAL units of the type image slice, where the NAL units of the type image slice include: slices of non-partitioned, non-Instantaneous Decoding Refresh (IDR) pictures, slice partition A, slice partition B, slice partition C, and slices of IDR pictures. In addition, the first terminal device sequentially encapsulates the encoded image mask corresponding to each encoded image into NAL units of the type SEI, and assembles the video code stream to be transmitted from the NAL units of the type image slice and the NAL units of the type SEI.
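As an illustration of the encapsulation step, the following sketch shows one possible way to wrap compressed mask bytes in a user_data_unregistered SEI message (payloadType 5) and emit it as an Annex B NAL unit of type 6. The 16-byte UUID value is a hypothetical application-chosen identifier, and interleaving the SEI NAL units with the slice NAL units of the corresponding frames is left to the stream writer.

```python
MASK_SEI_UUID = b"portrait-mask-01"          # exactly 16 bytes, placeholder value

def _ebsp(rbsp: bytes) -> bytes:
    """Insert emulation-prevention bytes (0x03) so the payload cannot imitate a start code."""
    out, zeros = bytearray(), 0
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def build_mask_sei_nal(mask_bytes: bytes) -> bytes:
    """Return an Annex B NAL unit of type 6 (SEI) carrying the compressed mask."""
    payload = MASK_SEI_UUID + mask_bytes
    rbsp = bytearray([5])                    # payloadType 5 = user_data_unregistered
    size = len(payload)
    while size >= 255:                       # payloadSize with 0xFF continuation bytes
        rbsp.append(255)
        size -= 255
    rbsp.append(size)
    rbsp += payload
    rbsp.append(0x80)                        # rbsp_trailing_bits
    return b"\x00\x00\x00\x01\x06" + _ebsp(bytes(rbsp))
```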
It should be noted that, in addition to the above SEI generation method during video encoding, the SEI generation method may also be used to filter an existing code stream and insert SEI field information, or insert SEI during writing in a container layer, which is not specifically limited in this embodiment of the present application.
In step S104, a video stream is transmitted.
In some embodiments, after generating the video code stream, the first terminal device may send the generated video code stream to the server, so that the server pushes the received video code stream to the second terminal device.
For example, taking a live broadcast scenario as an example, after the terminal device associated with the anchor (i.e., the first terminal device) generates the video code stream, it pushes the generated video code stream to the background server of the live broadcast platform, so that after receiving the video code stream, the background server of the live broadcast platform pushes it to the terminal device associated with the viewer (i.e., the second terminal device).
The following is a description of the consumption phase of the video stream. For example, referring to fig. 6, fig. 6 is a schematic flowchart of a video processing method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 6.
It should be noted that the main body of the execution of steps S201 to S204 shown in fig. 6 may be the terminal device 400 in fig. 2, that is, the terminal device associated with the receiving party, and is used to receive the video code stream and perform decoding and playing, and for convenience of description, the terminal device associated with the receiving party (that is, the terminal device 400 shown in fig. 2) is referred to as a second terminal device.
In step S201, a video bitstream is received.
In some embodiments, after generating a video code stream, the first terminal device sends the generated video code stream to the server, so that the server pushes the received video code stream to the second terminal device after receiving the video code stream sent by the first terminal device, where the video code stream received by the second terminal device includes multiple frames of images and enhanced image information synchronized with the multiple frames of images.
For example, taking a live broadcast scene as an example, the first terminal device may be a terminal device associated with a main broadcast, and after generating a video code stream including a plurality of frames of images and enhanced image information synchronized with each frame of image, the first terminal device sends the generated video code stream to a background server of a live broadcast platform, so that after receiving the video code stream sent by the first terminal device, the background server of the live broadcast platform sends the received video code stream to a terminal device (i.e., a second terminal device) associated with a viewer, so that the second terminal device performs decoding and playing on the received video code stream.
For example, taking a video conference as an example, the first terminal device may be a terminal device associated with an initiator of the conference, and after generating a video code stream including multiple frames of images and enhanced image information synchronized with each frame of image, the first terminal device sends the generated video code stream to a background server of the video conference, so that after receiving the video code stream sent by the first terminal device, the background server of the video conference pushes the received video code stream to a terminal device (i.e., a second terminal device) associated with a participant object, so that the second terminal device performs decoding and playing on the received video code stream.
In addition, in order to facilitate understanding of the video processing method provided in the embodiment of the present application, before describing a decoding process of a subsequent video code stream, a process of performing a decoding process on a plurality of NAL units included in the video code stream by a receiving end is described first.
For example, referring to fig. 7, fig. 7 is a schematic flowchart of the decoding processing performed on NAL units according to an embodiment of the present application. As shown in fig. 7, after receiving the video code stream, the receiving end first reads a NAL unit from the video code stream, then extracts the RBSP syntax structure from the read NAL unit, and performs the corresponding decoding process according to the type of the NAL unit. For example, when the receiving end determines that the type of the NAL unit is 6 (the type value of SEI), the SEI decoding process is entered to obtain the additional information transmitted in the SEI; when the type of the NAL unit is 7 (the type value of SPS), the SPS decoding process is entered to obtain the sequence parameters; when the type of the NAL unit is 5 (the type value of an IDR slice), the slice decoding process is entered to obtain a decoded picture.
In step S202, a video code stream is decoded to obtain multiple frames of images and enhanced image information synchronized with each frame of image.
In some embodiments, the second terminal device may decode the video code stream in the following manner to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image: reading each network abstraction layer unit included in the video code stream and determining the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is image slice, performing a slice decoding operation to obtain the multiple frames of images; and when the type of the read network abstraction layer unit is supplemental enhancement information, performing a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
Illustratively, taking the enhanced image information as an image mask, after receiving a video code stream, the second terminal device traverses and reads a plurality of NAL units included in the received video code stream, and determines the type of each read NAL unit; when the type of the read NAL unit is determined to be an image slice, performing slice segmentation decoding operation to obtain a corresponding decoded image; when it is determined that the type of the read NAL unit is SEI, an SEI decoding operation is performed to obtain an image mask synchronized with each frame image.
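The dispatch described above can be sketched as follows; decode_slice and decode_mask_sei are hypothetical helpers standing in for the H.264 slice decoder and the SEI/mask parser, and the sketch simply pairs each decoded frame with the mask carried by the SEI message that precedes it.

```python
# A minimal sketch of the receiving-end dispatch: walk the NAL units of the
# received stream and pair each decoded frame with the mask from the SEI
# message that precedes it.

SEI, IDR_SLICE, NON_IDR_SLICE = 6, 5, 1

def frames_with_masks(nal_units, decode_slice, decode_mask_sei):
    """nal_units yields (nal_unit_type, payload); the two decoders are caller-supplied."""
    pending_mask = None
    for nal_type, payload in nal_units:
        if nal_type == SEI:
            pending_mask = decode_mask_sei(payload)     # mask for the upcoming frame
        elif nal_type in (IDR_SLICE, NON_IDR_SLICE):
            frame = decode_slice(payload)
            yield frame, pending_mask                   # strictly synchronized pair
            pending_mask = None
        # SPS/PPS and other NAL types configure the decoder but carry no mask
```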
In step S203, each frame of image and the enhanced image information synchronized with each frame of image are subjected to fusion processing, so as to obtain a composite image corresponding to each frame of image.
In some embodiments, when the enhanced image information is an image mask, the second terminal device may perform the above-described fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image by: the following processing is performed for each frame image: masking the image based on an image mask synchronized with the image to obtain a composite image which corresponds to the image and is removed of the background; wherein the image mask is generated by performing object recognition on the image.
For example, taking a video conference as an example, after the terminal device associated with a participant decodes the received video code stream to obtain the multiple frames of images and the portrait mask corresponding to each frame of image, the following processing is performed on each frame of image: a masking operation is performed on the image based on the portrait mask synchronized with the image to obtain a composite image corresponding to the image with the background removed (i.e., the image of the conference initiator with the background removed). In this way, only the background-removed image of the conference initiator is presented on the terminal device of the participant, which improves the display effect of the video.
In some embodiments, in order to save the network bandwidth occupied by transmitting the image mask, the image mask may be compressed; however, the compression may introduce noise into the image mask, and if the second terminal device directly performed the masking operation on the image using the decoded image mask, the display effect of the image could be affected. In view of this, before fusing each frame of image with the image mask synchronized with it, the second terminal device may further perform the following operation: calling a saturation filter to denoise the decoded image mask, so as to restore the information of the image mask before compression to the greatest extent and thereby improve the display effect of the composite image.
In other embodiments, when the enhanced image information is the positions of the key points of the object in the image, the second terminal device may perform the fusion processing on each frame of image and the enhanced image information synchronized with each frame of image in the following manner to obtain a composite image corresponding to each frame of image: obtaining a special effect matched with the key points; and adding the special effect at the positions of the key points corresponding to the object in each frame of image to obtain a composite image corresponding to each frame of image with the special effect added.
For example, taking a live broadcast scene as an example, when the terminal device associated with a viewer decodes the received video code stream to obtain the multiple frames of images and the positions of the anchor's key points synchronized with each frame of image, a special effect matched with the anchor's key points can be obtained. For example, when the key points are the anchor's eyes, cartoon glasses can be obtained and added at the eye positions of the anchor in each frame of image, so that an anchor picture with cartoon glasses added is presented on the terminal device associated with the viewer, which enriches the display effect of the live broadcast picture and improves the user experience.
For example, referring to fig. 8, fig. 8 is a schematic view of an application scene of the video processing method provided in the embodiment of the present application. As shown in fig. 8, after the position 801 of the anchor's eyebrows in the image is obtained, a special effect matched with the position 801, such as cartoon glasses, can be obtained, and the cartoon glasses 802 can then be added at the anchor's eyebrow position in the live broadcast picture, thereby enriching the display effect of the live broadcast picture.
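A minimal sketch of such a key-point overlay is given below; it assumes the special effect is an RGBA sticker that fits fully inside the frame, and the key-point coordinates come from the decoded SEI information.

```python
import numpy as np

def add_sticker(frame: np.ndarray, sticker_rgba: np.ndarray, center_xy) -> np.ndarray:
    """Alpha-blend an RGBA sticker (e.g. cartoon glasses) onto the frame, centered on a key point.
    Assumes the sticker lies fully inside the frame; frame is H x W x 3 (uint8)."""
    h, w = sticker_rgba.shape[:2]
    x, y = center_xy
    top, left = y - h // 2, x - w // 2
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0       # sticker transparency
    region = frame[top:top + h, left:left + w].astype(np.float32)
    blended = alpha * sticker_rgba[..., :3] + (1.0 - alpha) * region
    frame[top:top + h, left:left + w] = blended.astype(frame.dtype)
    return frame

# Usage: glasses = RGBA sticker array; eye_xy = key point decoded from the SEI;
# composite = add_sticker(decoded_frame, glasses, eye_xy)
```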
In some embodiments, when the enhanced image information includes the posture of the object in the image, the second terminal device may perform the fusion processing on each frame of image and the enhanced image information synchronized with each frame of image in the following manner to obtain a composite image corresponding to each frame of image, performing at least one of the following processes for each frame of image: determining the state corresponding to each object in the image according to the type of the posture, counting the states corresponding to all the objects in the image, and adding the statistical result to the image to obtain a composite image corresponding to the image with the statistical result added; and acquiring a special effect matched with the posture of the object in the image, and adding the acquired special effect to the image to obtain a composite image corresponding to the image with the special effect added.
For example, taking a video conference as an example, when the terminal device associated with a participant object decodes the received video code stream to obtain the multiple frames of images and the postures of the objects synchronized with each frame of image, the following processing is performed on each frame of image: the state (such as mood or viewpoint) corresponding to each participant in the image is determined according to the type of the posture, the states of all the participants appearing in the image are counted, and the statistical result is added to the image. For example, suppose that during a vote on a certain topic there are always 5 participant objects in the image (1 conference initiator and 4 conference participants), the postures of 3 participant objects are detected as thumbs-up, indicating agreement, and the postures of the other 2 participant objects are thumbs-down, indicating objection; a statistical result of 3 votes in favor and 2 votes against is then generated and added to the corresponding image. For example, referring to fig. 9, fig. 9 is a schematic view of an application scenario of the video processing method provided in the embodiment of the present application. As shown in fig. 9, there are 3 participant objects in a frame of image, and the posture information 901 corresponding to each of the 3 participant objects has been acquired: the body orientations of 2 participant objects face the screen, indicating agreement, while the body orientation of 1 participant object is turned away from the screen, indicating objection. A statistical result 902 can then be generated from the states of the 3 participant objects and presented in the image, which makes it convenient for the conference participants to learn the statistical result.
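A minimal sketch of the tallying step is given below; the pose labels "thumbs_up" and "thumbs_down" are assumed outputs of the pose decoding step and are not taken from the patent.

```python
from collections import Counter

# A minimal sketch of tallying participant states from decoded pose labels.
# Anything other than the two assumed labels is counted as an abstention.

POSE_TO_VOTE = {"thumbs_up": "agree", "thumbs_down": "disagree"}

def tally_votes(poses):
    """poses: one pose label per participant object detected in the frame."""
    votes = Counter(POSE_TO_VOTE.get(p, "abstain") for p in poses)
    return f"agree: {votes['agree']}  disagree: {votes['disagree']}  abstain: {votes['abstain']}"

# Usage: overlay tally_votes(["thumbs_up", "thumbs_up", "thumbs_down"]) on the frame
# with any text-drawing routine, e.g. cv2.putText.
```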
For example, taking a live broadcast scene as an example, when the terminal device associated with a viewer decodes the received video code stream to obtain the multiple frames of images and the anchor's posture synchronized with each frame of image, a special effect adapted to the anchor's posture in the images can be obtained. That is, after obtaining the anchor's posture, the terminal device associated with the viewer can load special effects such as graphics, styles, and artistic modeling onto the anchor's body in the successive frames; by tracking the changes of the anchor's body posture, the added special effect stays naturally fused with the anchor as the anchor moves.
In step S204, a composite image corresponding to each frame image is sequentially displayed in the human-computer interaction interface.
In some embodiments, before sequentially displaying the composite image corresponding to each frame of image in the human-computer interaction interface, the second terminal device may further perform the following processing: acquiring a background image; and merging the background image with the background-removed composite image obtained by performing the masking operation on the image based on the image mask synchronized with the image, so as to display the merged image in the human-computer interaction interface.
For example, taking a video conference scene as an example: while sending the video code stream that includes the multiple frames of images collected by the camera and the portrait mask corresponding to each frame of image to the background server of the video conference, the terminal device associated with the conference initiator may also send a video code stream obtained by capturing its screen, such as a slide show (PPT) shared by the conference initiator, to the background server. The terminal device associated with a participant can therefore simultaneously receive the video code stream containing the shared PPT and the video code stream containing the multiple frames of images and the portrait mask corresponding to each frame of image sent by the conference initiator. After the masking operation is performed on each image based on the portrait mask synchronized with the image to obtain the background-removed person image corresponding to the image, the person image with the background removed can be merged and rendered with the shared PPT to achieve a picture-in-picture effect (e.g., the picture-in-picture interface display effect shown in fig. 14).
According to the video processing method provided by the embodiment of the application, the enhanced image information synchronous with each frame of image is integrated in the video code stream, so that after a receiving end receives the video code stream, each frame of image and the enhanced image information synchronous with each frame of image can be subjected to fusion processing to obtain a composite image corresponding to each frame of image, and then the composite image corresponding to each frame of image is sequentially displayed in a human-computer interaction interface, so that the original image of the video and other image information (such as a mask image, key point information and the like) can be accurately and synchronously displayed, and the display mode of the video is enriched.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a video processing method that transmits additional information based on Supplemental Enhancement Information (SEI), where the additional information transmitted based on SEI includes a portrait mask, face key points, human body posture information, and the like. Conventionally, SEI is used to transmit encoder parameters, video copyright information, camera parameters, event identifiers, and so on; that is, the data transmitted by conventional SEI is often constant data or data of short length, and such data does not have a one-to-one correspondence with the original video frames. In contrast, in the video processing method provided in the embodiment of the present application, the additional information transmitted based on SEI is the image processing result obtained after performing Computer Vision (CV) processing on each frame of image, and it needs to be strictly synchronized with the original video images. The SEI-based transmission provided in the embodiment of the present application further has the following advantages:
1) Protocol independence: the current mainstream protocols (such as the Real Time Streaming Protocol (RTSP) and the Real Time Messaging Protocol (RTMP)) both support SEI, and other protocols can be supported simply by parsing the received SEI information at the receiving end.
2) High compatibility: SEI is not a mandatory part of the decoding process, so backward compatibility is preserved; that is, when SEI parsing is not supported, the original decoder (for example, an H.264 decoder) ignores the SEI and can still continue to decode the video.
3) Bandwidth friendliness: the resolution of the video code stream is high definition, and the image resolution can reach 1080p, 1280p, or even higher. If a portrait mask of the original size corresponding to each frame of image were transmitted directly, the bandwidth would increase by 30% for an RGB image, and by 50% for a YUV image (YUV is a color coding method in which Y represents luminance (Luma), i.e., the gray-scale value, and U and V represent chrominance (Chroma); the YUV image format here is NV12). In other words, the bandwidth cost of introducing the portrait mask in this way would be quite high.
In order to solve the bandwidth problem, the embodiment of the present application splits off the post-processing of the portrait segmentation model: what is transmitted is the small-size mask map output by the portrait segmentation model (much smaller than the video size, for example on the order of 100 × 100; for convenience of description, hereinafter referred to as the small mask map), rather than the full-size mask map produced by complete post-processing (i.e., a mask map of the same size as the video). This ensures that the transmitted portrait mask is independent of the video frame size and reduces the bandwidth occupied by transmitting the portrait mask.
In addition, in order to further reduce the bandwidth occupied by transmitting the portrait mask, the embodiment of the present application may also compress the small mask map. For example, the small mask map may be compressed with the DEFLATE algorithm provided by zlib (zlib is a general-purpose compression library that provides a set of in-memory compression and decompression functions and can check the integrity of the decompressed data), which reduces the size of the small mask map to some extent. However, this type of compression algorithm does not take the temporal nature of the small mask maps into account and can only compress each frame's small mask map independently. Therefore, for further compression, H.264 compression techniques may be employed to perform video compression on the sequence of small mask maps. In this way, by truncating the portrait segmentation post-processing and applying a video compression algorithm, the portrait mask can be transmitted with minimal bandwidth.
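For illustration, a per-frame zlib/DEFLATE round trip over a quantized small mask map could look like the following sketch; the 100 × 100 shape follows the order-of-magnitude example above, and the compression level is an arbitrary choice.

```python
import zlib
import numpy as np

# A minimal sketch: per-frame DEFLATE compression of a small mask map with zlib.
# Assumes the mask has already been quantized to uint8 (the [0, 255] mapping is
# described below).

def compress_mask(mask_u8: np.ndarray) -> bytes:
    return zlib.compress(mask_u8.tobytes(), level=6)

def decompress_mask(data: bytes, shape=(100, 100)) -> np.ndarray:
    return np.frombuffer(zlib.decompress(data), dtype=np.uint8).reshape(shape)
```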
4) The embodiment of the present application further provides a post-processing method for recovering the portrait mask from the noise introduced by video compression, which can effectively suppress that noise and thus restore, to the greatest extent, the information of the small mask map before compression.
The video processing method provided by the embodiment of the present application is described in detail below, taking as an example a video conference in which the desktop is shared and a PPT is played while only the portrait (i.e., the image of the person with the background removed) is presented.
The video processing method provided by the embodiment of the application mainly comprises an SEI information production stage and an SEI information consumption stage, wherein the SEI information production stage mainly comprises the steps that a video code stream processing module of a sending end generates additional information aiming at a video code stream to be transmitted, and the generated additional information is packaged into SEI information which is put back into the video code stream; in the SEI information consumption stage, the receiving end mainly analyzes the received SEI information and displays the picture-in-picture by using mask information in the SEI information. The two stages are specifically described below.
For example, referring to fig. 10, fig. 10 is a schematic flow chart of an SEI information production phase provided in an embodiment of the present application. The processing of the video stream after being input into the video stream processing module is described with reference to fig. 10.
After the video code stream is input into the video code stream processing module, decoding processing is firstly performed in the decoding module, and after the decoding is completed, a decoded image, such as an image in a single frame RGB format or YUV format, is obtained. For example, the decoding module decodes an incoming video code stream, and the decoding tool may be FFMPEG (FFMPEG is a set of open source computer programs that can record, convert digital audio and video, and convert them into streams, and adopts LGPL or GPL licenses, which provides a complete solution for recording, converting, and streaming audio and video, and has very powerful functions including a video capture function, video format conversion, video capture, and video watermarking, etc.), so that the video code stream can be converted into YUV data or RGB data that can be processed by the CV algorithm.
The sending end (corresponding to the first terminal device) then performs computer vision processing on the single-frame decoded image obtained after the decoding processing, such as calling a portrait segmentation model to perform portrait segmentation on the decoded image, calling a key point detection model to perform key point detection on the face in the decoded image, or calling a posture evaluation model to perform posture evaluation on the human body in the decoded image. Taking portrait segmentation as an example, the decoded image is first reduced, and the reduced decoded image is input into the portrait segmentation model to output a small mask map M_small.
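A minimal sketch of this split post-processing is given below; segmentation_model is a hypothetical callable returning a float mask in [0, 1], and the model input size is an assumed value.

```python
import cv2
import numpy as np

# A minimal sketch of the split post-processing: the frame is reduced before
# segmentation, and only the model's small output mask is kept for transmission;
# upsampling to full size is deferred to the receiving end.

def produce_small_mask(frame_rgb: np.ndarray, segmentation_model,
                       model_input_size=(160, 160)) -> np.ndarray:
    small_input = cv2.resize(frame_rgb, model_input_size, interpolation=cv2.INTER_LINEAR)
    return segmentation_model(small_input)   # M_small, independent of the video frame size
```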
After the CV processing is performed on the decoded image, an image processing result is obtained (taking the small mask map as an example). The sending end then performs video compression on the small mask map and the decoded image respectively to obtain the image code and the image-processing-result code (i.e., the small mask map code), and packs the small mask map code into the SEI of the video frame.
In addition, since the input supported by video compression is usually an image (for an RGB image, the value range is [0, 255]), while the value range of the small mask map is [0, 1], the small mask map cannot be video-compressed directly. Therefore, the value range of the small mask map first needs to be mapped to the interval [0, 255], i.e., the small mask map is multiplied by 255 before video compression:

M'_small = M_small × 255

When the receiving end subsequently uses the small mask map, its value range needs to be restored to the interval [0, 1] before use.
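The mapping and its inverse can be sketched as follows; the rounding and clipping are added here for safety and are not spelled out in the formula above.

```python
import numpy as np

# A minimal sketch of the value-range mapping: scale the [0, 1] small mask to
# [0, 255] before video compression, and scale back on the receiving side.

def mask_to_uint8(mask_small: np.ndarray) -> np.ndarray:
    return np.clip(np.rint(mask_small * 255.0), 0, 255).astype(np.uint8)

def uint8_to_mask(mask_u8: np.ndarray) -> np.ndarray:
    return mask_u8.astype(np.float32) / 255.0
```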
Finally, the sending end writes the packed SEI information into the video code stream and pushes the stream out, which completes the production of the SEI information. For example, FFMPEG may be used to convert the small mask map into an SEI code stream, which is then written into the original video code stream and merged with it. Finally, the merged video code stream is pushed out, completing the entire SEI information production stage.
It should be noted that, in addition to filtering the existing code stream and inserting the SEI field information as shown in fig. 10, the SEI information may be generated during video encoding or inserted during writing in the container layer, which is not specifically limited in this embodiment of the present application.
The SEI information consumption phase is explained below.
For example, referring to fig. 11, fig. 11 is a schematic flowchart of an SEI information consumption stage provided in an embodiment of the present application. As shown in fig. 11, the SEI information consumption phase is mainly divided into the following three steps:
After receiving the video code stream, the receiving end (corresponding to the second terminal device) first decodes the received video code stream, that is, it decodes the image codes included in the video code stream and parses the SEI information, so as to obtain the decoded image and the decoded small mask map respectively. For example, the receiving end may use FFMPEG to decode the image from the video code stream, where the decoded image may be an image I_stream in YUV format; at the same time, FFMPEG may be used to decode the small mask map M_stream from the SEI information of the video code stream. The value range of M_stream is approximately [0, 255]; because of the noise introduced by video compression, some values may even be greater than 255. The value range of M_stream subsequently needs to be reduced back to the interval [0, 1]:

M_small_dec = M_stream / 255

where M_small_dec, the small mask map obtained by decoding, differs from the small mask map M_small directly output by the portrait segmentation model, because the noise n introduced by video compression is superimposed on it, i.e., M_small_dec = M_small + n.
Then, the receiving end performs post-processing on the decoded mask map, including enlargement and noise reduction, and fuses the post-processed mask map with the decoded image, that is, the post-processed mask map is used as the Alpha channel of the decoded image. The Alpha channel describes the transparency or translucency of an image; for example, for a bitmap stored with 16 bits per pixel, 5 bits may be used for red, 5 bits for green, 5 bits for blue, and the last bit for Alpha, in which case the Alpha bit of each pixel indicates whether the pixel is transparent or opaque. In the embodiment of the present application, the portrait mask is used as the alpha channel of the decoded image for fusion processing, so as to obtain the person image with the background removed.
The following describes a process of performing post-processing on the mask map obtained after decoding by the receiving end.
For example, referring to fig. 12, fig. 12 is a schematic flowchart of a process of performing post-processing on a mask map obtained after decoding at a receiving end according to an embodiment of the present application, and as shown in fig. 12, a post-processing process performed at the receiving end mainly includes three steps of mask amplification, mask noise reduction, and image merging, which are respectively described below.
(1) Mask amplification
After the receiving end parses the SEI information in the video code stream to obtain the small mask map M_small_dec, whose values are generally distributed in the interval [0, 1], the small mask map can be enlarged to the same size as the decoded image through a resize operation, where the optional resize algorithms include bilinear interpolation, nearest-neighbor sampling, and the like. In this way, by enlarging the small mask map M_small_dec, a large mask map with the same size as the decoded image is obtained:

M_large = resize(M_small_dec)
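A minimal sketch of the enlargement step, using bilinear interpolation as one of the resize options mentioned above:

```python
import cv2
import numpy as np

# A minimal sketch of mask amplification: resize the decoded small mask map to
# the decoded frame's resolution.

def enlarge_mask(mask_small_dec: np.ndarray, frame_shape) -> np.ndarray:
    h, w = frame_shape[:2]
    return cv2.resize(mask_small_dec, (w, h), interpolation=cv2.INTER_LINEAR)
```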
(2) Mask noise reduction
The large mask map M_large obtained after the enlargement has undergone video compression and therefore carries noise; if the distorted large mask map M_large were used directly, the final rendering result would be seriously degraded. Therefore, in order to denoise the large mask map M_large, a filter, referred to here as a saturation filter, may be provided. After parsing the small mask map M_stream from the SEI information in the video code stream, the receiving end first reduces the value range of M_stream approximately to the interval [0, 1], i.e., divides it by 255, to obtain the small mask map M_small_dec; it then enlarges M_small_dec to the same size as the decoded image to obtain the large mask map M_large; and finally it inputs the obtained large mask map M_large into the saturation filter for noise reduction. For example, the saturation filter f_sat is defined as follows:

f_sat(x) = 0, if x < lower; (x − lower) / (upper − lower), if lower ≤ x ≤ upper; 1, if x > upper

where upper denotes the upper limit of the saturation filter: when the value of a pixel x in the large mask map M_large is greater than upper, it is set to 1 after the noise reduction of the saturation filter f_sat; lower denotes the lower limit of the saturation filter: when the value of a pixel x in the large mask map M_large is smaller than lower, it is set to 0 after the noise reduction of the saturation filter f_sat. For example, referring to fig. 13, fig. 13 is a schematic diagram of the saturation filter provided in the embodiment of the present application. As shown in fig. 13, lower may take 0.2 and upper may take 0.8, so that through these two parameters the values of the large mask map M_large are limited to [0, 1], which removes most of the noise introduced by video compression. That is, after the distorted large mask map M_large is input into the saturation filter f_sat for noise reduction, the denoised large mask map M_large_dn is obtained:

M_large_dn = f_sat(M_large)
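A sketch of the saturation filter under the reading above (hard cut-offs at lower and upper, with a linear ramp in between, which is an assumption consistent with fig. 13):

```python
import numpy as np

# A minimal sketch of the saturation filter f_sat: values above `upper` saturate
# to 1, values below `lower` are cut to 0; the defaults follow the example above.

def f_sat(mask_large: np.ndarray, lower: float = 0.2, upper: float = 0.8) -> np.ndarray:
    ramp = (mask_large - lower) / (upper - lower)
    return np.clip(ramp, 0.0, 1.0)
```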
(3) image merging
After the denoised large mask map M_large_dn is obtained, it can be merged into the decoded image in the form of an Alpha channel to obtain the person image with the background removed. Finally, the receiving end renders and uses the composite image on the user interface, which completes the whole picture-in-picture process. For example, the receiving end merges the denoised large mask map M_large_dn into the decoded image as its Alpha channel, and then merges and renders the decoded image containing the Alpha channel (i.e., the person image with the background removed, obtained after the masking operation is performed on the decoded image using the denoised large mask map) with the background image B (for example, the screen image shared by the initiator of the video conference) to obtain the image I_composite finally displayed on the user interface:

I_composite = M_large_dn × I_stream + (1 − M_large_dn) × B
the background image B and the video code stream may be respectively transmitted to a receiving end through different transmission channels, so as to perform merging and rendering at the receiving end, thereby implementing a picture-in-picture function (for example, a character image with the background removed may be displayed as a small picture suspended on a PPT (i.e., a large picture) shared by an initiator of a video conference, so as to achieve a picture-in-picture display effect). For example, referring to fig. 14, fig. 14 is a schematic interface diagram of a picture-in-picture provided in this embodiment of the present application, as shown in fig. 14, a character image 1402 with a background removed may be displayed in a floating manner in a background image 1401, where the background image 1401 may be various windows or Web pages (Web) such as Excel, Word, and the like, in addition to the PPT shown in fig. 14. In addition, since the embodiment of the present application is based on the human figure mask transmitted by SEI and strictly synchronized with the video frame, there is no delay problem in the background-removed human figure image 1402 (i.e. a small picture in picture) when performing PPT page turning.
The following further describes beneficial effects of the video processing method provided by the embodiment of the present application with reference to page diagrams respectively corresponding to different stages.
For example, referring to fig. 15, fig. 15 is a schematic view of pages corresponding to different stages of a video processing method provided in the embodiment of the present application, and as shown in fig. 15, a page 1501 is a decoded image obtained by a decoding module at a sending end decoding an input video code stream; the page 1502 is a character image obtained by performing background replacement by using a mask image output by the character segmentation model, that is, a character image obtained by performing background replacement by using a mask image which is not transmitted by SEI; the page 1503 is a human image obtained by directly using a mask image transmitted by SEI to realize background replacement by a receiving end, wherein the mask image transmitted based on SEI is not subjected to noise reduction processing by a saturation filter; the page 1504 is a human image obtained by the receiving end by performing background replacement using a mask image which is transmitted by the SEI and subjected to noise reduction processing by a saturation filter.
Comparing page 1503 with page 1504 shows that if the saturation-filter noise reduction is not performed (i.e., page 1503), the background is noisy: the mask map has not been "binarized" (i.e., after the mask map is input into the saturation filter, values larger than upper are set to 1 and values smaller than lower are set to 0), so the background area of the mask map is not set to 0, leaving a residual background tint. The saturation filter therefore serves to reduce noise and remove this background tint. Furthermore, comparing page 1502 with page 1504 shows that their display effects are equivalent; that is, the person image obtained by background replacement using the mask map transmitted via SEI is almost the same as the person image obtained using the mask map directly output by the portrait segmentation model, which means the information of the mask map before compression has been restored to the greatest extent.
In other embodiments, additional information, such as a mask map or other CV processing results corresponding to each frame of decoded image, may also be directly written on the decoded image of the corresponding frame according to a certain rule, so that the decoded image is transmitted as a carrier. At the receiving end, according to the specific rule, the extra information written in the decoded image is separated, and the extra information and the decoded image are respectively recovered, so that the video and the extra information are simultaneously transmitted.
Continuing with the exemplary structure of the video processing device 355 implemented as software modules provided by the embodiments of the present application, in some embodiments, as shown in fig. 3A, the software modules stored in the video processing device 355 of the memory 350 may include: an acquisition module 3551, a computer vision processing module 3552, a generation module 3553, and a transmission module 3554.
An obtaining module 3551, configured to obtain multiple frames of images; a computer vision processing module 3552, configured to perform computer vision processing on each acquired frame of image, so as to obtain enhanced image information corresponding to each frame of image; a generating module 3553, configured to generate a video code stream including each frame of image and enhanced image information corresponding to each frame of image; a sending module 3554, configured to send a video bitstream; the video code stream is used for being decoded by the terminal equipment to display a composite image corresponding to each frame of image, and the composite image is obtained by carrying out fusion processing on each frame of image and enhanced image information synchronous with each frame of image.
In some embodiments, when the enhanced image information is an image mask, the computer vision processing module 3552 is further configured to perform the following for each frame of image: calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, and generating an image mask corresponding to the background; the image segmentation model is obtained by training an object labeled in the sample image based on the sample image.
In some embodiments, the computer vision processing module 3552 is further configured to directly take the image as the input of the image segmentation model to determine the object in the image, take the region outside the object as the background, and generate an image mask corresponding to the background, where the size of the image mask is consistent with the size of the image; or to reduce the size of the image, take the reduced image as the input of the image segmentation model to determine the object in the reduced image, take the region outside the object as the background, and generate an image mask corresponding to the background, where the size of the image mask is smaller than the size of the image.
In some embodiments, the computer vision processing module 3552 is further configured to perform the following for each frame of image: calling a key point detection model to perform key point detection on the image so as to obtain the position of a key point of an object in the image; determining the position of the key point of the object as enhanced image information corresponding to the image; the key point detection model is obtained by training based on the sample image and the position of the key point of the object marked in the sample image.
In some embodiments, the computer vision processing module 3552 is further configured to perform the following for each frame of image: calling a posture detection model to perform posture detection so as to obtain posture information of an object in the image; and determining the posture information of the object as the enhanced image information corresponding to the image.
In some embodiments, the generating module 3553 is further configured to encode the image and the enhanced image information corresponding to the image to obtain an image code and an enhanced-image-information code corresponding to the image code, respectively; encapsulate the image code into a network abstraction layer unit whose type is image slice, and encapsulate the enhanced-image-information code into a network abstraction layer unit whose type is supplemental enhancement information; and assemble the network abstraction layer unit of the type image slice and the network abstraction layer unit of the type supplemental enhancement information into the video code stream.
Continuing with the exemplary structure of the video processing device 455 implemented as software modules provided by the embodiments of the present application, in some embodiments, as shown in fig. 3B, the software modules stored in the video processing device 455 of the memory 450 may include: a receiving module 4551, a decoding processing module 4552, a fusion processing module 4553, a display module 4554, an enlargement processing module 4555, and a noise reduction processing module 4556.
A receiving module 4551, configured to receive a video code stream, where the video code stream includes multiple frames of images and enhanced image information synchronized with each frame of image; the decoding processing module 4552 is configured to perform decoding processing on the video code stream to obtain multiple frames of images and enhanced image information synchronized with each frame of image; a fusion processing module 4553, configured to perform fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image; and a display module 4554 configured to sequentially display the composite image corresponding to each frame of image in the human-computer interaction interface.
In some embodiments, when the enhanced image information is an image mask, the fusion processing module 4553 is further configured to perform the following processing for each frame of image: masking the image based on an image mask synchronized with the image to obtain a composite image which corresponds to the image and is removed of the background; wherein the image mask is generated by object recognition of the image.
In some embodiments, the receiving module 4551 is further configured to acquire a background image; the fusion processing module 4553 is further configured to merge the background-removed composite image with the background image; the display module 4554 is further configured to display the merged image obtained through merging in the human-computer interaction interface.
In some embodiments, the size of the image mask synchronized with each frame of the image is smaller than the size of the image; the video processing apparatus 455 further includes a zoom-in processing module 4555 configured to perform zoom-in processing on the enhanced image information synchronized with each frame image; the video processing apparatus 455 further includes a noise reduction module 4556 configured to perform noise reduction processing on the enhanced image information obtained after the enlargement processing.
In some embodiments, when the enhanced image information is the position of a key point of an object in the image, the fusion processing module 4553 is further configured to obtain a special effect matched with the key point; and adding special effects at the positions of the key points corresponding to the objects in each frame of image to obtain a composite image which corresponds to each frame of image and is added with special effects.
In some embodiments, when the enhanced image information includes a pose of an object in the image, the fusion processing module 4553 is further configured to perform at least one of the following processing for each frame of image: determining the state corresponding to each object in the image according to the type of the gesture, counting the states of all the objects in the image, adding a statistical result into the image, and obtaining a composite image which corresponds to the image and is added with the statistical result; and acquiring a special effect matched with the posture of the object in the image, and adding the special effect into the image to obtain a composite image which corresponds to the image and is added with the special effect.
In some embodiments, the decoding processing module 4552 is further configured to read each network abstraction layer unit included in the video code stream and determine the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is image slice, perform a slice decoding operation to obtain the multiple frames of images; and when the type of the read network abstraction layer unit is supplemental enhancement information, perform a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiments above and has similar beneficial effects, and is therefore not repeated. For technical details not exhausted in the description of the video processing apparatus provided in the embodiment of the present application, reference may be made to the description of fig. 4 or fig. 6.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the video processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, a video processing method as illustrated in fig. 4 or fig. 6.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the enhanced image information synchronized with each frame of image is integrated in the video code stream, so that after the receiving end receives the video code stream, each frame of image and the enhanced image information synchronized with each frame of image can be fused to obtain a composite image corresponding to each frame of image, and then the composite image corresponding to each frame of image is sequentially displayed in the human-computer interaction interface, so that the original image of the video and other image information (such as a mask image, key point information and the like) can be accurately and synchronously displayed, and the display mode of the video is enriched.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method of video processing, the method comprising:
receiving a video code stream, wherein the video code stream comprises a plurality of frames of images and enhanced image information synchronous with each frame of image;
decoding the video code stream to obtain the multi-frame images and enhanced image information synchronous with each frame of image;
performing fusion processing on each frame of image and the enhanced image information synchronous with each frame of image to obtain a composite image corresponding to each frame of image;
and sequentially displaying the composite images corresponding to the images of each frame in a human-computer interaction interface.
2. The method according to claim 1, wherein when the enhanced image information is an image mask, the fusing the each frame of image and the enhanced image information synchronized with the each frame of image to obtain a composite image corresponding to the each frame of image comprises:
performing the following processing for each frame image:
masking the image based on an image mask synchronized with the image to obtain a composite image which corresponds to the image and is removed of a background;
wherein the image mask is generated by object recognition of the image.
3. The method of claim 2, wherein before the composite image corresponding to each frame of image is sequentially displayed in the human-computer interaction interface, the method further comprises:
acquiring a background image;
and merging the composite image without the background with the background image, and displaying a merged image obtained by merging in a human-computer interaction interface.
4. The method of claim 2,
the size of the image mask synchronized with each frame image is smaller than the size of the image;
before the fusion processing is performed on each frame of image and the enhanced image information synchronized with each frame of image, the method further includes:
and amplifying the enhanced image information synchronous with each frame of image, and performing noise reduction processing on the enhanced image information obtained after amplification processing.
5. The method according to claim 1, wherein when the enhanced image information is a position of a key point of an object in the image, the fusing the each frame of image and the enhanced image information synchronized with the each frame of image to obtain a composite image corresponding to the each frame of image comprises:
acquiring a special effect matched with the key points;
and adding the special effect to the position of the key point corresponding to the object in each frame of image to obtain a composite image which corresponds to each frame of image and is added with the special effect.
6. The method according to claim 1, wherein when the enhanced image information includes a posture of an object in the image, the performing fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image comprises:
performing at least one of the following processes for each frame of image:
determining the state corresponding to each object in the image according to the type of the posture, counting the states of all the objects in the image, and adding the statistical result into the image to obtain a composite image which corresponds to the image and to which the statistical result is added;
and acquiring a special effect adapted to the posture of the object in the image, and adding the special effect into the image to obtain a composite image which corresponds to the image and to which the special effect is added.
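A hedged sketch of the first branch, counting per-object states derived from the posture type and drawing the tally onto the frame (the posture-to-state mapping below is purely hypothetical):

```python
import cv2
import numpy as np

# Hypothetical mapping from a detected posture type to the state it represents.
POSTURE_TO_STATE = {"hand_raised": "answering", "standing": "presenting"}

def add_posture_statistics(frame: np.ndarray, postures: list[str]) -> np.ndarray:
    """Count the states of all objects in the frame and overlay the statistics."""
    counts: dict[str, int] = {}
    for posture in postures:                          # one entry per detected object
        state = POSTURE_TO_STATE.get(posture, "other")
        counts[state] = counts.get(state, 0) + 1
    out = frame.copy()
    y = 30
    for state, n in counts.items():
        cv2.putText(out, f"{state}: {n}", (10, y),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        y += 30                                       # next line of the on-screen tally
    return out
```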
7. A method of video processing, the method comprising:
acquiring a plurality of frames of images;
performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image;
generating a video code stream comprising each frame of image and enhanced image information corresponding to each frame of image;
sending the video code stream;
the video code stream is configured to be decoded by a terminal device to display a composite image corresponding to each frame of image, the composite image being obtained by performing fusion processing on each frame of image and the enhanced image information synchronized with each frame of image.
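A non-authoritative sketch of this sending-end flow (the `run_cv`, `pack`, and `send` callables are placeholders for the computer-vision routine, the encoder hook that keeps a frame and its enhancement in the same stream unit, and the transport, respectively):

```python
def produce_stream(frames, run_cv, pack, send):
    """Sending-end loop: per-frame computer vision, packing, and transmission."""
    for frame in frames:
        enhanced = run_cv(frame)        # mask, key points, posture, ... for this frame
        unit = pack(frame, enhanced)    # frame and its enhanced information share one stream unit
        send(unit)                      # ship the unit of the video code stream to the receiver
```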
8. The method according to claim 7, wherein when the enhanced image information is an image mask, the performing computer vision processing on each acquired frame of image to obtain the enhanced image information corresponding to each frame of image comprises:
performing the following processing for each frame image:
calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, and generating an image mask corresponding to the background;
wherein the image segmentation model is trained based on a sample image and an object annotated in the sample image.
9. The method of claim 8, wherein the calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, and generating an image mask corresponding to the background comprises:
using the image as an input of the image segmentation model to determine an object in the image, taking a region outside the object as a background, and generating an image mask corresponding to the background, wherein the size of the image mask is consistent with that of the image; or
reducing the size of the image, using the reduced image as an input of the image segmentation model to determine an object in the reduced image, taking a region outside the object as a background, and generating an image mask corresponding to the background, wherein the size of the image mask is smaller than that of the image.
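A sketch of the second alternative, in which the model runs on a down-scaled copy so the resulting mask (and hence the enhanced image information carried in the code stream) stays small; `segmentation_model` is an assumed callable, not an API of this application:

```python
import cv2
import numpy as np

def segment_at_reduced_size(frame: np.ndarray, segmentation_model,
                            scale: float = 0.25) -> np.ndarray:
    """Run segmentation on a reduced copy and keep the mask at the reduced size."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (max(1, int(w * scale)), max(1, int(h * scale))))
    mask = segmentation_model(small)    # object / background prediction at reduced resolution
    return mask                         # smaller than the frame; the receiver enlarges it (claim 4)
```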
10. The method of claim 7, wherein the performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image comprises:
performing the following processing for each frame image:
calling a key point detection model to perform key point detection on the image to obtain positions of key points of an object in the image;
determining the positions of the key points of the object as the enhanced image information corresponding to the image;
wherein the key point detection model is trained based on a sample image and positions of key points of an object annotated in the sample image.
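One way the key-point positions might be packaged as per-frame enhanced image information (a sketch; `keypoint_model` and the dictionary layout are assumptions):

```python
import numpy as np

def keypoints_as_enhanced_info(frame: np.ndarray, keypoint_model) -> list[dict]:
    """Convert key-point detections into a compact per-frame payload.

    keypoint_model is assumed to return one (x, y) position array per detected
    object; only the positions are kept, so the payload added to the code
    stream stays much smaller than the image itself.
    """
    detections = keypoint_model(frame)
    return [{"object_id": i, "keypoints": np.asarray(kps, dtype=np.int32).tolist()}
            for i, kps in enumerate(detections)]
```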
11. The method of claim 7, wherein the performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image comprises:
performing the following processing for each frame image:
calling a posture detection model to perform posture detection on the image to obtain posture information of an object in the image;
and determining the posture information of the object as the enhanced image information corresponding to the image.
12. A video processing apparatus, characterized in that the apparatus comprises:
a receiving module, configured to receive a video code stream, wherein the video code stream comprises a plurality of frames of images and enhanced image information synchronized with each frame of image;
a decoding processing module, configured to decode the video code stream to obtain the plurality of frames of images and the enhanced image information synchronized with each frame of image;
a fusion processing module, configured to perform fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image;
and a display module, configured to sequentially display the composite image corresponding to each frame of image in a human-computer interaction interface.
13. A video processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a plurality of frames of images;
a computer vision processing module, configured to perform computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image;
a generating module, configured to generate a video code stream comprising each frame of image and the enhanced image information corresponding to each frame of image;
a sending module, configured to send the video code stream;
wherein the video code stream is configured to be decoded by a terminal device to display a composite image corresponding to each frame of image, the composite image being obtained by performing fusion processing on each frame of image and the enhanced image information synchronized with each frame of image.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor, configured to implement the video processing method of any one of claims 1 to 6, or any one of claims 7 to 11, when executing the executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable instructions which, when executed, implement the video processing method of any one of claims 1 to 6, or any one of claims 7 to 11.
CN202011478292.1A 2020-12-15 2020-12-15 Video processing method and device, electronic equipment and computer readable storage medium Pending CN114640882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011478292.1A CN114640882A (en) 2020-12-15 2020-12-15 Video processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114640882A true CN114640882A (en) 2022-06-17

Family

ID=81944827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011478292.1A Pending CN114640882A (en) 2020-12-15 2020-12-15 Video processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114640882A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050104899A1 (en) * 2003-11-19 2005-05-19 Genesis Microchip Inc. Real time data stream processor
CN104685858A (en) * 2012-09-28 2015-06-03 阿尔卡特朗讯 Immersive videoconference method and system
CN109173263A (en) * 2018-08-31 2019-01-11 腾讯科技(深圳)有限公司 A kind of image processing method and device
CN109348252A (en) * 2018-11-01 2019-02-15 腾讯科技(深圳)有限公司 Video broadcasting method, video transmission method, device, equipment and storage medium
CN110475150A (en) * 2019-09-11 2019-11-19 广州华多网络科技有限公司 The rendering method and device of virtual present special efficacy, live broadcast system
CN111935491A (en) * 2020-06-28 2020-11-13 百度在线网络技术(北京)有限公司 Live broadcast special effect processing method and device and server

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611638A (en) * 2023-12-07 2024-02-27 北京擎锋精密科技有限公司 Multi-target tracking method for vehicles and pedestrians based on image processing
CN117611638B (en) * 2023-12-07 2024-05-17 北京擎锋精密科技有限公司 Multi-target tracking method for vehicles and pedestrians based on image processing

Similar Documents

Publication Publication Date Title
CN109831638B (en) Video image transmission method and device, interactive intelligent panel and storage medium
US7852368B2 (en) Method and apparatus for composing images during video communications
US20140267583A1 (en) Augmented Video Calls on Mobile Devices
CN111402399B (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
CN109076246A (en) Use the method for video coding and system of image data correction mask
US10645360B2 (en) Methods and systems for transmitting data in a virtual reality system
JP6333858B2 (en) System, apparatus, and method for sharing a screen having multiple visual components
CN110351564B (en) Clear-text video compression transmission method and system
CN110809173B (en) Virtual live broadcast method and system based on AR augmented reality of smart phone
CN103503455A (en) System and method for video caption re-overlaying for video adaptation and retargeting
CN113557729B (en) Video code stream decoding method, system, device and electronic equipment
US10958950B2 (en) Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices
CN113542875B (en) Video processing method, device, electronic equipment and storage medium
CN103248830A (en) Real-time video combination method for augmented reality scene of mobile intelligent terminal
CN111193928B (en) Method and apparatus for delivering region of interest information in video
US11587263B2 (en) Method and apparatus for enhanced patch boundary identification for point cloud compression
CN107580228B (en) Monitoring video processing method, device and equipment
CN114640882A (en) Video processing method and device, electronic equipment and computer readable storage medium
CN111406404B (en) Compression method, decompression method, system and storage medium for obtaining video file
US20240144429A1 (en) Image processing method, apparatus and system, and storage medium
CN114531528B (en) Method for video processing and image processing apparatus
CN114827620A (en) Image processing method, apparatus, device and medium
CN113810725A (en) Video processing method, device, storage medium and video communication terminal
CN116962742A (en) Live video image data transmission method, device and live video system
KR101700821B1 (en) Scalable remote screen providing method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40071987
Country of ref document: HK