CN116708809A - Processing method and device - Google Patents

Processing method and device

Info

Publication number
CN116708809A
CN116708809A
Authority
CN
China
Prior art keywords
sei
code stream
target image
information
position information
Prior art date
Legal status
Pending
Application number
CN202310684369.8A
Other languages
Chinese (zh)
Inventor
牛海亮
Current Assignee
Shenzhen Huawei Cloud Computing Technology Co ltd
Original Assignee
Shenzhen Huawei Cloud Computing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huawei Cloud Computing Technology Co ltd filed Critical Shenzhen Huawei Cloud Computing Technology Co ltd
Priority to CN202310684369.8A
Publication of CN116708809A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 Embedding additional information in the video signal during the compression process
    • H04N19/463 Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066 Session management
    • H04L65/1083 In-session procedures
    • H04L65/1089 In-session procedures by adding media; by removing media
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066 Session management
    • H04L65/1083 In-session procedures
    • H04L65/1095 Inter-network session transfer or sharing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the application provides a processing method and a processing device, which relate to the technical field of the Internet. The method comprises the following steps: an SEI field is embedded in the code stream obtained after an image is encoded, and the SEI field carries the position information of a region of interest in the image, so that the position information of the region of interest of the image can be transmitted effectively, which facilitates image processing based on the position information.

Description

Processing method and device
Technical Field
The embodiments of the present application relate to the field of information technology (IT), and in particular to a processing method and device.
Background
In current video communication, there are application scenarios in which the region of interest and the non-region of interest of an image in the video stream need to be processed differently.
For example, in a video conference scenario, in order to highlight the portrait region of interest in the video picture, processing such as background replacement and background blurring may be performed on the video picture. In addition, in order to enhance the participants' immersion in the shared content of the video conference, the portrait region may be superimposed on top of the shared content. For another example, the portrait images of the participants may be composited into a particular scene in a frame mode. The immersion and experience of the video conference are thereby improved.
However, the image processing schemes in the related art have limited effectiveness.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a processing method and a processing device. In the method, an SEI field is embedded in the encoded code stream of an image, and the SEI field carries the position information of the region of interest in the image, so that the position information of the region of interest of the image can be transmitted effectively, which facilitates image processing based on the position information.
In one possible embodiment, the application provides a processing method. The method comprises the following steps: acquiring a first code stream, wherein the first code stream is a code stream obtained by encoding a target image; adding supplemental enhancement information (SEI) of the target image to the first code stream to obtain a second code stream, wherein the SEI includes position information for indicating the position of a region of interest in the target image; and sending the second code stream.
Alternatively, the method may be applied to a transmitting end, which may be an electronic device, or a software program installed on an electronic device, or the like, and the transmitting end may also be a server, or a software program installed on a server, or the like.
Wherein the target image may be one or more frames.
The encoded data of one frame of image is referred to as a first code stream; the encoded data of multiple frames of images corresponds to a plurality of consecutive first code streams.
Wherein the encoded first code stream supports adding an SEI field.
The transmitting end may add an SEI field to the first bitstream after image encoding, and carry the location information through the SEI field.
The present application may add an SEI field to the encoded first code stream after the image is encoded and before the encoded data of the image is transmitted over a network, with the SEI field carrying the position information. In this way, the position information of the region of interest in the target image is effectively carried in the code stream of the video or image. Because the position information is embedded in the first code stream after image encoding, the scheme can be effectively decoupled from the encoder. Moreover, a frame of the target image and its position information are transmitted in a single second code stream, so the image and its position information do not need to be synchronized; the scheme is not overly complex to implement, and even in a frame-loss scenario the problem of synchronizing the image with its position information does not arise. When the method is used to transmit the position information of the region of interest in an image, it has low implementation complexity, can be decoupled from the encoder, and facilitates subsequently highlighting the region of interest of the image by using the transmitted position information, thereby improving, for example, the immersion and experience of a video conference in a video transmission scenario.
In one possible embodiment, the method further comprises: acquiring the position information; performing a first operation on the position information, wherein the data size of the position information after the first operation is smaller than the data size of the position information before the first operation; the SEI includes position information after the first operation.
For example, if the region of interest is a portrait region, the transmitting end may process the target image before encoding by using a portrait segmentation technique to obtain mask information of the portrait in the target image. The mask information is one example of the position information, and the position information is not limited to mask information.
In addition, the manner of acquiring the position information to be transmitted is not limited to the example here, and other implementation manners may also be used, which is not limited herein.
Because the second code stream adds the position information on top of the original first code stream, in order to reduce excessive occupation of network bandwidth caused by transmitting the position information and to reduce the influence on the network transmission speed, the transmitting end may perform a first operation on the position information before embedding it into the first code stream. The first operation may include, but is not limited to, downsampling, compression, and the like, so that the position information after the first operation has a smaller data amount and occupies less bandwidth.
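For illustration only, a minimal Python sketch of such a first operation is given below, assuming a binary mask, 4x downsampling, and zlib lossless compression; the function and parameter names are assumptions for illustration and are not taken from the application:

```python
import zlib
import numpy as np

def first_operation(mask: np.ndarray, factor: int = 4) -> bytes:
    """Shrink a 0/1 portrait mask and losslessly compress it to save bandwidth."""
    # Downsample by keeping every `factor`-th pixel in each dimension.
    small = mask[::factor, ::factor].astype(np.uint8)
    # Record the downsampled height and width so the receiver can restore the shape.
    header = small.shape[0].to_bytes(2, "big") + small.shape[1].to_bytes(2, "big")
    # Pack 8 mask bits per byte, then apply lossless compression.
    packed = np.packbits(small, axis=None).tobytes()
    return header + zlib.compress(packed)
```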
In one possible implementation, the SEI further includes information indicating that the load type is a custom type.
Considering that the first code stream obtained after encoding the target image may already carry SEI whose load types are defined by the encoding protocol, in order to distinguish the SEI embedded in the second code stream from the SEI originally present in the first code stream, the load type of the embedded SEI is set to a custom type, so that the receiving end can conveniently locate the SEI carrying the position information in the second code stream by further using the load type of the SEI.
In a possible implementation manner, the adding the SEI of the target image to the first code stream includes: adding a network abstraction layer unit to the first code stream, wherein the type of the network abstraction layer unit is the SEI type and the main content of the network abstraction layer unit is the SEI.
The first code stream may be composed of a plurality of network abstraction layer (NAL) units, and a NAL unit may be composed of a NAL header and NAL body content. Each NAL unit has a NAL type (carried in the NAL header), and the SEI type is one NAL type.
Then to embed the SEI in the first bitstream (which may also be described in terms of implementation as embedding the SEI field), a NAL unit may be added to the first bitstream, and the added NAL unit is of the SEI type, while the main content of the NAL unit is the SEI. The SEI may include the location information. This achieves the process of adding the SEI in the first bitstream.
For example, the data structure of the SEI may be the data structure of an SEI field defined by the H264 or H265 standard protocol (e.g. including a payload type, a payload size, and a payload content). The custom type is then set as the value of the payload type field in the new NAL unit, the location information is set as the value of the payload content field in the new NAL unit, and the payload size in the new NAL unit is the size of the location information.
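As a sketch only, the SEI field layout described above could be serialized as follows; the custom payload type value (100) is an assumption, and emulation-prevention bytes are omitted for brevity:

```python
def build_sei_field(payload: bytes, payload_type: int = 100) -> bytes:
    """Lay out payload type / payload size / payload content as in an H264/H265 SEI field."""
    def encode_value(value: int) -> bytes:
        # Values of 255 or more are spread over repeated 0xFF bytes plus a final byte.
        return b"\xff" * (value // 255) + bytes([value % 255])
    sei = encode_value(payload_type) + encode_value(len(payload)) + payload
    return sei + b"\x80"  # trailing bits terminating the SEI payload
```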
In one possible embodiment, the method further comprises: adding the SEI at the start position or the end position of the first code stream.
In order to facilitate the receiving end quickly locating the SEI carrying the position information in the second code stream, the NAL unit of the present application may be added at the start position (e.g., before the first NAL unit) or the end position (e.g., after the last NAL unit) of the first code stream, where the NAL unit is a NAL unit of the SEI type and the main content of the NAL unit is the SEI, so as to implement adding the SEI to the first code stream.
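A minimal sketch of adding such a NAL unit at the start position, assuming an Annex-B style bitstream and the H264 SEI NAL header byte 0x06; the names are illustrative assumptions:

```python
def add_sei_to_bitstream(first_bitstream: bytes, sei_field: bytes) -> bytes:
    """Prepend a new SEI-type NAL unit (start code + NAL header 0x06) to the first code stream."""
    nal_unit = b"\x00\x00\x00\x01" + b"\x06" + sei_field
    return nal_unit + first_bitstream  # added at the start position of the first code stream
```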
In one possible implementation manner, the first code stream is a code stream obtained by encoding the target image using an H264 or H265 standard protocol.
The first code stream coded by adopting the H264 or H265 standard protocol can support carrying SEI, thereby facilitating the new SEI and writing the position information in the new SEI.
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In one possible implementation manner, the method is applied to a cloud data center, and a cloud management platform is used for managing an infrastructure that provides cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of regions, and at least one cloud data center is arranged in each region.
The cloud data center may include a computing node, the sender may be a computing node, or the sender may be a virtual instance (e.g., a virtual machine or container) in the computing node.
In one possible embodiment, the application provides a processing method. The method comprises the following steps: receiving a second code stream, wherein the second code stream is a code stream obtained by encoding a target image and carries the SEI of the target image; determining the SEI in the second code stream; and extracting position information from the SEI, the position information being used to indicate the position of a region of interest in the target image.
Alternatively, the method may be applied to a receiving end, which may be an electronic device, or a software program installed on an electronic device, or the like, and the receiving end may also be a server, or a software program installed on a server, or the like.
The second code stream is a code stream received by the receiving end from the sending end.
The receiving end may locate the SEI in the received second bitstream and extract the location information from the located SEI, so as to process the target image using the location information.
The receiving end extracts the SEI from the second code stream before decoding the second code stream, so the process of extracting the SEI or the position information from the second code stream can be decoupled from the decoder. Moreover, the receiving end does not need to synchronize the position information with the encoded code stream of the target image, because the SEI and the encoded code stream are transmitted to the receiving end in one code stream and are thus bound together. The method therefore has lower implementation complexity and higher efficiency in acquiring the position information.
In a possible implementation manner, after the extracting the position information in the SEI, the method further includes: and performing a second operation on the position information, wherein the data volume of the position information after the second operation is larger than the data volume of the position information in the SEI.
For example, the position information embedded in the second code stream has undergone the preprocessing operation at the transmitting end. In order to recover it, the receiving end may perform a second operation (such as decompression, upsampling, etc.) on the position information after extracting it, thereby recovering the position information, so that the target image can be processed with more accurate position information and the display effect of the region of interest in the video image is improved.
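A companion sketch of such a second operation, assuming the packing and header layout used in the earlier first_operation() sketch (an assumption for illustration, not part of the application):

```python
import zlib
import numpy as np

def second_operation(blob: bytes, full_height: int, full_width: int) -> np.ndarray:
    """Decompress and upsample the mask carried in the SEI back to the image resolution."""
    h = int.from_bytes(blob[0:2], "big")
    w = int.from_bytes(blob[2:4], "big")
    bits = np.unpackbits(np.frombuffer(zlib.decompress(blob[4:]), dtype=np.uint8))
    small = bits[: h * w].reshape(h, w)
    # Nearest-neighbour upsampling back to the resolution of the decoded target image.
    rows = np.minimum(np.arange(full_height) * h // full_height, h - 1)
    cols = np.minimum(np.arange(full_width) * w // full_width, w - 1)
    return small[np.ix_(rows, cols)]
```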
In a possible implementation manner, the determining the SEI in the second code stream includes: determining, in the second code stream, the SEI whose load type is the custom type.
When locating the SEI in the second code stream, considering that the second code stream may include a plurality of SEIs, in order to accurately locate the SEI that carries the position information, the receiving end may search the second code stream for the SEI whose load type is the custom type and use it as the SEI required by the present application.
In a possible implementation manner, the determining the SEI in the second bitstream includes: and determining a network abstraction layer unit with a type of SEI type in the second code stream.
In some implementations, the second code stream is composed of multiple NAL units, and the SEI is also carried in a type of NAL unit. In order to locate the SEI, the receiving end may locate, in the second code stream, the NAL unit whose NAL type is the SEI type (optionally, further checking that the payload type is the custom type) to find the NAL unit carrying the position information required by the present application.
Wherein the information indicating the SEI type may be an identifier of the SEI field (for example, 0000000106 is an identifier of the SEI field in the H264 standard protocol). The identifier is located at the header of the NAL unit.
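By way of illustration only, a simplified sketch of scanning an Annex-B code stream for that identifier (it ignores 3-byte start codes and emulation-prevention bytes):

```python
def find_sei_nal_units(bitstream: bytes) -> list:
    """Return the offsets of NAL units whose identifier matches 00 00 00 01 06 (H264 SEI)."""
    marker = b"\x00\x00\x00\x01\x06"
    offsets, pos = [], bitstream.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = bitstream.find(marker, pos + 1)
    return offsets
```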
In a possible implementation manner, the extracting the position information in the SEI includes: extracting location information from the body content of the network abstraction layer unit.
After locating the required NAL unit whose type is the SEI type, the receiving end may read the main content of that NAL unit to obtain the SEI.
For example, the body content may include an SEI (also referred to as an SEI field), and the data structure of the SEI field may include a payload type (e.g., a custom type), a payload size (e.g., a size of location information), and a payload content (e.g., location information), and then the receiving end may read the SEI from the located NAL unit to obtain the payload content and obtain the location information.
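A minimal parsing sketch mirroring the simplified layout of the earlier build_sei_field() sketch; the custom payload type value is an assumption:

```python
CUSTOM_PAYLOAD_TYPE = 100  # assumed value of the custom type

def parse_sei_field(sei: bytes):
    """Read payload type, payload size and payload content; return the content
    only when the payload type is the custom type carrying position information."""
    pos = 0

    def decode_value() -> int:
        nonlocal pos
        value = 0
        while sei[pos] == 0xFF:
            value += 255
            pos += 1
        value += sei[pos]
        pos += 1
        return value

    payload_type = decode_value()
    payload_size = decode_value()
    payload = sei[pos:pos + payload_size]
    return payload if payload_type == CUSTOM_PAYLOAD_TYPE else None
```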
In one possible implementation, the SEI is located at the start position or the end position of the first code stream.
In one possible embodiment, the method further comprises: acquiring the target image obtained by decoding the second code stream; determining a target position of the region of interest in the target image based on the position information; based on the target position, the region of interest and the region outside the region of interest (also called non-region of interest) in the target image are processed differently.
The application can utilize the position information to enable the receiving end to carry out different image processing (such as color processing, transparency processing and the like) on the interested area and the non-interested area in the target image so as to meet the image processing requirements in various application scenes.
In a possible implementation manner, when the region of interest and the region of no interest are processed in different manners, the processed image parameters may include, but are not limited to, at least one of the following: transparency, gray scale, color, etc.
Taking transparency as an example of the processed image parameter, the receiving end can set different transparencies for the region of interest and the non-region of interest in the target image based on the position information. For example, the transparency of the pixel points of the non-region of interest is set to 100%, and the transparency of the pixel points of the region of interest is set to 0% (i.e. opaque), so that the region of interest in the processed target image is highlighted and the background is made transparent.
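For illustration, a minimal sketch of this transparency processing, assuming an RGB image array and the 0/1 mask convention described elsewhere in this application:

```python
import numpy as np

def apply_transparency(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Make the non-region-of-interest fully transparent (alpha 0) and keep the
    region of interest opaque (alpha 255)."""
    alpha = np.where(mask == 1, 255, 0).astype(np.uint8)
    return np.dstack([rgb, alpha])  # RGBA image with a transparent background
```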
In a possible implementation manner, the obtaining the target image obtained by decoding the second code stream includes: deleting the SEI in the second code stream; and decoding the second code stream after deleting the SEI by adopting an H264 or H265 standard protocol to acquire the target image.
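A sketch of stripping the added SEI NAL unit before handing the stream to a standard H264/H265 decoder, assuming the unit was inserted at the start position as in the earlier embedding sketch:

```python
def remove_leading_sei(second_bitstream: bytes) -> bytes:
    """Delete the SEI NAL unit added at the start position of the code stream."""
    marker = b"\x00\x00\x00\x01\x06"
    if second_bitstream.startswith(marker):
        # The original first NAL unit begins at the next start code.
        nxt = second_bitstream.find(b"\x00\x00\x00\x01", len(marker))
        return second_bitstream[nxt:] if nxt != -1 else b""
    return second_bitstream
```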
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In one possible implementation manner, the method is applied to a cloud data center, and a cloud management platform is used for managing an infrastructure that provides cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of regions, and at least one cloud data center is arranged in each region.
The cloud data center may include a computing node, the receiving end may be a computing node, or the receiving end may be a virtual instance (e.g., a virtual machine or container) in the computing node.
The effects of the processing method applied to the receiving end in the above embodiments are similar to those of the processing method applied to the transmitting end in the above embodiments, and will not be described here again.
In one possible embodiment, the present application provides a processing apparatus. The processing device comprises: the first acquisition module is used for acquiring a first code stream, wherein the first code stream is a code stream obtained by encoding a target image; the processing module is used for adding the SEI of the target image in the first code stream to obtain a second code stream; wherein the SEI includes location information for indicating a location of a region of interest in the target image; and the sending module is used for sending the second code stream.
In one possible embodiment, the apparatus further comprises: the second acquisition module is used for acquiring the position information; the preprocessing module is used for performing first operation on the position information, wherein the data volume of the position information after the first operation is smaller than the data volume of the position information before the first operation; the SEI includes position information after the first operation.
In one possible implementation, the SEI further includes information indicating that the load type is a custom type.
In a possible implementation manner, the processing module is specifically configured to: and adding a network abstraction layer unit into the first code stream, wherein the type of the network abstraction layer unit is SEI type, and the main content of the network abstraction layer unit is SEI.
In a possible implementation manner, the processing module is specifically configured to add the SEI at the start position or the end position of the first code stream.
In one possible implementation manner, the first code stream is a code stream obtained by encoding the target image by using an H264 or H265 standard protocol.
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In a possible implementation manner, a cloud data center comprises the processing device, and a cloud management platform is used for managing an infrastructure that provides cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of regions, and at least one cloud data center is arranged in each region.
The effects of the processing apparatus of each of the above embodiments are similar to those of the processing method applied to the transmitting end of each of the above embodiments, and will not be described here again.
In one possible embodiment, the present application provides a processing apparatus. The processing device comprises: the receiving module is used for receiving a second code stream, wherein the second code stream is the code stream of the target image after being coded, and the second code stream carries SEI of the target image; a positioning module for determining the SEI in the second bitstream; an extraction module for extracting location information in the SEI, the location information being used to indicate a location of a region of interest in the target image.
In one possible embodiment, the apparatus further comprises: and the post-processing module is used for carrying out a second operation on the position information, and the data volume of the position information after the second operation is larger than the data volume of the position information in the SEI.
In a possible implementation manner, the positioning module is specifically configured to determine, in the second code stream, the SEI whose load type is the custom type.
In a possible implementation manner, the positioning module is specifically configured to determine, in the second code stream, a network abstraction layer unit with a type of SEI type.
In a possible implementation manner, the extracting module is specifically configured to extract location information from the body content of the network abstraction layer unit.
In one possible implementation, the SEI is located at a start position or an end position of the first bitstream.
In one possible embodiment, the apparatus further comprises: the acquisition module is used for acquiring the target image obtained by decoding the second code stream; and the first processing module is used for processing the region of interest and the region outside the region of interest in the target image in different modes based on the target position.
In a possible implementation manner, the acquiring module is specifically configured to delete the SEI in the second code stream, and decode the second code stream after the SEI is deleted by using the H264 or H265 standard protocol, so as to acquire the target image.
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In a possible implementation manner, a cloud data center comprises the processing device, and a cloud management platform is used for managing an infrastructure that provides cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of regions, and at least one cloud data center is arranged in each region.
The effects of the processing apparatus of the present embodiment are similar to those of the processing method applied to the receiving end of each of the above embodiments, and will not be described here again.
In one possible embodiment, the present application provides a processing apparatus. The processing device includes one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from the memory and to send the signal to the processor, the signal comprising computer instructions stored in the memory; when the processor executes the computer instructions, the processor may implement the method applied to the transmitting end in any of the above embodiments.
The effects of the processing apparatus of the present embodiment are similar to those of the processing method applied to the transmitting end of each of the above embodiments, and will not be described here again.
In one possible implementation, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when run on a computer or processor, causes the computer or processor to perform the method applied to the transmitting end in any one of the above embodiments.
The effects of the computer-readable storage medium of the present embodiment are similar to those of the processing method applied to the transmitting end of each of the above embodiments, and will not be described here again.
In one possible implementation, the present application provides a computer program product. The computer program product contains a software program which, when executed by a computer or processor, causes the method applied to the sender in any of the above embodiments to be performed.
The effects of the computer program product of the present embodiment are similar to those of the processing method applied to the transmitting end of each of the above embodiments, and will not be described here again.
In one possible embodiment, the present application provides a processing apparatus. The processing device includes one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from the memory and to send the signal to the processor, the signal comprising computer instructions stored in the memory; when the processor executes the computer instructions, the processor may implement the method applied to the receiving end in any of the above embodiments.
The effects of the processing apparatus of the present embodiment are similar to those of the processing method applied to the receiving end of each of the above embodiments, and will not be described here again.
In one possible implementation, the present application provides a computer-readable storage medium. The computer readable storage medium stores a computer program which, when run on a computer or processor, causes the computer or processor to perform the method of any of the above embodiments applied to the receiving end.
The effects of the computer-readable storage medium of the present embodiment are similar to those of the processing method applied to the receiving end of each of the above embodiments, and will not be described here again.
In one possible implementation, the present application provides a computer program product. The computer program product comprises a software program which, when executed by a computer or processor, causes the method applied to the receiving end in any of the above embodiments to be performed.
The effects of the computer program product of the present embodiment are similar to those of the processing method applied to the receiving end of each of the above embodiments, and will not be described here again.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an exemplary illustrated system framework;
fig. 2 is a schematic diagram illustrating a data structure of a code stream;
FIG. 3a is a schematic diagram of an exemplary illustrated data processing process;
FIG. 3b is a schematic diagram of an exemplary illustrated data processing process;
FIG. 4 is a schematic structural diagram of an apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.
The terms first and second and the like in the description and in the claims of embodiments of the application, are used for distinguishing between different objects and not necessarily for describing a particular sequential order of objects. For example, the first target object and the second target object, etc., are used to distinguish between different target objects, and are not used to describe a particular order of target objects.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, the plurality of processing units refers to two or more processing units; the plurality of systems means two or more systems.
In current video conferencing, the portrait segmentation technique is relatively mature. The portrait segmentation technique of the video conference can be used for scenarios such as background replacement and background blurring in the video picture. On this basis, in order to further improve the user experience of the video conference, new application scenarios have been derived. For example, in order to enhance the participants' immersion in the shared content of the video conference, the portrait may be superimposed on the shared content. For another example, the portrait images of the participants may be composited into a particular scene in a frame mode. The immersion and experience of the video conference are thereby improved.
In order to apply the portrait obtained by the portrait segmentation technique to the above scenarios and achieve effects such as background replacement, background blurring, superimposing the portrait on shared content, and compositing participants' portraits, the area other than the portrait area (which may be referred to as the background area) needs to be made transparent in the pictures displayed by both the transmitting end and the receiving end of the video.
In the above scenario, the region of interest is a portrait region, and in other application scenarios, the region of interest may be other types of regions, which may be set according to requirements, which is not limited herein.
In the related art, in order to make the background of the portrait in the video frame displayed by the receiving end of the video transparent, the following three technical schemes mainly exist.
In related art 1, the encoding unit at the transmitting end may directly encode an original image carrying a transparency channel; the encoded image data is then transmitted to the decoding unit of the receiving end over a network; finally, the decoding unit is responsible for decoding the image information with the transparency channel.
In related art 1, the encoding unit needs to support encoding of the transparency channel, but only specific encoders (e.g., the VP9 encoder) support it. The scheme can therefore only be adapted to those specific encoders, cannot be used with arbitrary encoders, lacks general applicability, and has a narrow application range.
In related art 2, the encoding unit of the transmitting end only encodes the original images of the video stream to be transmitted, which is acquired by the camera of the local end; the transmitting end then transmits the encoded original image data to the decoding unit of the receiving end over a network; the decoding unit of the receiving end decodes the original image; finally, a portrait segmentation unit at the receiving end segments the decoded original image, generates mask information of the portrait, and uses the mask to process the original image so as to make the background part of the original image fully transparent. In addition, when the transmitting end displays the video stream acquired by its own camera, the portrait segmentation unit of the transmitting end also has to segment the original images in the video stream to generate a mask of the portrait, and then use the mask to process the original images so as to make the portrait background in the locally displayed video stream transparent.
In the related art 2, for the same frame of original image, both the transmitting end and the receiving end need to perform image segmentation, which may cause waste of resources.
In related art 3, the transmitting end transmits the encoded original image and the mask obtained by segmenting the original image to the receiving end as two separate code streams over the network; after receiving them, the receiving end decodes the code stream of the encoded original image and the code stream of the mask respectively; the receiving end then synchronizes the decoded original image with the mask, because the original image and the mask correspond one to one and a mismatch would lead to inaccurate segmentation; finally, based on the synchronized mask and original image, the receiving end uses the mask to segment the portrait in the original image so as to make the background part of the original image transparent.
In related art 3, the mask and the original image need to be synchronized, otherwise the segmentation at the receiving end will be inaccurate, and the scheme for implementing synchronization is complex.
Therefore, in the related art, in order to make the portrait background transparent in a video conference scenario, the following technical problems mainly exist: only a specific type of encoder can be used, resources are wasted at both the transmitting end and the receiving end, and synchronization is required between the original image and its mask.
In order to solve the problem of effectively carrying the mask information (transparency information) of a portrait in video communication, the present application provides a method in which the mask information obtained by performing portrait segmentation on a frame of the original image is embedded, as video auxiliary enhancement information, into a supplemental enhancement information (SEI) field in the encoded code stream of that frame of the original image, so that the mask information is transmitted to the receiving end together with the code stream of the original image in one code stream, thereby solving the problem of effectively carrying the mask information of the portrait.
The method embeds the mask information in the single encoded code stream of a frame of the original image, so that the mask information and its original image are synchronized simply and efficiently; and because the mask information is embedded into the code stream after the original image is encoded, the implementation of the method can be decoupled from the encoder and the decoder.
The receiving end of the method can extract the mask information from the received single code stream of a frame of the original image in which the mask information is embedded, and use the mask information to make the portrait background of the decoded original image transparent. In the method, the receiving end does not need to perform portrait segmentation processing on the original image to obtain the mask information, which saves resources.
Fig. 1 is a schematic diagram of the system framework of the present application as exemplarily shown.
The system shown in fig. 1 may include a transmitting end 100 and a receiving end 200, and the application scenario of the system may be various video service scenarios. For example, a video conference scene, a video phone scene, an online education scene, a remote coaching scene, a low-delay live broadcast scene, a cloud game scene, a wireless screen inter-projection scene, a wireless expansion screen scene, and the like, which are not limited by the embodiments of the present application.
The transmitting end 100 and the receiving end 200, respectively, may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The transmitting end 100 and the receiving end 200 may have more or less modules than those shown in the drawings, may combine two or more modules, or may have different module configurations, to which the present application is not limited.
Illustratively, the sender 100 may be deployed at a first device, which may include, but is not limited to: a server and an electronic device, which may be a PC (Personal Computer), a notebook, a tablet, a cell phone, a wearable device, etc.
Illustratively, the receiving end 200 may be deployed at a second device, which may include, but is not limited to: an electronic device, which may be a PC (Personal Computer), notebook, tablet, cell phone, wearable device, etc.
For example, in a video conferencing scenario, the first device may be a PC or notebook or server and the second device may be a PC or notebook.
For example, in an online education scenario, the first device may be a PC or a notebook computer, and the second device may be a tablet computer.
For example, in a cloud game scenario, the first device may be a server and the second device may be a tablet, or a PC, or a notebook, or a cell phone, etc.
As shown in fig. 1, the transmitting end 100 may include an embedding module 102, optionally, a preprocessing module 101, optionally, a portrait segmentation module 103, and optionally, an encoding module 104.
As shown in fig. 1, the receiving end 200 may include a positioning extraction module 201, optionally a post-processing module 202, optionally a decoding module 203, and optionally a background transparent processing module 204.
Each of the modules in the transmitting end 100 and the receiving end 200 may be a software module or a hardware module, which is not limited in the present application. It should be understood that fig. 1 is only an example of a transmitting end 100 and a receiving end 200, and that the transmitting end 100 and the receiving end 200 may have more or less modules than those shown in fig. 1 in other embodiments of the present application, and different modules may be combined or split, which is not limited in this respect.
In fig. 1, taking a frame of image as a processing object as an example, the implementation process of the method and the system of the present application is described, and in various video service scenarios, the processing object may be a multi-frame image in a video, and the principle of the implementation process is similar and will not be repeated here.
Taking a video conference scenario as an example, a specific process of implementing the method of the present application by each module in the system shown in fig. 1 is described.
The transmitting end 100 may collect the video stream and process each frame of image in the video stream according to the processing procedure of each module in the transmitting end 100 in fig. 1.
For example, the transmitting end 100 may collect the video stream through a camera of the first device, for example, the video stream is a video stream of a participant of the first device collected by a front camera of the first device.
In other application scenarios, the transmitting end 100 may collect multiple frames of images output by a video card (also called a display card) to collect the video stream.
In other application scenarios, the sending end 100 may also collect the video stream by means of screen capturing.
With continued reference to fig. 1, for each frame of original image in the video stream collected by the sender 100, the sender 100 may perform the following processing procedure:
the portrait segmentation module 103 may be configured to perform portrait segmentation on a frame of the original image by using a portrait segmentation technique, so as to obtain an original mask (an example of a first mask) of the portrait (including the face, and optionally other parts of the human body) in that frame of the original image.
The original image may be any one of the following: a RAW image, an RGB (red, green, blue) image, or a YUV image ("Y" denotes luminance (luma), and "U" and "V" denote chrominance (chroma)); the application is not limited in this respect.
The portrait segmentation technique used by the portrait segmentation module 103 may be any technique, whether existing or developed in the future, that can segment a portrait region from an original image, and is not limited herein.
Wherein, the image resolution of the first mask is the same as the image resolution of the original image.
In the embodiment of the application, the method aims at processing different transparencies of a portrait area and an area (called a background area) outside the portrait in an original image. Therefore, in this embodiment, the first mask may be a pixel matrix with a value of 1 for a pixel point corresponding to a portrait position in the original image and a value of 0 for a pixel point corresponding to a position other than the portrait position in the original image, and the resolution of the first mask is the same as that of the original image.
Alternatively, in another embodiment, the first mask may also be a pixel matrix with a value of 0 for the pixel points corresponding to the portrait position in the original image and a value of 1 for the pixel points corresponding to positions other than the portrait position in the original image.
The effect of the mask of the present application is that, when performing transparency processing on the original image using the recovered third mask, the receiving end 200 can set different transparencies for the portrait area and the background area. In other words, the mask information of the present application may be used to indicate how to apply different transparency (or other image or pixel parameters, such as color, gray scale, etc.) to different areas in the original image. In other embodiments, each mask in fig. 1 may be replaced with other indication information (which may also be expressed as position information) capable of indicating the position of the region of interest in the original image, so that the receiving end 200 can, based on the indication information carried in the SEI field of the second code stream, identify the pixel points of the decoded original image that are located in the region of interest, likewise identify the pixel points of the non-region of interest (for example, the region other than the region of interest), and set different parameters (for example, transparency) for the pixel points of the region of interest and those of the non-region of interest.
For example, if the indication information is a mask, the pixel position with a value of 1 in the mask belongs to the region of interest, and the pixel position with a value of 0 belongs to the region of no interest.
With continued reference to fig. 1, the encoding module 104 may be configured to encode the above-mentioned frame of original image according to the H264 or H265 standard protocol, so as to obtain an encoded one-way code stream of the frame of original image, where the one-way code stream may be represented as a first code stream of H264/H265.
The H264 and H265 protocols agree that SEI fields are arranged in the code stream coded by the protocols, and the SEI fields are optional fields agreed by the H264 and H265 protocols.
Of course, in other embodiments, some other known or future developed codec protocols that can support SEI fields may also be applied to the encoding module 104 or the decoding module 203 to implement the method of the present application.
In other words, the codec protocol used by the encoding module 104 and the decoding module 203 in the present application is not limited to the standard protocol of H264 or H265, but may be other codec protocols capable of supporting the SEI field.
In order to carry mask information in the encoded code stream, the present application may encode the original image by using the H264 or H265 standard protocol through the encoding module 104, so that an SEI field can be newly added in the encoded code stream, and the mask information is embedded in the newly added SEI field.
The preprocessing module 101 may be configured to perform a preprocessing operation on the first mask of the portrait to obtain a second mask.
The preprocessing operations may include, but are not limited to, at least one processing operation such as downsampling and compression, so as to reduce the data amount of the mask information carried in the code stream sent from the sender 100 to the receiver 200 and reduce excessive occupation of network bandwidth caused by transmitting the mask information.
For example, the preprocessing module 101 may perform downsampling processing on an original mask of a portrait first, and then perform compression processing on the mask after downsampling processing, so as to obtain a second mask of the portrait in the original image with smaller data size.
In some embodiments, to ensure that the second mask obtained after preprocessing by the preprocessing module 101 of the sender 100 can be restored by the post-processing module 202 of the receiver 200 into a third mask that is the same as, or differs little from, the original first mask, the compression manner adopted by the preprocessing module 101 may be lossless compression; in some cases, the original first mask shown in fig. 1 can then be identical to the third mask restored on the receiver 200 side.
In other embodiments, when there is a certain difference between the original first mask and the recovered third mask due to the preprocessing process of the preprocessing module 101 and the processing process of the post-processing module 202, the edge area of the portrait in the portrait image obtained after the background transparent processing by the receiving end 200 may not be smooth enough.
When in specific implementation, specific implementation algorithms of downsampling, compression processing, upsampling and decompression processing can be reasonably adopted according to the display quality requirements of application scenes and images so as to flexibly meet different application requirements.
With continued reference to fig. 1, the embedding module 102 may be configured to embed an SEI field in a first bitstream of the H264/H265 encoded by the original image, and embed a preprocessed second mask in the SEI field, to obtain a second bitstream.
The object processed by the embedding module 102 of the present application is the first code stream obtained after encoding the original image; in other words, the processing object is the encoding result of the original image, and no encoding operation needs to be performed on it. In this way, the operation of embedding the mask information into the code stream of the original image can be decoupled from the encoder (for example, the encoding module 104 described above); no adjustment needs to be made to the encoder, and any encoder existing or developed in the future can be used, so that the scheme has higher general applicability.
The processing of the embedding module 102 is described in detail below in connection with the data structures of the first and second code streams shown in fig. 2.
As shown in fig. 2 (1), a first code stream obtained after encoding an original image with H264 or H265 may include a plurality of network abstraction layer (NAL) units. Taking n NAL units as an example, the n NAL units are NAL unit 1 to NAL unit n in sequence from beginning to end, where NAL unit 1 is the first NAL unit in the first code stream and NAL unit n is the last NAL unit in the first code stream; n is greater than 1 and is not otherwise limited.
With continued reference to fig. 2 (1), the data structures of the different NAL units are identical. Taking NAL unit 1 as an example, NAL unit 1 may include a NAL header and NAL content (also referred to as NAL body content or NAL body).
The NAL header may carry various header information including the NAL type shown in fig. 2 (1), among others.
The NAL type may include an SEI type.
NAL units of SEI type may or may not already be included in the encoded first bitstream.
It should be appreciated that NAL units of SEI type are supported in the encoded bitstream using the H264 or H265 standard.
In the embodiment of the present application, the embedding module 102 may be configured to add a new NAL unit to the first code stream, where the location of the new NAL unit in the first code stream is not limited.
Wherein, in order for the location extraction module 201 of the receiving end 200 to be able to quickly locate the new NAL unit, the embedding module 102 may be configured to embed the new NAL unit of the present application before the first NAL unit (here NAL unit 1) in the first code stream, or to embed the new NAL unit of the present application at the end position of the first code stream (here after NAL unit n).
As shown in fig. 2 (2), embedding module 102 embeds a new NAL unit before NAL unit 1 in the first code stream to obtain a second code stream, wherein the new NAL unit is shown as NAL unit 0 in the second code stream.
The following describes NAL unit 0 embedded in the first code stream by the embedding module 102 in detail:
referring to fig. 2 (2), NAL unit 0 also includes a NAL header and NAL content as shown in fig. 2 (2).
In this NAL header of NAL unit 0, as shown in fig. 2 (2), a field regarding the NAL type is valued as the SEI type.
For example, in the H264 protocol, an SEI NAL unit is identified by the byte sequence 0000000106, that is, NAL unit 0 may carry 0000000106 to indicate that the NAL type of NAL unit 0 is the SEI type. Thus, the transmitting end 100 adds the SEI field to the encoded code stream of the original image.
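As a worked illustration (not part of the application), the 0000000106 identifier is simply the Annex-B start code 00 00 00 01 followed by an H.264 NAL header byte whose nal_unit_type bits equal 6 (SEI); nal_ref_idc is assumed to be 0 here:

```python
START_CODE = bytes([0x00, 0x00, 0x00, 0x01])
NAL_TYPE_SEI = 6  # nal_unit_type value for SEI in H.264

def sei_nal_header(nal_ref_idc: int = 0) -> bytes:
    # forbidden_zero_bit (1) | nal_ref_idc (2) | nal_unit_type (5)
    return bytes([(nal_ref_idc << 5) | NAL_TYPE_SEI])  # 0x06 when nal_ref_idc is 0

assert START_CODE + sei_nal_header() == bytes.fromhex("0000000106")
```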
Whereas the NAL content of NAL unit 0 may include SEI (also referred to as SEI field).
For example, as shown in fig. 2 (2), the data structure of the SEI field may include a payload type (payload type), a payload size (payload size), and a payload content (payload). In other embodiments, the data structure of the SEI field may be other structures, without limitation.
In some embodiments, several load types of the SEI field are defined in the H264 and H265 standard protocols, such as load type A, load type B, load type C, etc., and different load types express different meanings.
In some embodiments, as described above, the first code stream encoded by the H264 or H265 protocol may carry NAL units of SEI type, that is, NAL units 1 to n shown in fig. 2 may have NAL units of SEI type, and by using the above 0000000106 of the NAL header alone, NAL unit 0 newly added in the present application cannot be uniquely located.
Whereas the SEI field within NAL unit 0 embedded by embedding module 102 in the first bitstream is mainly used to carry mask information in the present application. To facilitate the fast localization of the receiving end to NAL unit 0 carrying mask information, embedding module 102 may customize a payload type (simply referred to as a custom type) to distinguish from the respective payload types defined in the standard protocol. For example, the SEI field with the payload type being the custom type is used to indicate that the payload carried is mask information.
In this way, the receiving end 200 may find NAL unit 0 carrying mask information according to the present application from a plurality of NAL units in combination with an identifier of the SEI field (for example, 0000000106 agreed by the H264 protocol) and a payload type being a custom type.
Specifically, as shown in fig. 2 (2), the embedding module 102 embeds NAL unit 0 before the first NAL unit of the first code stream. The NAL header of NAL unit 0 may include information indicating that the NAL type is the SEI type, and the NAL content of the NAL unit may carry at least three kinds of information, namely a payload type, a payload size, and a payload content, where the payload type is the custom type, the payload size is the size of the second mask, and the payload content is the second mask.
Thus, the embedding module 102 embeds the SEI field in the first code stream of the H264, H265 protocol, and carries the mask information of the preprocessed portrait in the SEI field.
The embedding module 102 of the present application may be configured to add an identifier of an SEI field (for example, 0000000106 agreed by H264 protocol) to a first code stream after encoding an original image of a frame, and embed second mask information of a portrait corresponding to the original image of the frame processed by the preprocessing module 101 as SEI content in the SEI field, and optionally set a load type of the SEI field to a custom type, so as to obtain a second code stream with the second mask embedded in the SEI field.
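The following is a minimal Python sketch of how such a mask-carrying SEI NAL unit could be assembled and prepended to an H264 code stream. It is an illustration under assumptions, not the implementation of the present application: the custom payload type value 200 is only an example, the function names are hypothetical, and the emulation-prevention bytes (0x03 insertion) required by the full standard are omitted for brevity.

```python
CUSTOM_PAYLOAD_TYPE = 200             # example custom SEI payload type
SEI_PREFIX = b"\x00\x00\x00\x01\x06"  # start code + NAL header byte 06 (SEI type)


def _ff_coded(value: int) -> bytes:
    """SEI payload type/size coding: one 0xFF byte per full 255, then the remainder."""
    out = bytearray()
    while value >= 255:
        out.append(0xFF)
        value -= 255
    out.append(value)
    return bytes(out)


def build_mask_sei(mask2: bytes) -> bytes:
    """Assemble an SEI NAL unit whose payload is the preprocessed second mask."""
    body = _ff_coded(CUSTOM_PAYLOAD_TYPE) + _ff_coded(len(mask2)) + mask2
    return SEI_PREFIX + body + b"\x80"  # 0x80 = rbsp trailing bits


def embed_mask(first_stream: bytes, mask2: bytes) -> bytes:
    """Prepend the mask-carrying SEI NAL unit to the first code stream, yielding the second code stream."""
    return build_mask_sei(mask2) + first_stream
```

Prepending the new NAL unit corresponds to the start-position embedding described above; appending `build_mask_sei(mask2)` after the last NAL unit would correspond to the end-position variant.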
With continued reference to fig. 1, the transmitting end 100 may transmit the second code stream obtained by the embedding module 102 and having the second mask embedded in the SEI field to the receiving end 200 through a network.
In this way, for each frame of image in the video stream collected by the transmitting end 100, the mask information of the portrait in that frame can be carried through an SEI field in the encoded code stream of the frame, and thereby transmitted to the receiving end 200 over the network. The transmitting end 100 may thus transmit the encoded code stream of the video stream to the receiving end 200, and the code stream transmitted to the receiving end 200 may include, for each frame of the original image in the video stream, the second code stream carrying the frame and its mask information. The encoded code stream of a frame of the original image and the mask information of its portrait are therefore transmitted to the receiving end 200 in a single code stream, with no additional synchronization required.
The transmitting end 100 of the present application can preprocess the mask of the portrait and then add the preprocessed mask to the SEI in the code stream of each frame of the original image, binding the encoded data of the original image and its portrait mask together so that they are sent to the receiving end 200 in one code stream. Compared with a scheme of sending them separately, this avoids a series of problems caused by frame loss and requires no synchronization. In addition, the mask carried in the SEI of the code stream is a preprocessed mask with a smaller data volume, which alleviates the problem of the mask occupying too much network bandwidth, has little influence on the network bandwidth, and improves the sending efficiency of the mask.
The following describes, with reference to each module of the receiving end 200 in fig. 1, how the receiving end 200 processes the second code stream, which relates to a frame of the original image and its mask information, within the encoded code stream of the video stream received from the transmitting end 100:
as shown in fig. 1, the positioning extraction module 201 in the receiving end 200 may be configured to locate the SEI field in the second code stream and to extract the mask information (here, the second mask) embedded in the SEI field.
With continued reference to fig. 2, as shown in fig. 2 (2), the positioning extraction module 201 may be configured to search, among the plurality of NAL units (here, n+1 NAL units) in the second code stream related to one frame of the original image, for NAL units whose NAL type is the SEI type, for example, specifically for NAL units containing the identifier of the SEI field (for example, 0000000106 in the H264 protocol), so as to locate the SEI field in the second code stream of each frame of the original image.
In some embodiments, as shown in fig. 2, the SEI field carrying the mask information is inserted before NAL unit 1 or after the last NAL unit n in the first code stream, so that the NAL unit carrying the mask information, being at the start position or the end position of the first code stream, is more easily located by the positioning extraction module 201, thereby improving the positioning efficiency of the mask information.
As described above, the first code stream may already include NAL units whose NAL type is the SEI type before the mask information is added. If the positioning extraction module 201 searches for NAL units only by the identifier of the SEI field, the result may include not only the NAL unit carrying the mask information (for example, NAL unit 0 shown in fig. 2 (2)) but also NAL units originally included in the first code stream that carry no mask information.
Therefore, in some embodiments, after locating the NAL units of the SEI type in the second code stream, in order to improve the positioning accuracy of the mask information, the positioning extraction module 201 may read, for each NAL unit found to be of the SEI type, the payload type field in the NAL content.
When the positioning extraction module 201 determines that the payload type is the custom type, this indicates that the corresponding NAL unit carries mask information through an SEI field and is a NAL unit whose payload needs to be extracted; when the payload type is not the custom type, this indicates that the corresponding NAL unit is not the NAL unit that needs to be located for payload extraction in the method of the present application.
In this way, the positioning extraction module 201 of the present application can, in the received second code stream related to a frame of the original image, accurately locate the NAL unit carrying the mask information, namely NAL unit 0 shown in fig. 2, by using the identifier of the SEI field together with the condition that the payload type of the SEI field is the custom type.
Further, after the positioning extraction module 201 locates, in the second code stream, the SEI field carrying the mask information (for example, NAL unit 0 shown in fig. 2) by means of the identifier of the SEI field and the custom payload type, the positioning extraction module 201 may be configured to read the payload from NAL unit 0 to extract the second mask.
For example, in the H264 scenario, after finding a NAL unit with the identifier 0000000106, the positioning extraction module 201 may read all data in the second code stream until the NAL header of the next NAL unit. For one NAL unit, the data read by the positioning extraction module 201 may include the payload type, the payload size, and the payload content shown in fig. 2. The positioning extraction module 201 may then extract the payload of the NAL unit whose payload type is the custom type, so as to obtain the second mask.
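Below is a minimal Python sketch of this locating and extraction logic, complementary to the embedding sketch above. It assumes the same simplified SEI layout (4-byte start codes only, no emulation-prevention bytes, a single SEI message per unit) and also removes the located NAL unit, corresponding to the optional deletion described next; the payload type value 200 remains an example.

```python
START_CODE = b"\x00\x00\x00\x01"
CUSTOM_PAYLOAD_TYPE = 200  # same example custom type as on the transmitting side


def _read_ff_coded(data: bytes, pos: int):
    """Read an ff-coded SEI value (payload type or size); return (value, next position)."""
    value = 0
    while data[pos] == 0xFF:
        value += 255
        pos += 1
    return value + data[pos], pos + 1


def extract_mask(second_stream: bytes):
    """Locate the custom-type SEI NAL unit; return (mask2, stream with that unit removed)."""
    pos = 0
    while True:
        start = second_stream.find(START_CODE, pos)
        if start < 0:
            return None, second_stream                    # no mask-carrying SEI found
        header = start + len(START_CODE)
        nxt = second_stream.find(START_CODE, header)
        end = nxt if nxt >= 0 else len(second_stream)
        if (second_stream[header] & 0x1F) == 6:           # nal_unit_type 6 = SEI
            ptype, p = _read_ff_coded(second_stream, header + 1)
            if ptype == CUSTOM_PAYLOAD_TYPE:
                size, p = _read_ff_coded(second_stream, p)
                mask2 = second_stream[p:p + size]
                # delete the located NAL unit to recover the first code stream
                return mask2, second_stream[:start] + second_stream[end:]
        pos = end
```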
Optionally, the positioning extraction module 201 may delete, from the second code stream, the NAL unit from which the second mask was extracted. In the example of fig. 2, after the second code stream is input to the positioning extraction module 201, the positioning extraction module 201 may delete NAL unit 0, from which the mask information was extracted, so as to obtain the first code stream shown in fig. 2 (1).
With continued reference to fig. 1, the post-processing module 202 may be configured to perform post-processing operations on the extracted second mask.
The post-processing operation corresponds to the preprocessing operation performed by the preprocessing module 101 described above.
For example, in this embodiment, the post-processing module 202 may decompress the second mask and then upsample the decompressed mask to obtain a third mask of the portrait in the original image, where the resolution of the third mask obtained after post-processing may match the resolution of the original image, so as to facilitate portrait segmentation and background processing.
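A minimal sketch of this post-processing follows, assuming the second mask is a zlib-compressed, downsampled binary mask and that simple nearest-neighbour upsampling is acceptable; the resolutions are illustrative (e.g. 90p back to 360p) and would in practice come from whatever preprocessing was actually applied.

```python
import zlib
import numpy as np


def postprocess_mask(mask2: bytes, low_res=(90, 160), full_res=(360, 640)) -> np.ndarray:
    """Reverse the transmitting end's preprocessing: zlib-decompress, then nearest-neighbour upsample."""
    small = np.frombuffer(zlib.decompress(mask2), dtype=np.uint8).reshape(low_res)
    ry, rx = full_res[0] // low_res[0], full_res[1] // low_res[1]
    third_mask = np.repeat(np.repeat(small, ry, axis=0), rx, axis=1)
    return third_mask  # resolution now matches the original image
```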
In addition, the present application is not limited to specific implementation manners of compressing, decompressing, downsampling, upsampling the mask, and any implementation manner in the prior art may be adopted, or any implementation manner developed in the future may be adopted.
With continued reference to fig. 1, the decoding module 203 may be configured to decode the first code stream shown in fig. 2 (1), which is also referred to as the first code stream of H264/H265, to obtain a frame of original image that is the same as the original image in the transmitting end 100.
In the embodiment of the present application, the processing object of the decoding module 203 is the encoded code stream of the original image, from which the mask information has already been extracted. The decoding operation of the decoding module 203 therefore does not need to be modified to cooperate with the mask embedding and mask extraction of the present application, so that the method, the transmitting end 100, and the receiving end 200 of the present application can be adapted to any encoder and decoder supporting the H264 and H265 standards, whether existing or developed in the future, and can be completely decoupled from the encoder and the decoder.
Optionally, with continued reference to fig. 1, the background transparency processing module 204 in the receiving end 200 may process the decoded original image based on the third mask, so as to make the background area other than the portrait in the original image transparent.
For example, the background transparency processing module 204 may identify, based on the third mask, the pixel positions of the portrait region of the original image (for example, the pixel positions whose value in the third mask is 1) and the pixel positions of the background region in the original image (for example, the pixel positions whose value in the third mask is 0). The background transparency processing module 204 may then set the pixels of the original image at positions where the third mask is 0 to be fully transparent (for example, 100% transparency), and set the pixels at positions where the third mask is 1 to be opaque (for example, 0% transparency).
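This per-pixel assignment can be expressed, for instance, by writing the mask into an alpha channel. A minimal sketch, assuming the decoded original image is an H x W x 3 uint8 array and the third mask is an H x W array of 0s and 1s:

```python
import numpy as np


def transparent_background(original: np.ndarray, third_mask: np.ndarray) -> np.ndarray:
    """Return an RGBA image in which background pixels (mask == 0) are fully transparent."""
    alpha = np.where(third_mask == 1, 255, 0).astype(np.uint8)  # opaque portrait, transparent background
    return np.dstack([original, alpha])                          # append an alpha channel -> H x W x 4
```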
In other embodiments, the image parameter that is set is not limited to transparency and may be, for example, a color; it may be flexibly set according to the application scene, which is not limited here.
In the embodiment of the present application, the transmitting end 100 may perform image segmentation on the original image to obtain the mask, while the receiving end 200 only needs to extract the mask from the code stream carrying it and use the extracted mask for operations such as background processing, without performing image segmentation itself. Compared with the scheme in related art 2 in which both the transmitting end and the receiving end perform image segmentation, the present application can save resources.
The method of the present application is described below with reference to specific examples, based on the description given above in conjunction with fig. 1 and fig. 2.
Example 1
In this example 1, the system may include clients 1 and 2 that perform video interactions, and a server that performs video data forwarding.
Client 1 may be one example of the transmitting end 100 shown in fig. 1, and client 2 may be one example of the receiving end 200 shown in fig. 1.
The server may be a server in a cloud management platform.
The cloud management platform is used for managing an infrastructure for providing cloud services. The infrastructure comprises a plurality of cloud data centers arranged in a plurality of areas, and at least one cloud data center is arranged in each area. A cloud data center may include a compute node, and the transmitting end and/or the receiving end may be virtual instances (e.g., virtual machines or containers) on the compute node.
As shown in fig. 3a, the process may include the steps of:
S101, the client 1 performs portrait segmentation on an original image in the collected video stream to generate mask1 information of the image.
For example, the application scene is a video conference scene, and the video stream is a video stream collected by a front camera of the electronic device to which the client 1 belongs.
Optionally, when the client 1 side displays the video frame of the video stream, the mask1 after the portrait segmentation may be utilized to process the original image in the video stream, so as to obtain a video stream with transparent background outside the portrait for rendering and displaying.
S102, the client 1 performs preprocessing such as downsampling and compression on mask1 information in sequence.
For example, the client 1 downsamples mask1 from the original 360p to 90p, and applies zlib (a function library providing data compression) to compress the downsampled mask, so as to further reduce the mask data volume and the bandwidth occupation.
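A minimal sketch of this preprocessing, assuming mask1 is a binary numpy array at the original resolution; a simple stride-based decimation stands in for whatever downsampling the client actually uses, and the factor of 4 corresponds to the illustrative 360p-to-90p reduction:

```python
import zlib
import numpy as np


def preprocess_mask(mask1: np.ndarray, factor: int = 4) -> bytes:
    """Downsample the full-resolution mask (e.g. 360p -> 90p) and zlib-compress it into mask2."""
    mask2_small = mask1[::factor, ::factor].astype(np.uint8)  # stride-based decimation
    return zlib.compress(mask2_small.tobytes())
```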
S103, the client side 1 embeds the preprocessed mask2 information into the coded code stream of the original image.
The encoded original image is an H264 or H265 code stream; H264 is taken as an example here (H265 is similar). The H264 code stream is actually composed of a series of NAL unit syntax structures, in which the SEI field has the identifier 0000000106.
Client 1 may add an SEI field at a specific position (e.g., a start position or an end position) in the H264 code stream of each frame of the original image; the payload type of the added SEI field may be set to a custom type (e.g., 200), and the payload content of the SEI is set to the preprocessed mask2.
S104, the client 1 sends the code stream carrying the mask2 information to the server, and the server transparently forwards the code stream to the client 2.
S201, the client 2 extracts mask2 information from the code stream received from the server.
For the code stream of each frame of the original image from which the portrait mask is to be extracted, the client 2 may search the NAL units of that code stream for the identifier of the SEI field (for H264, 0000000106), so as to locate the positions of SEI fields in the code stream related to each frame of the original image. A located SEI field must also satisfy the condition that its payload type is the custom type (for example, 200). After locating the required SEI field, the client 2 may extract it, for example, in the H264 scenario, by reading all data after the identifier 0000000106 until the NAL header of the next NAL unit, so as to obtain the mask2 information.
S202, the client side 2 decompresses and upsamples the mask2 to obtain mask1 information.
The client 2 may decompress the extracted mask2 (for example, using zlib), and upsample the decompressed mask (e.g., from 90p to 360p) to obtain the mask1 information.
S203, the client 2 deletes the mask2 information in the code stream to obtain the encoded code stream of the original image.
For example, the client 2 may delete the NAL unit in which the mask2 information was located from the code stream received from the server, so as to obtain the encoded code stream of the original image.
S204, the client side 2 decodes the code stream after the original image encoding to obtain the original image.
S205, the client 2 performs transparent processing on the background area other than the portrait in the original image based on the mask1 information.
Optionally, the client 2 may send the original image after the background is transparent to the renderer for rendering, so that the electronic device to which the client 2 belongs displays only the portrait on the video screen from the client 1, and the background is transparent.
In the embodiment of the present application, the client 1 may embed the mask obtained by portrait segmentation of the original image, as video auxiliary enhancement information, in the SEI field of the code stream of each frame of the original image, so that the mask is sent to the client 2 along with the data of each frame of the original image. The client 2 can extract the mask from the received code stream and use it to make the background portion of the decoded original image transparent, which enhances the sense of immersion in video communication. The method has wide applicability, does not depend on the encoder or decoder, does not require synchronizing the code stream of the original image with the portrait mask of the original image, and is simple, efficient, and flexible, with low implementation complexity.
In the above example 1, the client 1 may be used as the transmitting end 100 shown in fig. 1, the client 2 may be used as the receiving end 200 shown in fig. 1, and the implementation details and principles of the relevant steps in example 1 are the same as those of the corresponding modules in the transmitting end 100 and the receiving end 200 in fig. 1, which are not described herein again, and reference is made to the description of fig. 1.
In this example 1, the transmitting end 100 may perform portrait segmentation on the original image in the video stream, but the receiving end 200 may directly use the mask after portrait segmentation of the transmitting end 100 without performing repeated portrait segmentation, so that the waste of resources may be reduced.
In another example, some or all of S201 to S205 shown in fig. 3a may be performed by the server. In that case, the server does not forward the code stream carrying the mask2 information to the client 2; instead, it sends the client 2 the data obtained after processing that code stream (for example, a code stream in which the background area outside the portrait has been made transparent). Thus, when the performance of the client 2 is poor or its configuration is low, processing such as mask extraction and background transparency can be performed by the server to reduce the performance impact on the client 2.
In this example, the client 1 may then serve as the transmitting end 100 shown in fig. 1, and the server may serve as the receiving end 200 shown in fig. 1.
Example 2
In this example 2, the system may include clients 1 and 2 that perform video interactions, and a server that performs video data forwarding.
For the description of the client 1, the client 2, and the server, reference may be made to example 1, and details thereof are not repeated here.
As shown in fig. 3b, the process may include the steps of:
S301, the client 1 collects a video stream.
For example, the video stream is a video captured by a front-facing camera of the client 1, or a video captured in another manner, which is not limited herein.
S302, the client 1 sends the video stream to the server.
Optionally, when the client 1 sends the video stream to the server, encoding processing may be performed to reduce the data transmission rate and reduce the bandwidth occupation. The specific encoding method of the encoding process is not limited, and may be implemented by any encoding method in the prior art or developed in the future, which is not limited herein.
S101, the server performs portrait segmentation on an original image in the acquired video stream to generate mask1 information of the image.
The server may use the video stream received from the client 1 as the acquired video stream described here.
Alternatively, the server may decode the encoded video stream from the client 1 to obtain the video stream collected here (e.g., the same video stream as collected by the camera of the client 1).
The server may perform portrait segmentation to identify the portrait region in each frame of image (the original image described herein) of the acquired video stream, so as to generate mask1 information whose resolution matches the original image.
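As an illustration only, this mask generation step could look like the following Python sketch, where segment_portrait is a placeholder for whatever portrait-segmentation model the server actually runs (assumed here to return a per-pixel portrait probability map at the frame's resolution):

```python
import numpy as np


def generate_mask1(original: np.ndarray, segment_portrait) -> np.ndarray:
    """Produce mask1 for one frame: 1 at portrait pixels, 0 elsewhere, at the frame's resolution."""
    prob = segment_portrait(original)        # assumed H x W probability map in [0, 1]
    mask1 = (prob >= 0.5).astype(np.uint8)   # threshold to a binary portrait mask
    return mask1
```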
S102, the server sequentially performs preprocessing such as downsampling and compression on the mask1 information.
S103, the server embeds the preprocessed mask2 information into the encoded code stream of the original image.
The implementation details and principles of S101 to S103 in example 2 are the same as those of S101 to S103 in example 1, except that in example 2 they are executed by the server.
Considering that portrait segmentation consumes device performance, in this example 2, when the device performance of the client 1 is poor or its configuration is low, the portrait segmentation of S101 as well as the mask preprocessing and mask embedding may be performed by the server, so as to reduce the influence of portrait segmentation on the client.
After S103, the server may execute S104a and, optionally, S104b. The present application does not limit the execution order of S104a and S104b; they may also be executed concurrently.
S104a, the server sends the code stream carrying the mask2 information to the client 2.
After S104a, the client 2 may execute S201 to S205. For the details and implementation of S201 to S205, reference may be made to S201 to S205 in example 1; S201 to S205 executed by the client 2 are the same in both examples and are not described here again.
After S103, the server may optionally further perform S104b.
S104b, the server sends the code stream carrying the mask2 information to the client 1.
In this embodiment, the performance and configuration of the client 1 are low, and the performance impact that would be caused by the client 1 performing portrait segmentation on the original image in the video stream of S301 is to be avoided. Therefore, even though the video stream in S301 is a local video stream collected by the client 1, the client 1 does not perform the portrait segmentation itself; instead, the server performs the portrait segmentation and sends the code stream carrying the mask2 information back to the client 1. In this way, when the client 1 displays the locally collected video stream, it does not need to perform portrait segmentation; it only needs to extract the mask from the code stream, returned by the server, to which the mask information has been added, and use the extracted mask to perform background transparency processing on the original image in the locally collected video stream.
In the above example 2, the server may be the transmitting end 100 shown in fig. 1, the client 2 may be the receiving end 200 shown in fig. 1, and the implementation details and principles of the relevant steps in example 2 are the same as those of the corresponding modules in the transmitting end 100 and the receiving end 200 in fig. 1, which are not repeated herein, and reference may be made to the description of fig. 1.
In one possible embodiment, the present application provides a processing apparatus. The processing device comprises: the first acquisition module is used for acquiring a first code stream, wherein the first code stream is a code stream obtained by encoding a target image; the processing module is used for adding the SEI of the target image in the first code stream to obtain a second code stream; wherein the SEI includes location information for indicating a location of a region of interest in the target image; and the sending module is used for sending the second code stream.
In one possible embodiment, the apparatus further comprises: the second acquisition module is used for acquiring the position information; the preprocessing module is used for performing first operation on the position information, wherein the data volume of the position information after the first operation is smaller than the data volume of the position information before the first operation; the SEI includes position information after the first operation.
In one possible implementation, the SEI further includes information indicating that the load type is a custom type.
In a possible implementation manner, the processing module is specifically configured to: and adding a network abstraction layer unit into the first code stream, wherein the type of the network abstraction layer unit is SEI type, and the main content of the network abstraction layer unit is SEI.
In a possible implementation manner, the processing module is specifically configured to add the SEI at a start position or an end position of the first code stream.
In one possible implementation manner, the first code stream is a code stream obtained by encoding the target image by using an H264 or H265 standard protocol.
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In a possible implementation manner, the cloud data center comprises the processing device, and the cloud management platform is used for managing an infrastructure for providing cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of areas, and at least one cloud data center is arranged in each area.
The effects of the processing apparatus of each of the above embodiments are similar to those of the processing method applied to the transmitting end of each of the above embodiments, and will not be described here again.
In one possible embodiment, the present application provides a processing apparatus. The processing device comprises: the receiving module is used for receiving a second code stream, wherein the second code stream is the code stream of the target image after being coded, and the second code stream carries SEI of the target image; a positioning module for determining the SEI in the second bitstream; an extraction module for extracting location information in the SEI, the location information being used to indicate a location of a region of interest in the target image.
In one possible embodiment, the apparatus further comprises: and the post-processing module is used for carrying out a second operation on the position information, and the data volume of the position information after the second operation is larger than the data volume of the position information in the SEI.
In a possible implementation manner, the positioning module is specifically configured to determine that the load type in the second code stream is a custom type SEI.
In a possible implementation manner, the positioning module is specifically configured to determine, in the second code stream, a network abstraction layer unit with a type of SEI type.
In a possible implementation manner, the extracting module is specifically configured to extract location information from the body content of the network abstraction layer unit.
In one possible implementation, the SEI is located at a start position or an end position of the first bitstream.
In one possible embodiment, the apparatus further comprises: the acquisition module is used for acquiring the target image obtained by decoding the second code stream; and the first processing module is used for determining a target position of the region of interest in the target image based on the position information, and for processing the region of interest and the region outside the region of interest in the target image in different manners based on the target position.
In a possible implementation manner, the acquiring module is specifically configured to delete the SEI in the second code stream, and to decode the second code stream after the SEI is deleted by using an H264 or H265 standard protocol, so as to acquire the target image.
In one possible implementation, the location information is mask information indicating a portrait area in the target image.
In a possible implementation manner, the cloud data center comprises the processing device, and the cloud management platform is used for managing an infrastructure for providing cloud services, wherein the infrastructure comprises a plurality of cloud data centers arranged in a plurality of areas, and at least one cloud data center is arranged in each area.
The effects of the processing apparatus of the present embodiment are similar to those of the processing method applied to the receiving end of each of the above embodiments, and will not be described here again.
An apparatus provided by an embodiment of the present application is described below. As shown in fig. 4:
Fig. 4 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus 500 may include: a processor 501, a transceiver 505, and, optionally, a memory 502.
The transceiver 505 may be referred to as a transceiver unit, a transceiver circuit, etc., and is used for implementing a transceiving function. The transceiver 505 may include a receiver and a transmitter; the receiver may be referred to as a receiver or a receiving circuit, etc., for implementing a receiving function, and the transmitter may be referred to as a transmitter or a transmitting circuit, etc., for implementing a transmitting function.
The memory 502 may store a computer program or software code or instructions 504, which computer program or software code or instructions 504 may also be referred to as firmware. The processor 501 may implement the processing method provided by the embodiments of the present application by running a computer program or software code or instructions 503 therein, or by calling a computer program or software code or instructions 504 stored in the memory 502. The processor 501 may be a central processing unit (central processing unit, CPU), and the memory 502 may be, for example, a read-only memory (ROM), or a random access memory (random access memory, RAM).
The processor 501 and transceiver 505 described in the present application may be implemented on an integrated circuit (integrated circuit, IC), analog IC, radio frequency integrated circuit RFIC, mixed signal IC, application specific integrated circuit (application specific integrated circuit, ASIC), printed circuit board (printed circuit board, PCB), electronic device, etc.
The apparatus 500 may further include an antenna 506, and the modules included in the apparatus 500 are merely illustrative, and the present application is not limited thereto.
The structure of the processing apparatus is not limited to that shown in fig. 4. The processing apparatus may be a stand-alone device or may be part of a larger device. For example, the processing apparatus may be implemented in the form of:
(1) A stand-alone integrated circuit IC, or chip, or a system-on-a-chip or subsystem; (2) A set of one or more ICs, optionally including storage means for storing data, instructions; (3) modules that may be embedded within other devices; (4) an in-vehicle apparatus, etc.; (5) others, and so forth.
For the case where the processing means is implemented in the form of a chip or chip system, reference is made to the schematic diagram of the chip shown in fig. 5. The chip shown in fig. 5 includes a processor 601 and an interface 602. Wherein the number of processors 601 may be one or more, and the number of interfaces 602 may be a plurality. Alternatively, the chip or system of chips may include a memory 603.
For all relevant content of the steps involved in the above method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here.
Based on the same technical idea, the embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program containing at least one piece of code executable by a computer to control the computer to implement the above-mentioned method embodiments.
Based on the same technical idea, the embodiments of the present application also provide a computer program for implementing the above-mentioned method embodiments when the computer program is executed.
The program may be stored in whole or in part on a storage medium that is packaged with the processor, or in part or in whole on a memory that is not packaged with the processor.
Based on the same technical conception, the embodiment of the application also provides a chip comprising a processor. The processor may implement the method embodiments described above.
The steps of a method or algorithm described in connection with the present disclosure may be embodied in hardware, or may be embodied in software instructions executed by a processor. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (25)

1. A method of processing, the method comprising:
acquiring a first code stream, wherein the first code stream is a code stream of a target image after coding;
adding media supplementary enhancement information SEI of the target image in the first code stream to obtain a second code stream;
wherein the SEI includes location information for indicating a location of a region of interest in the target image;
and sending the second code stream.
2. The method according to claim 1, wherein the method further comprises:
acquiring the position information;
performing a first operation on the position information, wherein the data size of the position information after the first operation is smaller than the data size of the position information before the first operation;
the SEI includes position information after the first operation.
3. The method according to claim 1 or 2, characterized in that the SEI further comprises information indicating that the load type is a custom type.
4. The method according to any one of claims 1 to 3, characterized in that the adding the SEI of the target image in the first code stream comprises:
and adding a network abstraction layer unit into the first code stream, wherein the type of the network abstraction layer unit is SEI type, and the main content of the network abstraction layer unit is SEI.
5. The method according to any one of claims 1 to 4, further comprising:
the SEI is added at a start position or an end position of the first code stream.
6. The method according to any one of claims 1 to 5, wherein the first code stream is a code stream obtained by encoding the target image using an H264 or H265 standard protocol.
7. The method according to any one of claims 1 to 6, wherein the position information is mask information indicating a portrait area in the target image.
8. The method according to any one of claims 1 to 7, wherein the method is applied to a cloud data center, and a cloud management platform is used for managing an infrastructure for providing cloud services, the infrastructure comprising a plurality of cloud data centers arranged in a plurality of areas, each area being provided with at least one cloud data center.
9. A method of processing, the method comprising:
receiving a second code stream, wherein the second code stream is a code stream of a target image after being coded, and the second code stream carries media supplementary enhancement information SEI of the target image;
determining the SEI in the second code stream;
extracting position information in the SEI, the position information being used to indicate a position of a region of interest in the target image.
10. The method of claim 9, wherein after extracting the position information in the SEI, the method further comprises:
and performing a second operation on the position information, wherein the data volume of the position information after the second operation is larger than the data volume of the position information in the SEI.
11. The method according to claim 9 or 10, characterized in that said determining the SEI in the second bitstream comprises:
and determining the load type in the second code stream as SEI of a custom type.
12. The method according to any one of claims 9 to 11, characterized in that said determining the SEI in the second bitstream comprises:
and determining a network abstraction layer unit with a type of SEI type in the second code stream.
13. The method according to claim 12, characterized in that said extracting the position information in the SEI comprises:
extracting location information from the body content of the network abstraction layer unit.
14. The method according to any one of claims 9 to 13, characterized in that the SEI is located at a start position or an end position of the first bitstream.
15. The method according to any one of claims 9 to 14, further comprising:
acquiring the target image obtained by decoding the second code stream;
determining a target position of the region of interest in the target image based on the position information;
and processing the region of interest and the region outside the region of interest in the target image in different ways based on the target position.
16. The method of claim 15, wherein the obtaining the target image resulting from decoding the second code stream comprises:
deleting the SEI in the second code stream;
and decoding the second code stream after deleting the SEI by adopting an H264 or H265 standard protocol to acquire the target image.
17. The method according to any one of claims 9 to 16, wherein the position information is mask information indicating a portrait area in the target image.
18. The method according to any one of claims 9 to 17, wherein the method is applied to a cloud data center, and wherein a cloud management platform is used for managing an infrastructure for providing cloud services, the infrastructure comprising a plurality of cloud data centers arranged in a plurality of areas, each area being provided with at least one cloud data center.
19. A processing apparatus, the processing apparatus comprising:
the first acquisition module is used for acquiring a first code stream, wherein the first code stream is a code stream obtained by encoding a target image;
the processing module is used for adding media supplementary enhancement information SEI of the target image in the first code stream to obtain a second code stream;
wherein the SEI includes location information for indicating a location of a region of interest in the target image;
and the sending module is used for sending the second code stream.
20. The apparatus of claim 19, wherein a cloud data center comprises the processing means, and wherein a cloud management platform is configured to manage an infrastructure for providing cloud services, the infrastructure comprising a plurality of cloud data centers disposed in a plurality of areas, each area being provided with at least one cloud data center.
21. A processing apparatus, the processing apparatus comprising:
the receiving module is used for receiving a second code stream, wherein the second code stream is the code stream of the target image after being coded, and the second code stream carries media supplementary enhancement information SEI of the target image;
a positioning module for determining the SEI in the second bitstream;
an extraction module for extracting location information in the SEI, the location information being used to indicate a location of a region of interest in the target image.
22. The apparatus of claim 21, wherein a cloud data center comprises the processing means, and wherein a cloud management platform is configured to manage an infrastructure for providing cloud services, the infrastructure comprising a plurality of cloud data centers disposed in a plurality of areas, each area being provided with at least one cloud data center.
23. A computer readable storage medium comprising a computer program which, when run on a computer or processor, causes the computer or processor to perform the method of any one of claims 1 to 8 or to perform the method of any one of claims 9 to 18.
24. A processing device comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from the memory and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the processor, when executing the computer instructions, is adapted to perform the method of any one of claims 1 to 8 or to perform the method of any one of claims 9 to 18.
25. A computer program product comprising a software program which, when executed by a computer or processor, causes the method of any one of claims 1 to 8, or the steps of the method of any one of claims 9 to 18, to be performed.