CN115665361A - Video fusion method in virtual environment and online video conference communication method - Google Patents

Video fusion method in virtual environment and online video conference communication method

Info

Publication number
CN115665361A
Authority
CN
China
Prior art keywords
target object
virtual environment
video
information
environment image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211106196.3A
Other languages
Chinese (zh)
Inventor
李彬哲
王钊
叶琰
王诗淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211106196.3A priority Critical patent/CN115665361A/en
Publication of CN115665361A publication Critical patent/CN115665361A/en
Pending legal-status Critical Current

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video fusion method in a virtual environment and an online video conference communication method, which relate to the technical field of video processing. The method comprises the following steps: acquiring a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects; for any target object among the plurality of target objects, determining target mask information of the target object in the video to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image; and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the plurality of target objects. In this embodiment, video fusion is performed based on the depth mask information, so that the occlusion relationship between the target object and the virtual environment can be accurately expressed, making the obtained fused video more realistic and better presented.

Description

Video fusion method in virtual environment and online video conference communication method
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video fusion method and an online video conference communication method in a virtual environment.
Background
Affected by the epidemic, many people choose to work or study at home. With this change in working environment, communication has shifted from offline to online, which has led to a significant increase in the demand for online conferencing. At the same time, however, online conferencing affects communication efficiency and immersion. In an in-person meeting, people gather in the same environment and everyone can be seen at a glance. In a video conference, by contrast, participants are in different places and each person can only be watched one at a time. As a result, online meetings lack a sense of togetherness and atmosphere, which reduces communication efficiency.
With the popularization of online conferencing, image fusion based on a virtual background has been widely applied. Fusing the videos of multiple meeting participants into the same virtual scene presents the effect of several people meeting in one virtual environment and can enhance the atmosphere of the meeting. However, current video fusion methods simply extract the portrait part of each input video and place it in the same scene, so they lack adaptability to different input videos, and the visual effect presented after fusion is poor.
Disclosure of Invention
The embodiment of the application provides a video fusion method and an online video conference communication method in a virtual environment, so as to improve the presentation effect after video fusion.
In a first aspect, an embodiment of the present application provides a method for video fusion in a virtual environment, including:
acquiring a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects;
for any target object among the plurality of target objects, determining target mask information of the target object in the video to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing a plurality of target objects.
In a second aspect, an embodiment of the present application provides a video fusion apparatus in a virtual environment, including:
an acquisition module, configured to acquire a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects;
a determining module, configured to determine, for any target object among the plurality of target objects, target mask information of the target object in the video to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
and a fusion module, configured to fuse the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the plurality of target objects.
In a third aspect, an embodiment of the present application provides an online video conference communication method, including:
receiving a first video stream transmitted by first user equipment, and acquiring a first target object from the first video stream;
receiving a virtual environment image;
receiving a second video stream transmitted by second user equipment, and acquiring a second target object from the second video stream;
for any one of the first target object and the second target object, determining target mask information of the target object in the video stream to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
and fusing the first video stream, the second video stream and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video stream containing the first target object and the second target object.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor implements the method of any one of the above when executing the computer program.
In a fifth aspect, the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the method provided in any embodiment of the present application.
Compared with the prior art, the method has the following advantages:
the embodiment of the application provides a video fusion method and an online video conference communication method in a virtual environment, which are used for acquiring a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects; aiming at any target object in a plurality of target objects, determining target mask information of the target object in a video, position information of the target object in a virtual environment image and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relation between the target object and the virtual environment image; and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing a plurality of target objects. In the embodiment, video fusion is performed based on the depth mask information, so that the shielding relation between the target object and the virtual environment information can be accurately expressed, the obtained fusion video is more real, and the presentation effect is better.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly, and that the above and other objects, features and advantages of the present application may become more readily apparent, the detailed description of the present application is given below.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the application and are not to be considered limiting of its scope.
Fig. 1 is a scene schematic diagram of a video fusion method in a virtual environment provided in the present application;
FIG. 2 is a flowchart of a video fusion method in a virtual environment according to an embodiment of the present application;
fig. 3 is a schematic diagram of a video fusion method in the related art;
fig. 4 is a schematic diagram of a video fusion method in a virtual environment according to an embodiment of the present application;
fig. 5 is a flowchart of an online videoconference communication method according to an embodiment of the present application;
fig. 6 is a block diagram illustrating a video fusion apparatus in a virtual environment according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an online video conference communication apparatus according to an embodiment of the present application; and
FIG. 8 is a block diagram of an electronic device used to implement embodiments of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
To facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related arts of the embodiments of the present application. The following related arts as alternatives can be arbitrarily combined with the technical solutions of the embodiments of the present application, and all of them belong to the scope of the embodiments of the present application.
Fig. 1 is a schematic diagram of an exemplary application scenario for implementing the method of the embodiment of the present application. As shown in Fig. 1, a plurality of videos containing target objects (a video of target object 1, a video of target object 2, and a video of target object 3) and a virtual scene image are fused to obtain a fused video. A target object includes, but is not limited to, a person, an animal, or any other object. The virtual scene image may also be a video of the virtual scene. The embodiment of the application can be applied to a video conference: a server or user equipment acquires a plurality of videos containing portraits and a virtual environment image; determines target mask information of each portrait in its video, position information of the portrait in the virtual environment image, and depth mask information of the portrait in the virtual environment image; and fuses the plurality of videos containing portraits with the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video. The fused video presents pictures of multiple participants in the same virtual scene and enhances the atmosphere of a multi-person meeting.
An embodiment of the present application provides a video fusion method in a virtual environment, and fig. 2 is a flowchart of the video fusion method in a virtual environment according to the embodiment of the present application, where an execution subject may be a computing device, for example, a server or a user terminal. The method comprises the following steps:
step S201, acquiring a virtual environment image and a plurality of videos, where the plurality of videos includes a plurality of target objects.
In this embodiment, a server is used as the execution subject. A target object includes, but is not limited to, a person, an animal, or any other object. The virtual environment image may be any image or a video of a virtual scene. The virtual environment image may be acquired by receiving an image sent by user equipment or by acquiring an image from a preset database. The virtual environment image may be a two-dimensional image or a three-dimensional image obtained by virtualizing an environment in the real world, for example, a three-dimensional virtual meeting room image obtained by virtualizing a meeting room environment. When the server receives a video fusion request, video fusion is triggered, and videos sent by a plurality of user terminals are received, or a plurality of videos and the virtual environment image are acquired from a preset database, wherein each video contains at least one target object.
Step S202, for any target object among the plurality of target objects, determining target mask information of the target object in the video to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image.
The target mask information may be obtained from each frame image of the input video by a target object matting method. For example, each video frame of the input video may be matted by a Matting Objective Decomposition Network (MODNet). MODNet is a lightweight decomposition network, and using it to obtain the target mask information can increase the image processing speed. The position information of the target object in the virtual environment image may be determined according to the position information, in the virtual environment image, of each video frame of the video containing the target object, and this position information may be configured in advance. In practical application, the video containing the target object can be used as the foreground and the virtual environment image as the background, and the depth mask information then represents the occlusion relationship between the foreground and the background.
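As a minimal illustrative sketch (not part of the original disclosure), the per-frame target mask can be obtained by treating any portrait matting network, for example a MODNet-style model, as a black-box callable; the `matting_model` name and its input/output convention below are assumptions made only for illustration:

```python
import numpy as np

def extract_target_masks(frames, matting_model):
    """Obtain target mask information for each frame of one input video.

    frames        : list of HxWx3 uint8 RGB frames
    matting_model : assumed black-box callable mapping a normalized RGB frame
                    to an HxW alpha matte in [0, 1] (e.g. a MODNet-style network)
    """
    masks = []
    for frame in frames:
        alpha = matting_model(frame.astype(np.float32) / 255.0)
        masks.append(np.clip(alpha, 0.0, 1.0))  # target mask for this frame
    return masks
```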
Step S203, fusing the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing a plurality of target objects.
Affine transformation is performed on the target mask information and the depth mask information using the position information. Here, an affine transformation is a linear transformation in a vector space followed by a translation, mapping one vector space to another.
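A minimal sketch of this step, assuming OpenCV is available and that the placement reduces to a uniform scale plus a translation (both the helper name and this simplification are assumptions, not the exact transform of the disclosure):

```python
import numpy as np
import cv2

def warp_mask_to_background(mask, px, py, scale, bg_shape):
    """Apply a scale-plus-translation affine transform to place a mask
    (target mask or depth mask) in virtual-environment coordinates.

    mask     : HxW float mask in video-frame coordinates
    px, py   : top-left position of the video inside the background
    scale    : assumed uniform scale factor derived from the size adjustment
    bg_shape : (height, width) of the virtual environment image
    """
    affine = np.float32([[scale, 0.0, px],
                         [0.0, scale, py]])  # linear part followed by translation
    bg_h, bg_w = bg_shape
    return cv2.warpAffine(mask, affine, (bg_w, bg_h))
```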
Illustratively, video fusion is performed by formula (1) (the formula itself is reproduced only as an image in the original publication), wherein I_f,m represents the pixel values in each frame image of the fused video, the target mask term represents the target mask information in each frame image of a video containing a target object, I_in,m,v represents the pixel values in each frame image of a video containing a target object, I_bg represents the pixel values in the virtual environment image, and I_mask,n represents the depth mask information, with m ∈ M, v ∈ V, n ∈ N, where M denotes the number of frames of a video containing a target object, V denotes the number of videos containing a target object, and N denotes the maximum number of human figures.
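Since formula (1) is only available as an image, the following compositing sketch is an assumption-level illustration of the fusion it describes (standard alpha compositing gated by the depth mask), not the exact patented expression:

```python
import numpy as np

def fuse_frame(background, placed_objects):
    """Composite several placed target objects into the virtual environment image.

    background     : HxWx3 float virtual environment image, values in [0, 1]
    placed_objects : list of (frame, target_mask, depth_mask) tuples, all already
                     warped/resized to background coordinates:
                       frame       HxWx3 float pixel values of the target video
                       target_mask HxW   float alpha matte of the target object
                       depth_mask  HxW   {0, 1} mask, 1 where the target object
                                   may appear in front of the background
    Returns the fused HxWx3 frame I_f,m.
    """
    fused = background.copy()
    for frame, target_mask, depth_mask in placed_objects:
        alpha = (target_mask * depth_mask)[..., None]   # effective per-pixel opacity
        fused = alpha * frame + (1.0 - alpha) * fused   # alpha compositing
    return fused
```

Repeating this compositing for every frame m ∈ M over all v ∈ V videos yields the fused video of step S203.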
The embodiment of the application provides a video fusion method in a virtual environment: acquiring a plurality of videos containing target objects and a virtual environment image; determining target mask information of each target object in its video, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image; and fusing the videos containing the target objects with the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the target objects. In this embodiment, video fusion is performed based on the depth mask information, so that the occlusion relationship between the target object and the virtual environment can be accurately expressed, making the obtained fused video more realistic and better presented.
The depth mask information in step S202 is determined in the following manner:
in one possible implementation, determining depth mask information of a target object in a virtual environment image includes: acquiring first depth information of a target object in a video and second depth information of a virtual environment image; and determining depth mask information according to the comparison result of the first depth information and the second depth information.
In practical application, the video is used as the foreground and the virtual environment image as the background. In order to synthesize the image more accurately, it is necessary to know the occlusion relationship between the foreground and the background. Depth information of the foreground and of the background is obtained separately using a depth estimation model. Depth mask information is then calculated, indicating, for each pixel, whether the background is occluded by the foreground.
The first depth information is depth information of the target object in the video.
Optionally, the first depth information and the second depth information may be estimated by a depth estimation model such as AdaBins. Specifically, an image in RGB format is input to the depth estimation model, and the depth estimation model outputs a depth estimation result of the same size, where each pixel position of the result represents an estimated depth. The depth estimation result may be expressed in the form of a matrix or a grayscale image. Each frame image of the video containing the target object is input to the depth estimation model to obtain the first depth information, and the virtual environment image is input to the depth estimation model to obtain the second depth information.
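As an illustrative sketch only, the depth estimation model can again be treated as a black-box callable; reducing the first depth information to a single value dp_n (here, the median depth over the target mask) is an assumption made for clarity rather than a step stated in the disclosure:

```python
import numpy as np

def estimate_depth_information(frame, target_mask, background, depth_model):
    """Obtain first depth information (target object) and second depth information (background).

    depth_model : assumed black-box monocular depth estimator (e.g. AdaBins-style)
                  mapping an HxWx3 RGB image to an HxW depth map
    """
    frame_depth = depth_model(frame)        # depth of the video frame (foreground)
    bg_depth = depth_model(background)      # depth of the virtual environment image
    dp_n = float(np.median(frame_depth[target_mask > 0.5]))  # summary depth of the target
    return dp_n, bg_depth
```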
In addition, the depth information may also be acquired by other methods, for example, using a depth camera or pre-configuring depth information when shooting a background.
When determining the depth mask information, it may be determined by comparing the magnitudes of the first depth information and the second depth information.
Illustratively, the depth mask information may be obtained by the following formula:
I_mask = { I_mask,w,h = 1 if I_d,w,h < dp_n, else I_mask,w,h = 0 }    (2)

wherein I_mask represents the depth mask information, dp_n represents the first depth information, and I_d,w,h represents the second depth information; w and h denote the coordinates of a pixel, n denotes the index of the target object in the video, and d denotes depth. If the second depth information is less than the first depth information, the pixel value at the corresponding position in the depth mask information matrix is 1; otherwise, the pixel value is 0. The depth mask information characterizes the occlusion relationship between the target object and the virtual environment image.
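A direct sketch of formula (2) as written, assuming dp_n and the background depth map I_d have been obtained as above:

```python
import numpy as np

def depth_mask_from_depths(bg_depth, dp_n):
    """Formula (2): I_mask[w, h] = 1 if I_d[w, h] < dp_n, else 0."""
    return (bg_depth < dp_n).astype(np.float32)
```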
Optionally, the occlusion relationship between the target object and the virtual environment image may also be estimated by a deep neural network model.
In one possible implementation, determining the position information of the target object in the virtual environment image includes: determining initial position information and position adjustment information of a target object in a virtual environment image; and adjusting the initial position information by using the position adjustment information to obtain the position information.
The initial position information of the target object in the virtual environment image may be position information of a video in the virtual environment image, and may be configured in advance. The position adjustment information may be determined according to the position of the target object in any video frame in the video.
Since the input videos cannot be controlled, the proportion of the target object in each input video is unknown. For example, when the target object is a person, the visible body part in the video is unknown. If the videos were placed directly at given locations in the virtual environment image, people at similar locations in the composite image might have different sizes or show different body parts: some videos include only the head and shoulders, while others include the upper body. Therefore, to make the synthesized video more natural and harmonious, the position of each input video in the background is adjusted by position adjustment.
The specific determination of the position adjustment information is described in the following embodiments:
in one possible implementation, determining the position adjustment information includes: determining the positions of key points of a target object in a video; and determining position adjustment information according to the positions of the key points, the size information of the video frames in the video and the size information of the video in the virtual environment image.
The key point positions may be determined from the location of the target object in any video frame of the video.
For example, when the target object is a person, the key point positions may be determined based on the Skinned Multi-Person Linear model (SMPL), and the key points may be distributed over multiple parts of the person, such as the spine, chest, head and neck. The key point detection may be performed by the key point detection method PIXIE to obtain the key point positions, i.e., the key point positions are obtained by the following formula:

{kp_i | i ∈ N_SMPL} = PIXIE(I)    (3)

wherein kp_i denotes the coordinates of the i-th key point, N_SMPL denotes the total number of key points in a video frame, and I denotes the video containing the portrait.
The position adjustment information may include abscissa adjustment information and ordinate adjustment information, and the abscissa adjustment information and the ordinate adjustment information may be determined according to the abscissa and the ordinate of the key point, the size information of the video frame, and the size information of the video in the virtual environment image, respectively.
In one possible implementation, the target object is a person, and determining the position of a key point of the target object in the video includes: in the image of the video, the positions of key points of the target object in the video are determined from points representing different parts of a human body.
In practical application, if the target object is a person, the key point positions can be determined in any frame image of the video. Optionally, the server receives the first frame image of the video first, so the key point positions can be determined according to the different parts of the human body in that first frame image. The key point positions can be determined at multiple parts of the person in the first image, such as the spine, chest, head and neck.
In one possible implementation, fusing a plurality of videos and virtual environment images based on target mask information, position information, and depth mask information includes: and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information, the depth mask information and the size information of the target object in the virtual environment image to obtain a fused video containing the plurality of target objects.
In practical applications, for example in portrait video fusion, the content of the portrait videos is not controllable, which means that some input videos include the whole human body while others include only the upper body. This uncontrollability leads to a lack of realism in the composite video. Therefore, when performing video fusion, size information of the target object in the virtual environment image is additionally introduced, and the fusion is performed based on the target mask information, the position information, the depth mask information and this size information, so that the fused video is more realistic and better presented.
In one possible implementation, the method further includes: determining size adjustment information according to the key point position of the target object in the video and the size information of the video frame of the video; and adjusting the initial size information of the target object in the virtual environment image by using the size adjustment information to obtain the size information of the target object in the virtual environment image.
The initial size information of the target object in the virtual environment image may be the size information of the target object in the corresponding video, for example, the size information of the target object in the corresponding video may be the width and height of a position frame of the target object in a video frame, or may also be the width and height of a video frame in which the target object is located.
Illustratively, the video is a portrait video, and the position and size of the portrait in the virtual scene image are adjusted according to the key points in the first frame of the portrait video. The position and size of the portrait are corrected using two selected key points, see the following formulas (an illustrative sketch implementing them is given after the definitions):

rate_1 = (kpy_1 - kpy_2) / h_in    (4)

rate_2 = 1 - (kpy_2 / h_in)    (5)

ih' = ih / rate_1    (6)

iw' = iw / rate_1    (7)

py' = py + ih * rate_2    (8)

wherein kpy_1 and kpy_2 represent the ordinates of the two selected key points, h_in represents the height of the video frame, ih and iw represent the initial height and initial width of the portrait in the virtual environment image respectively, py represents the initial ordinate of the portrait in the virtual environment image, ih' and iw' represent the adjusted height and width of the portrait in the virtual environment image respectively, and py' represents the adjusted ordinate of the portrait in the virtual environment image.
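A minimal sketch of formulas (4) to (8), assuming the two key-point ordinates and the initial placement are already known; the function and argument names are illustrative only:

```python
def adjust_portrait_placement(kpy1, kpy2, h_in, ih, iw, py):
    """Apply formulas (4)-(8): rescale and re-position the portrait in the background.

    kpy1, kpy2 : ordinates of the two selected key points in the first frame
    h_in       : height of the input video frame
    ih, iw, py : initial height, width and ordinate of the portrait in the
                 virtual environment image
    Returns the adjusted height, width and ordinate (ih', iw', py').
    """
    rate1 = (kpy1 - kpy2) / h_in      # (4) vertical span between the key points
    rate2 = 1.0 - (kpy2 / h_in)       # (5) fraction of the frame below the lower key point
    ih_adj = ih / rate1               # (6)
    iw_adj = iw / rate1               # (7)
    py_adj = py + ih * rate2          # (8)
    return ih_adj, iw_adj, py_adj
```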
In this embodiment, the abscissa of the portrait in the virtual environment image is not adjusted. In practical applications, the abscissa may be appropriately adjusted according to specific needs by referring to adjustment of the ordinate, and the position adjustment may include adjustment of at least one of the abscissa and the ordinate, which is not limited in this application.
In one possible implementation, the method further includes: and sending the fused video to user terminals corresponding to the plurality of videos respectively for displaying.
In practical application, a plurality of user terminals respectively send videos containing target objects, and after the server fuses a plurality of received videos and virtual environment images to obtain fused videos, the fused videos are sent to the plurality of user terminals for display. Illustratively, the server sends the merged video of a plurality of persons in a meeting in the same virtual scene to the terminal devices of a plurality of users in the meeting, and the merged video is displayed for the users.
In order to facilitate a clearer understanding of the technical idea of the present application, the following compares differences between the technical solution of the present application and the related technologies with reference to fig. 3 and fig. 4.
Fig. 3 is a schematic diagram of a video fusion method in the related art. As shown in Fig. 3, in the related art, a plurality of portrait videos are acquired as the foreground input videos I_in,m,v, portrait matting is performed to obtain the portrait mask information, the virtual environment image is used as the background I_bg, and portrait position estimation yields the portrait position and size px_n, py_n, iw_n, ih_n. Based on the portrait mask information and px_n, py_n, iw_n, ih_n, the portrait videos and the background are combined by multi-video fusion, and the fused video is output.
Fig. 4 is a schematic diagram of a video fusion method in a virtual environment according to an embodiment of the present disclosure. As shown in Fig. 4, a plurality of portrait videos sent by a plurality of user terminals are received as the foreground input videos I_in,m,v, and portrait matting is performed to obtain the portrait mask information. The virtual environment image sent by the user equipment is received, or acquired from a preset database, and is used as the background I_bg. Portrait position estimation yields the portrait position and size px_n, py_n, iw_n, ih_n, and the first frame image I_in,0,v of each portrait video is used to adjust the position and size of the portrait, giving the adjusted position and size px'_n, py'_n, iw'_n, ih'_n. The portrait video is input into the depth estimation model for portrait depth estimation to obtain the portrait depth information dp_n, and the background image is input into the depth estimation model for background depth estimation to obtain the background depth information I_d; the depth mask information I_mask,n is then generated from the portrait depth information and the background depth information, indicating whether each pixel is background occluded by the foreground. The plurality of portrait videos and the virtual environment image are combined by multi-video fusion according to the target mask information, the adjusted position information and the depth mask information, and the fused video is output. In this embodiment, video fusion is performed based on the depth mask information, the adjusted position information and the adjusted size information, so that the occlusion relationship between the portrait and the virtual environment can be accurately expressed, and the obtained fused video is more natural.
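Tying the earlier sketches together, the following outline (an assumption-level illustration, not the patented implementation) mirrors the flow of Fig. 4; `warp_to_background` is a hypothetical helper that places a frame and its mask at the adjusted position and size in background coordinates, and `fuse_frame` is the compositing sketch given above:

```python
import numpy as np

def fuse_videos_into_virtual_scene(videos, background, placements,
                                   matting_model, depth_model, warp_to_background):
    """Per-frame pipeline sketch mirroring Fig. 4.

    videos     : list of per-video frame lists, each frame HxWx3 float in [0, 1]
    background : virtual environment image, HxWx3 float in [0, 1]
    placements : assumed per-video adjusted placement (px', py', iw', ih')
    """
    bg_depth = depth_model(background)                 # second depth information
    fused_frames = []
    for m in range(min(len(frames) for frames in videos)):
        placed = []
        for v, frames in enumerate(videos):
            frame = frames[m]
            target_mask = matting_model(frame)                              # step S202
            dp_n = float(np.median(depth_model(frame)[target_mask > 0.5]))  # first depth info
            depth_mask = (bg_depth < dp_n).astype(np.float32)               # formula (2)
            warped_frame, warped_mask = warp_to_background(frame, target_mask,
                                                           placements[v],
                                                           background.shape[:2])
            placed.append((warped_frame, warped_mask, depth_mask))
        # fuse_frame is the compositing sketch defined earlier in this description
        fused_frames.append(fuse_frame(background, placed))
    return fused_frames
```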
In addition, an embodiment of the present application further provides an online video conference communication method, where an execution subject may be a computing device, for example, a server or a user terminal. As shown in fig. 5, the method includes:
step S501, receiving a first video stream transmitted by first user equipment, and acquiring a first target object from the first video stream;
step S502, receiving a virtual environment image;
step S503, receiving a second video stream transmitted by a second user equipment, and acquiring a second target object from the second video stream;
step S504, aiming at any one of the first target object and the second target object, determining target mask information of any one target object in a video stream, position information of any one target object in a virtual environment image and depth mask information of any one target object in the virtual environment image, wherein the depth mask information represents an occlusion relation between the target object and the virtual environment image;
step S505, based on the target mask information, the position information, and the depth mask information, the first video stream, the second video stream, and the virtual environment image are fused to obtain a fused video stream including the first target object and the second target object.
The first video stream and the second video stream may be video streams recorded by different users while participating in an online video conference. The first target object and the second target object may each be a user appearing in the corresponding video stream.
The specific implementation process in steps S501 to S505 is the same as the implementation process of the video fusion method in the virtual environment in the above embodiment, and details are not repeated here.
In this embodiment, the video streams are fused based on the depth mask information, so that the video streams of a plurality of users participating in the online video conference can be fused with the virtual environment image. This achieves the effect of presenting the participating users in the same virtual environment, accurately represents the occlusion relationship between each portrait and the virtual environment, and makes the obtained fused video more realistic and better presented.
Corresponding to the application scenario and the method of the video fusion method in the virtual environment provided by the embodiment of the present application, the embodiment of the present application further provides a video fusion device in the virtual environment. Fig. 6 is a block diagram illustrating a video fusion apparatus in a virtual environment according to an embodiment of the present disclosure, where the video fusion apparatus in the virtual environment may include:
an obtaining module 601, configured to obtain a virtual environment image and multiple videos, where the multiple videos include multiple target objects;
a determining module 602, configured to determine, for any target object in a plurality of target objects, target mask information of the target object in a video of the target object, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, where the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
a fusion module 603, configured to fuse the multiple videos and the virtual environment image based on the target mask information, the position information, and the depth mask information, so as to obtain a fused video including multiple target objects.
The embodiment of the application provides a video fusion device in a virtual environment, which acquires a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects; determines, for any target object among the plurality of target objects, target mask information of the target object in the video to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image; and fuses the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the plurality of target objects. In this embodiment, video fusion is performed based on the depth mask information, so that the occlusion relationship between the target object and the virtual environment can be accurately expressed, making the obtained fused video more realistic and better presented.
In one possible implementation, the determining module 602, when determining the depth mask information of the target object in the virtual environment image, is configured to:
acquiring first depth information of a target object in a video and second depth information of a virtual environment image;
and determining depth mask information according to the comparison result of the first depth information and the second depth information.
In one possible implementation, the determining module 602, when determining the position information of the target object in the virtual environment image, is configured to:
determining initial position information and position adjustment information of a target object in a virtual environment image;
and adjusting the initial position information by using the position adjustment information to obtain the position information.
In one possible implementation, the determining module 602, when determining the position adjustment information, is configured to:
determining the positions of key points of a target object in a video;
and determining position adjustment information according to the positions of the key points, the size information of the video frames in the video and the size information of the video in the virtual environment image.
In one possible implementation, the target object is a human, and the determining module 602, when determining the location of the key point of the target object in the video, is configured to:
in the image of the video, determine the positions of the key points of the target object in the video from points representing different parts of a human body.
In a possible implementation manner, the fusion module 603 is configured to:
and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information, the depth mask information and the size information of the target object in the virtual environment image to obtain a fused video containing the plurality of target objects.
In one possible implementation, the apparatus further includes a resizing module to:
determining size adjustment information according to the key point position of the target object in the video and the size information of the video frame of the video;
and adjusting the initial size information of the target object in the virtual environment image by using the size adjustment information to obtain the size information of the target object in the virtual environment image.
In one possible implementation manner, the apparatus further includes a sending module, configured to:
and sending the fused video to user terminals corresponding to the plurality of videos respectively for displaying.
The functions of each module in each device in the embodiment of the present application can be referred to the corresponding description in the above method, and have corresponding beneficial effects, which are not described herein again.
Corresponding to the application scenario and the method of the online video conference communication method provided by the embodiment of the application, the embodiment of the application further provides an online video conference communication device. As shown in fig. 7, the apparatus may include:
a first receiving module 701, configured to receive a first video stream transmitted by a first user equipment, and obtain a first target object from the first video stream;
a second receiving module 702, configured to receive a virtual environment image;
a third receiving module 703, configured to receive a second video stream transmitted by a second user equipment, and obtain a second target object from the second video stream;
an information determining module 704, configured to determine, for any one of the first target object and the second target object, target mask information of the target object in the video stream to which it belongs, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
the video fusion module 705 is configured to fuse the first video stream, the second video stream, and the virtual environment image based on the target mask information, the position information, and the depth mask information, so as to obtain a fused video stream including the first target object and the second target object.
The functions of the modules in the apparatuses in the embodiment of the present application may refer to the corresponding descriptions in the above method, and have corresponding beneficial effects, which are not described herein again.
FIG. 8 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 8, the electronic apparatus includes: a memory 810 and a processor 820, the memory 810 having stored therein computer programs operable on the processor 820. The processor 820, when executing the computer program, implements the methods in the embodiments described above. The number of the memory 810 and the processor 820 may be one or more.
The electronic device further includes:
a communication interface 830, configured to communicate with an external device and perform interactive data transmission.
If the memory 810, the processor 820 and the communication interface 830 are implemented independently, the memory 810, the processor 820 and the communication interface 830 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
Optionally, in an implementation, if the memory 810, the processor 820 and the communication interface 830 are integrated on a chip, the memory 810, the processor 820 and the communication interface 830 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and run an instruction stored in a memory from the memory, so that a communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory. The memory may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may include a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM may be used, for example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method described in a flow diagram or otherwise herein may be understood as representing a module, segment, or portion of code, which includes one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps described in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or a portion of the steps of the method of the above embodiments may be performed by associated hardware that is instructed by a program, which may be stored in a computer-readable storage medium, that when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The above-described integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope described in the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of video fusion in a virtual environment, the method comprising:
acquiring a virtual environment image and a plurality of videos, wherein the videos comprise a plurality of target objects;
for any target object in the plurality of target objects, determining target mask information of the target object in the video, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the plurality of target objects.
2. The method of claim 1, wherein determining depth mask information for the target object in the virtual environment image comprises:
acquiring first depth information of the target object in the video and second depth information of the virtual environment image;
determining the depth mask information according to a comparison result of the first depth information and the second depth information.
3. The method of claim 1, wherein determining the position information of the target object in the virtual environment image comprises:
determining initial position information and position adjustment information of the target object in the virtual environment image;
and adjusting the initial position information by using the position adjustment information to obtain the position information.
4. The method of claim 3, wherein determining position adjustment information comprises:
determining the position of a key point of the target object in the video;
and determining the position adjustment information according to the key point position, the size information of the video frame in the video and the size information of the video in the virtual environment image.
5. The method of claim 4, wherein the target object is a person, and wherein the determining the keypoint location of the target object in the video comprises:
in the image of the video, the positions of key points of the target object in the video are determined from points representing different parts of a human body.
6. The method according to any of claims 1-5, wherein said fusing the plurality of videos and the virtual environment image based on the target mask information, the location information, and the depth mask information comprises:
and fusing the plurality of videos and the virtual environment image based on the target mask information, the position information, the depth mask information and the size information of the target object in the virtual environment image to obtain a fused video containing a plurality of target objects.
7. The method of claim 6, further comprising:
determining size adjustment information according to the key point position of the target object in the video and the size information of the video frame of the video;
and adjusting the initial size information of the target object in the virtual environment image by using the size adjustment information to obtain the size information of the target object in the virtual environment image.
8. The method according to any one of claims 1-5, further comprising:
and sending the fused video to user terminals corresponding to the videos respectively for displaying.
9. An online video conference communication method, the method comprising:
receiving a first video stream transmitted by first user equipment, and acquiring a first target object from the first video stream;
receiving a virtual environment image;
receiving a second video stream transmitted by second user equipment, and acquiring a second target object from the second video stream;
for any one of the first target object and the second target object, determining target mask information of the any target object in the video stream, position information of the any target object in the virtual environment image, and depth mask information of the any target object in the virtual environment image, wherein the depth mask information represents an occlusion relationship between the any target object and the virtual environment image;
and fusing the first video stream, the second video stream and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video stream containing the first target object and the second target object.
10. An apparatus for video fusion in a virtual environment, the apparatus comprising:
an acquisition module, configured to acquire a virtual environment image and a plurality of videos, wherein the plurality of videos comprise a plurality of target objects;
a determining module, configured to determine, for any target object in the plurality of target objects, target mask information of the target object in the video, position information of the target object in the virtual environment image, and depth mask information of the target object in the virtual environment image, where the depth mask information represents an occlusion relationship between the target object and the virtual environment image;
and a fusion module, configured to fuse the plurality of videos and the virtual environment image based on the target mask information, the position information and the depth mask information to obtain a fused video containing the plurality of target objects.
11. An electronic device, comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-9 when executing the computer program.
CN202211106196.3A 2022-09-09 2022-09-09 Video fusion method in virtual environment and online video conference communication method Pending CN115665361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211106196.3A CN115665361A (en) 2022-09-09 2022-09-09 Video fusion method in virtual environment and online video conference communication method

Publications (1)

Publication Number Publication Date
CN115665361A true CN115665361A (en) 2023-01-31

Family

ID=84983899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211106196.3A Pending CN115665361A (en) 2022-09-09 2022-09-09 Video fusion method in virtual environment and online video conference communication method

Country Status (1)

Country Link
CN (1) CN115665361A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115988171A (en) * 2023-03-20 2023-04-18 全时云商务服务股份有限公司 Video conference system and immersive layout method and device thereof
CN115988171B (en) * 2023-03-20 2023-08-11 全时云商务服务股份有限公司 Video conference system and immersive layout method and device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination