WO2024001661A1

WO2024001661A1 - Video synthesis method and apparatus, device, and storage medium

Info

Publication number: WO2024001661A1
Application number: PCT/CN2023/097738
Authority: WO
Inventors: 谢炜航
Original assignee: 北京新唐思创教育科技有限公司
Priority date: 2022-06-28
Filing date: 2023-06-01
Publication date: 2024-01-04
Also published as: CN114845136B; CN114845136A

Abstract

The present disclosure relates to the technical field of computers. Disclosed are a video synthesis method and apparatus, a device, and a storage medium. The method is applied to a server and comprises: receiving a user video stream, the user video stream being a video stream filmed by means of a camera of a user terminal; recording a target virtual scene by using a target-view-angle camera independent of a user-view-angle camera so as to generate a scene video stream at a target view angle, the target virtual scene being a virtual scene corresponding to a theme virtual space displayed on the user terminal; and fusing the user video stream and the scene video stream to generate a synthesized video stream. The technical solution lowers the requirements for the device performance of the user terminal and a network, solves the problems of slow uploading and frame loss of the scene video stream and the like and thus improves the efficiency of video synthesis and the smoothness of the synthesized video stream, and also improves the consistency of the content of the synthesized video stream and the content of the target virtual scene.

Description

视频合成方法、装置、设备和存储介质Video synthesis method, device, equipment and storage medium

本申请要求申请日为2022年06月28日，申请号为“202210740529.1”，专利名称为“视频合成方法、装置、设备和存储介质”的发明申请的优先权，其全部内容通过引用并入本文。This application claims priority for an invention application with a filing date of June 28, 2022, an application number of "202210740529.1", and a patent title of "Video synthesis method, device, equipment and storage medium", the entire content of which is incorporated herein by reference. .

技术领域Technical field

本公开涉及计算机技术领域，尤其涉及一种视频合成方法、装置、设备和存储介质。The present disclosure relates to the field of computer technology, and in particular, to a video synthesis method, device, equipment and storage medium.

背景技术Background technique

随着互联网技术的发展，各资源共享平台提供了诸多视频相关的功能。例如，将用户的真实摄像头画面与特定主题场景下的虚拟场景内容进行视频融合，生成合成视频，以供用户后期消费。With the development of Internet technology, various resource sharing platforms provide many video-related functions. For example, the user's real camera footage is video-fused with the virtual scene content in a specific theme scene to generate a synthetic video for the user's later consumption.

目前的视频合成方案，主要有人工编辑方式和服务端自动合成方式。其中，人工编辑方式大致是人工使用视频编辑软件对用户的真实摄像头画面和虚拟场景内容进行合成编辑。服务端自动合成方式大致是用户终端获取用户的真实摄像头画面和虚拟场景内容，并将两者发送至服务端进行自动合成处理。The current video synthesis solutions mainly include manual editing and server-side automatic synthesis. Among them, the manual editing method is roughly to manually use video editing software to synthesize and edit the user's real camera footage and virtual scene content. The server-side automatic synthesis method is roughly that the user terminal obtains the user's real camera picture and virtual scene content, and sends both to the server for automatic synthesis processing.

但是，人工编辑方式耗时耗力，无法满足批量视频合成处理的需求；服务端自动合成方式对网络和用户终端性能的要求均较高，容易造成合成视频画面卡顿的现象。However, the manual editing method is time-consuming and labor-intensive and cannot meet the needs of batch video synthesis processing; the server-side automatic synthesis method has high requirements on network and user terminal performance, which can easily cause the synthesized video screen to freeze.

发明内容Contents of the invention

为了解决上述技术问题，本公开提供了一种视频合成方法、装置、设备和存储介质。In order to solve the above technical problems, the present disclosure provides a video synthesis method, device, equipment and storage medium.

第一方面，本公开提供了一种视频合成方法，应用于服务端，该方法包括：接收用户视频流；其中，所述用户视频流为通过用户终端的摄像头拍摄所得的视频流；利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，所述目标虚拟场景为所述用户终端中显示的主题虚拟空间对应的虚拟场景；融合所述用户视频流和所述场景视频流，生成合成视频流。In a first aspect, the present disclosure provides a video synthesis method, which is applied to a server. The method includes: receiving a user video stream; wherein the user video stream is a video stream captured by a camera of a user terminal; using independent The target perspective camera of the user perspective camera records the target virtual scene and generates a scene video stream under the target perspective; wherein the target virtual scene is a virtual scene corresponding to the theme virtual space displayed in the user terminal; the fusion of the The user video stream and the scene video stream generate a composite video stream.

第二方面，本公开提供了一种视频合成装置，配置于服务端，该装置包括：用户视频流接收模块，用于接收用户视频流；其中，所述用户视频流为通过用户终端的摄像头拍摄所得的视频流；场景视频流生成模块，用于利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，所述目标虚拟场景为所述用户终端中显示的主题虚拟空间对应的虚拟场景；第一合成视频流生成模块，用于融合所述用户视频流和所述场景视频流，生成合成视频流。In a second aspect, the present disclosure provides a video synthesis device, which is configured on a server. The device includes: a user video stream receiving module, configured to receive a user video stream; wherein the user video stream is captured by a camera of a user terminal. The resulting video stream; the scene video stream generation module is used to record the target virtual scene using a target perspective camera that is independent of the user perspective camera, and generate a scene video stream under the target perspective; wherein the target virtual scene is the The virtual scene corresponding to the theme virtual space displayed in the user terminal; the first synthetic video stream generation module is used to fuse the user video stream and the scene video stream to generate a synthetic video stream.

第三方面，本公开提供了一种的电子设备，该电子设备包括：处理器；以及存储程序的存储器，其中，所述程序包括指令，所述指令在由所述处理器执行时使所述处理器执行本公开任意实施例所说明的视频合成方法。 In a third aspect, the present disclosure provides an electronic device, the electronic device including: a processor; and a memory storing a program, wherein the program includes instructions that, when executed by the processor, cause the The processor executes the video synthesis method described in any embodiment of the present disclosure.

第四方面，本公开提供了一种存储有计算机指令的非瞬时计算机可读存储介质，所述计算机指令用于使所述计算机执行本公开任意实施例所说明的视频合成方法。In a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause the computer to execute the video synthesis method described in any embodiment of the present disclosure.

本公开实施例中提供的一个或多个技术方案，能够接收通过用户终端的摄像头拍摄所得的用户视频流，以及利用独立于用户视角相机的目标视角相机，对用户终端中显示的主题虚拟空间对应的目标虚拟场景进行录制，生成目标视角下的场景视频流；并且融合所述用户视频流和所述场景视频流，生成合成视频流；一方面，实现了在服务端中自动生成合成视频流，避免了人工合成视频存在的费时费力的问题；另一方面，通过服务端录制场景视频流，避免了在用户终端录制场景视频流并上传至服务端的过程中因设备性能和网络等原因而造成的合成视频卡顿的问题，既降低了对用户终端的设备性能和网络的要求，又解决了场景视频流上传慢和丢帧等问题，提高了视频合成的效率以及合成视频流的流畅性；又一方面，通过对目标虚拟场景进行录制而得到场景视频流，提高了合成视频流与目标虚拟场景的内容一致性。One or more technical solutions provided in the embodiments of the present disclosure can receive user video streams captured by the camera of the user terminal, and use a target perspective camera that is independent of the user perspective camera to correspond to the theme virtual space displayed in the user terminal. Record the target virtual scene to generate a scene video stream from the target perspective; and fuse the user video stream and the scene video stream to generate a synthetic video stream; on the one hand, the automatic generation of the synthetic video stream in the server is realized, It avoids the time-consuming and labor-intensive problem of artificially synthesized videos; on the other hand, recording the scene video stream through the server avoids problems caused by equipment performance and network reasons in the process of recording the scene video stream at the user terminal and uploading it to the server. The problem of synthetic video freezing not only reduces the requirements for user terminal equipment performance and network, but also solves the problems of slow scene video stream upload and frame loss, improves the efficiency of video synthesis and the smoothness of the synthetic video stream; and On the one hand, the scene video stream is obtained by recording the target virtual scene, which improves the content consistency between the synthetic video stream and the target virtual scene.

附图说明Description of drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本公开的实施例，并与说明书一起用于解释本公开的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

为了更清楚地说明本公开实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，对于本领域普通技术人员而言，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those of ordinary skill in the art, It is said that other drawings can be obtained based on these drawings without exerting creative labor.

图1是本公开实施例提供的一种视频合成方法的流程图；Figure 1 is a flow chart of a video synthesis method provided by an embodiment of the present disclosure;

图2是本公开实施例提供的一种用户视频流的显示示意图；Figure 2 is a schematic display diagram of a user video stream provided by an embodiment of the present disclosure;

图3是本公开实施例提供的一种合成视频流的显示示意图；Figure 3 is a schematic display diagram of a synthetic video stream provided by an embodiment of the present disclosure;

图4是本公开实施例提供的另一种视频合成方法的流程图；Figure 4 is a flow chart of another video synthesis method provided by an embodiment of the present disclosure;

图5是本公开实施例提供的又一种视频合成方法的流程图；Figure 5 is a flow chart of yet another video synthesis method provided by an embodiment of the present disclosure;

图6是本公开实施例提供的一种视频合成装置的结构示意图；Figure 6 is a schematic structural diagram of a video synthesis device provided by an embodiment of the present disclosure;

图7是本公开实施例提供的一种电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

具体实施方式Detailed ways

下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例，然而应当理解的是，本公开可以通过各种形式来实现，而且不应该被解释为限于这里阐述的实施例，相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是，本公开的附图及实施例仅用于示例性作用，并非用于限制本公开的保护范围。Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, which rather are provided for A more thorough and complete understanding of this disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of the present disclosure.

应当理解，本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行，和/或并行执行。此外，方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。 It should be understood that various steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performance of illustrated steps. The scope of the present disclosure is not limited in this regard.

本文使用的术语“包括”及其变形是开放性包括，即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”；术语“另一实施例”表示“至少一个另外的实施例”；术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。需要注意，本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分，并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。As used herein, the term "include" and its variations are open-ended, ie, "including but not limited to." The term "based on" means "based at least in part on." The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; and the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below. It should be noted that concepts such as “first” and “second” mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units. Or interdependence.

需要注意，本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的，本领域技术人员应当理解，除非在上下文另有明确指出，否则应该理解为“一个或多个”。It should be noted that the modifications of "one" and "plurality" mentioned in this disclosure are illustrative and not restrictive. Those skilled in the art will understand that unless the context clearly indicates otherwise, it should be understood as "one or Multiple”.

本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的，而并不是用于对这些消息或信息的范围进行限制。The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.

本公开实施例提供的视频合成方法，主要适用于将用户终端的摄像头采集的用户视频流与虚拟场景对应的场景视频流进行视频合成的情况。在一些实施例中，该视频合成方法可适用于在短视频的主题场景下，将用户的真实摄像头画面与特效音视频内容进行融合，生成合成特效视频。在另一些实施例中，该视频合成方法可适用于在教育主题、游戏主题或直播间主题下，将用户的真实摄像头画面无缝融合到对应主题的虚拟场景中，生成相应主题下的合成视频(如包含用户画面的回放视频)。The video synthesis method provided by the embodiments of the present disclosure is mainly suitable for video synthesis of a user video stream collected by a camera of a user terminal and a scene video stream corresponding to a virtual scene. In some embodiments, the video synthesis method can be applied to fuse the user's real camera footage with special effects audio and video content in the theme scene of a short video to generate a synthesized special effects video. In other embodiments, the video synthesis method can be used to seamlessly integrate the user's real camera footage into the virtual scene of the corresponding theme under an educational theme, a game theme, or a live broadcast room theme, and generate a synthetic video under the corresponding theme. (such as playback video containing user images).

本公开实施例提供的视频合成方法可以由视频合成装置来执行，该装置可以由软件和/或硬件的方式实现，该装置可以集成在服务端对应的电子设备中，例如可以是笔记本电脑、台式电脑、服务器或服务器集群等。The video synthesis method provided by the embodiments of the present disclosure can be executed by a video synthesis device. The device can be implemented in software and/or hardware. The device can be integrated in the corresponding electronic device of the server, such as a laptop computer, desktop computer, etc. Computers, servers or server clusters, etc.

图1是本公开实施例提供的一种视频合成方法的流程图。参见图1，该视频合成方法具体包括：Figure 1 is a flow chart of a video synthesis method provided by an embodiment of the present disclosure. Referring to Figure 1, the video synthesis method specifically includes:

S110、接收用户视频流。S110. Receive user video stream.

其中，用户视频流为通过用户终端的摄像头拍摄所得的视频流。Among them, the user video stream is a video stream captured by the camera of the user terminal.

具体地，根据上述说明，本公开实施例中的视频合成是将用户终端的摄像头采集的真实画面与虚拟场景对应的场景画面进行融合。所以，服务端会接收用户终端发送的用户视频流。Specifically, according to the above description, the video synthesis in the embodiment of the present disclosure is to fuse the real picture captured by the camera of the user terminal with the scene picture corresponding to the virtual scene. Therefore, the server will receive the user video stream sent by the user terminal.

在一些实施例中，S110包括：通过实时通信传输协议，从用户终端接收用户视频流。In some embodiments, S110 includes: receiving a user video stream from a user terminal through a real-time communication transport protocol.

具体地，相关技术中，用户视频流根据传输控制协议(Transmission Control Protocol,TCP)，从用户终端传输至服务端。但是，因用户视频流的数据量相对较大，且TCP传输协议需要进行三次握手，容易造成传输延迟甚至丢帧的问题。所以，本实施例中采用实时通信传输协议(Real-time Communications，RTC)进行用户视频流的传输。这是因为RTC传输协议中携带有冗余字段，其可用于精准判断是否存在丢包，且其链路上的UDP传输为单向传输，无需进行三次握手，使得该传输协议对网络的要求较低，从而使得用户视频流的传输具有极强的抗弱网性，进而降低用户视频流传输的网络延迟，一定程度上避免丢帧问题。Specifically, in the related technology, the user video stream is transmitted from the user terminal to the server according to the Transmission Control Protocol (TCP). However, due to the relatively large amount of data in the user video stream, and the TCP transmission protocol requiring a three-way handshake, it is easy to cause transmission delays or even frame loss. Therefore, in this embodiment, Real-time Communications (RTC) is used to transmit user video streams. This is because the RTC transmission protocol carries redundant fields, which can be used to accurately determine whether there is packet loss, and the UDP transmission on its link is one-way transmission without the need for a three-way handshake, making this transmission protocol more demanding on the network. Low, thus making the transmission of user video streams extremely resistant to weak networks, thereby reducing the network delay of user video stream transmission and avoiding frame loss problems to a certain extent.

S120、利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流。S120. Use the target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate Scene video streaming from the target perspective.

其中，用户视角相机是用户通过用户终端观看目标虚拟场景时的观看视角对应的、渲染引擎中的虚拟相机。目标视角是合成视频流所需的观看视角，例如可以是用户之外的旁观者所在的视角。目标视角相机是目标视角对应的、渲染引擎中的虚拟相机。目标虚拟场景为用户终端中显示的主题虚拟空间对应的虚拟场景。主题虚拟空间是应用场景对应的网络空间。示例性地，主题虚拟空间包括线上直播间、虚拟游戏房间或虚拟教育空间。场景视频流是对目标虚拟场景进行录制而生成的视频流。Among them, the user perspective camera is a virtual camera in the rendering engine corresponding to the viewing perspective when the user views the target virtual scene through the user terminal. The target perspective is the viewing perspective required for the synthesized video stream, which may be the perspective of a bystander other than the user, for example. The target perspective camera is a virtual camera in the rendering engine corresponding to the target perspective. The target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal. The theme virtual space is the network space corresponding to the application scenario. For example, the themed virtual space includes an online live broadcast room, a virtual game room or a virtual education space. The scene video stream is a video stream generated by recording the target virtual scene.

具体地，相关技术中是通过用户终端来录制场景视频流的，这样就需要用户终端来上传场景视频流，从而会存在上述场景视频流上传延迟和丢帧的问题，进而导致视频卡顿。所以，本公开实施例中直接在用户终端对应的服务端中开启一个目标视角相机，并利用该目标视角相机，沿着目标视角，对服务端中运行的目标虚拟场景进行录制，生成目标视角下的场景视频流。Specifically, in the related art, the scene video stream is recorded through the user terminal, which requires the user terminal to upload the scene video stream. This will cause the above-mentioned scene video stream upload delay and frame loss problems, which will lead to video freezes. Therefore, in the embodiment of the present disclosure, a target perspective camera is directly opened in the server corresponding to the user terminal, and the target perspective camera is used to record the target virtual scene running in the server along the target perspective, and generate the target perspective camera. scene video streaming.

例如，对于应用程序主体运行在云端的应用场景(如云游戏、云直播、云课堂等等)，在云端对应的服务端中原本就运行有与用户终端同步的目标虚拟场景，此时，可直接在云端对应的服务端中开启目标视角相机对目标虚拟场景进行录制，得到场景视频流。For example, for application scenarios where the application main body runs in the cloud (such as cloud games, cloud live broadcasts, cloud classrooms, etc.), the corresponding server in the cloud originally runs a target virtual scene synchronized with the user terminal. At this time, you can Directly turn on the target perspective camera in the corresponding server in the cloud to record the target virtual scene and obtain the scene video stream.

再如，对于应用程序主体运行在用户终端的应用场景(如普通游戏、线上教育等等)，因服务端中并未运行应用程序的主体部分，故服务端中可能并未运行目标虚拟场景，此时，需要在用户终端对应的服务端中开启一个服务，以运行目标虚拟场景，并在该服务中启动一个目标视角相机。当服务端接收到场景录制的指令时，服务端开始利用目标视角相机对目标虚拟场景进行录制，得到场景视频流。For another example, for application scenarios where the main body of the application runs on the user terminal (such as ordinary games, online education, etc.), because the main part of the application is not running on the server, the target virtual scene may not be running on the server. , At this time, a service needs to be opened in the server corresponding to the user terminal to run the target virtual scene, and a target perspective camera is started in the service. When the server receives the scene recording instruction, the server starts to record the target virtual scene using the target perspective camera to obtain the scene video stream.

需要说明的是，为了避免录制场景视频流对用户正常使用应用场景对应的应用程序功能的影响，服务端可以采用后端处理的方式对目标虚拟场景进行录制和渲染，即场景视频流的生成过程独立于应用场景对应的应用程序主体的运行过程。至于该生成场景视频流的过程的执行主体，其可以是应用程序主体的执行服务端中开辟的独立线程，也可以是重新启动的一个服务端。It should be noted that in order to avoid the impact of recording the scene video stream on the application functions corresponding to the user's normal use of the application scene, the server can use back-end processing to record and render the target virtual scene, that is, the generation process of the scene video stream It is independent of the running process of the application main body corresponding to the application scenario. As for the execution subject of the process of generating the scene video stream, it can be an independent thread opened in the execution server of the application subject, or it can be a restarted server.

参见图2，以线上教育中的线上演讲应用场景为例，用户终端中显示用户视角相机渲染所得的三维虚拟讲堂场景的视频流，并且会在左上角的位置处显示用户终端的摄像头采集的真实用户画面。服务端在响应用户终端的显示请求之外，还可以从目标视角对目标虚拟场景进行录制，如图3所示。图3中，服务端以观众视角对应的目标视角相机对三维虚拟讲堂场景进行录制，生成观众视角下的场景视频流。Referring to Figure 2, taking the online lecture application scenario in online education as an example, the user terminal displays the video stream of the three-dimensional virtual lecture scene rendered by the user's perspective camera, and the user terminal's camera collection is displayed in the upper left corner. of real user footage. In addition to responding to the display request of the user terminal, the server can also record the target virtual scene from the target perspective, as shown in Figure 3. In Figure 3, the server records the three-dimensional virtual lecture hall scene with the target perspective camera corresponding to the audience perspective, and generates a scene video stream from the audience perspective.

S130、融合用户视频流和场景视频流，生成合成视频流。S130. Fusion of the user video stream and the scene video stream to generate a composite video stream.

具体地，服务端中将用户视频流嵌入场景视频流中的某个位置，以生成包含用户真实画面和虚拟场景画面的合成视频流。Specifically, the server embeds the user video stream at a certain position in the scene video stream to generate a synthetic video stream containing the user's real picture and the virtual scene picture.

在一些实施例中，目标虚拟场景中包括预置视图。该预置视图是指预先在目标虚拟场景中设置的视图层，其用于承载用户视频流。该预置视图的位置是可以自定义设置的；也可以根据目标虚拟场景中包含的各虚拟物体的类型和/或空间位置来确定预置视图的位置。例如，对于上述三维虚拟讲堂场景的示例，该目标虚拟场景中包含用于播放演讲相关信息的虚拟屏幕，那么可在该虚拟屏幕的位置处设置预置视图。再如，可以在目标虚拟场景中的虚拟物体较少的空闲区域处设置预置视图。In some embodiments, the target virtual scene includes a preset view. The preset view refers to the view layer pre-set in the target virtual scene, which is used to carry the user video stream. The position of the preset view can be customized; the position of the preset view can also be determined based on the type and/or spatial position of each virtual object contained in the target virtual scene. Location. For example, for the above example of the three-dimensional virtual lecture scene, the target virtual scene contains a virtual screen for playing lecture-related information, then the preset view can be set at the position of the virtual screen. For another example, a preset view can be set in a free area with fewer virtual objects in the target virtual scene.

相应地，S130包括：将用户视频流融合至场景视频流中的预置视图处，生成合成视频流。Correspondingly, S130 includes: fusing the user video stream to a preset view in the scene video stream to generate a composite video stream.

具体地，服务端可以将用户视频流输入预置视图，以将该用户视频流嵌入场景视频流中，所得结果便为合成视频流。如图3中，将虚拟屏幕设置为预置视图，那么服务端将用户视频流嵌入三维虚拟讲堂场景中的虚拟屏幕处，生成观众视角的线上演讲回放视频。Specifically, the server can input the user video stream into the preset view to embed the user video stream into the scene video stream, and the result is a composite video stream. As shown in Figure 3, the virtual screen is set as the preset view, and then the server embeds the user video stream on the virtual screen in the three-dimensional virtual lecture hall scene to generate an online lecture playback video from the audience's perspective.

本公开实施例提供的上述视频合成方法，能够接收通过用户终端的摄像头拍摄所得的用户视频流，以及利用独立于用户视角相机的目标视角相机，对用户终端中显示的主题虚拟空间对应的目标虚拟场景进行录制，生成目标视角下的场景视频流；并且融合用户视频流和场景视频流，生成合成视频流；一方面，实现了在服务端中自动生成合成视频流，避免了人工合成视频存在的费时费力的问题；另一方面，通过服务端录制场景视频流，避免了在用户终端录制场景视频流并上传至服务端的过程中因设备性能和网络等原因而造成的合成视频卡顿的问题，既降低了对用户终端的设备性能和网络的要求，又解决了场景视频流上传慢和丢帧等问题，提高了视频合成的效率以及合成视频流的流畅性；又一方面，通过对目标虚拟场景进行录制而得到场景视频流，提高了合成视频流与目标虚拟场景的内容一致性。The above video synthesis method provided by the embodiment of the present disclosure can receive the user video stream captured by the camera of the user terminal, and use a target perspective camera that is independent of the user perspective camera to perform target virtualization corresponding to the theme virtual space displayed in the user terminal. Record the scene and generate the scene video stream from the target perspective; and integrate the user video stream and the scene video stream to generate a synthetic video stream; on the one hand, it realizes the automatic generation of synthetic video streams in the server, avoiding the problem of artificially synthesized videos. Time-consuming and labor-intensive problem; on the other hand, recording the scene video stream through the server avoids the problem of synthetic video lagging caused by device performance and network reasons during the process of recording the scene video stream on the user terminal and uploading it to the server. It not only reduces the requirements for the equipment performance and network of the user terminal, but also solves the problems of slow scene video stream upload and frame loss, and improves the efficiency of video synthesis and the smoothness of the synthesized video stream; on the other hand, by virtualizing the target The scene is recorded to obtain a scene video stream, which improves the content consistency between the synthesized video stream and the target virtual scene.

图4是本公开实施例提供的另一种视频合成方法的流程图。其增加了根据用户操作指令来生成包含虚拟对象动作响应的相关步骤。其中与上述各实施例相同或相应的术语的解释在此不再赘述。参见图4，该视频合成方法包括：Figure 4 is a flow chart of another video synthesis method provided by an embodiment of the present disclosure. It adds relevant steps to generate action responses containing virtual objects based on user operation instructions. The explanation of terms that are the same as or corresponding to the above embodiments will not be repeated here. Referring to Figure 4, the video synthesis method includes:

S410、接收用户视频流。S410. Receive user video stream.

S420、接收用户操作指令。S420. Receive user operation instructions.

其中，用户操作指令是用户通过操纵用户终端在主题虚拟空间中产生的操作指令，其用于控制用户对应的虚拟人物在主题虚拟空间中的执行动作，例如移动、跳跃等。Among them, user operation instructions are operation instructions generated by the user in the theme virtual space by manipulating the user terminal, which are used to control the execution actions of the user's corresponding virtual character in the theme virtual space, such as moving, jumping, etc.

具体地，在应用程序运行过程中，用户会通过操作用户终端来执行一些对主题虚拟空间中的虚拟对象进行控制的操作，用户终端会将用户的操作转换为对应的用户操作指令，并根据该用户操作指令来触发应用程序控制虚拟对象执行相应的动作响应(即虚拟对象动作响应)。Specifically, during the running of the application, the user will perform some operations to control virtual objects in the theme virtual space by operating the user terminal. The user terminal will convert the user's operations into corresponding user operation instructions, and according to the User operation instructions trigger the application to control the virtual object to perform corresponding action responses (ie, virtual object action responses).

基于上述说明可知，服务端录制场景视频流的过程与应用程序响应用户操作指令的过程之间是相互独立的。那么，为了使得录制的场景视频流与用户所观看到的应用程序的运行结果一致，服务端可以拉取该用户操作指令，以便在目标虚拟场景中恢复相同的虚拟对象动作响应。Based on the above description, it can be seen that the process of recording scene video streams on the server side and the process of the application responding to user operation instructions are independent of each other. Then, in order to make the recorded scene video stream consistent with the running results of the application viewed by the user, the server can pull the user operation instructions to restore the same virtual object action response in the target virtual scene.

在一些实施例中，服务端可以在录制场景视频流的过程与运行应用程序以响应用户操作指令的过程之间建立通信连接，以便将应用程序中生成的用户操作指令传输至录制场景视频流的过程。In some embodiments, the server can establish a communication connection between the process of recording the scene video stream and the process of running the application program in response to user operation instructions, so as to transmit the user operation instructions generated in the application program to the recording The process of scene video streaming.

例如，对于应用程序主体运行在云端的应用场景，服务端可以在分别运行上述两个过程的服务或线程等主体之间建立通信连接，以将应用程序生成的用户操作指令传输至录制场景视频流的过程。For example, for an application scenario where the application main body runs in the cloud, the server can establish a communication connection between the main bodies such as services or threads running the above two processes respectively to transmit the user operation instructions generated by the application to the recording scene video stream the process of.

又如，对于应用程序主体运行在用户终端的应用场景，可以在用户终端和运行目标虚拟场景的服务端之间建立通信连接，以将用户终端中生成的用户操作指令发送至服务端。For another example, for an application scenario where the main application program runs on the user terminal, a communication connection can be established between the user terminal and the server running the target virtual scene to send user operation instructions generated in the user terminal to the server.

在另一些实施例中，服务端创建虚拟用户，并将虚拟用户关联至主题虚拟空间，且从主题虚拟空间中共享用户操作指令。In other embodiments, the server creates virtual users, associates the virtual users with the theme virtual space, and shares user operation instructions from the theme virtual space.

具体地，为了提高用户操作指令的获取效率和同步性，服务端可以创建一个新的虚拟用户，并将该虚拟用户关联至用户终端对应的主题虚拟空间，例如将该虚拟用户以旁观者身份加入虚拟游戏房间。这样，用户终端对应的虚拟用户和该新的虚拟用户便处于同一个主题虚拟空间。所以，服务端可以实时地从主题虚拟空间中共享得到用户操作指令。Specifically, in order to improve the efficiency and synchronization of obtaining user operation instructions, the server can create a new virtual user and associate the virtual user with the theme virtual space corresponding to the user terminal, for example, add the virtual user as a bystander. Virtual game room. In this way, the virtual user corresponding to the user terminal and the new virtual user are in the same theme virtual space. Therefore, the server can share user operation instructions from the theme virtual space in real time.

S430、在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应。S430. Execute the virtual object action response corresponding to the user operation instruction in the target virtual scene.

具体地，服务端在录制场景视频流的过程中，根据获得的用户操作指令，在目标虚拟场景中执行对应的虚拟对象动作响应，以便该目标虚拟场景中呈现与应用程序相同的虚拟对象动作响应。Specifically, during the process of recording the scene video stream, the server executes the corresponding virtual object action response in the target virtual scene according to the obtained user operation instructions, so that the same virtual object action response as the application program is presented in the target virtual scene. .

S440、利用目标视角相机，对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流。S440: Use the target perspective camera to record the target virtual scene and generate a scene video stream containing the virtual object's action response.

具体地，服务端利用目标视角相机，对执行了虚拟对象动作响应的目标虚拟场景进行录制，可得到包含虚拟对象动作响应的、目标视角下的场景视频流。Specifically, the server uses the target perspective camera to record the target virtual scene in which the virtual object's action response is executed, and can obtain a scene video stream from the target perspective that includes the virtual object's action response.

S450、融合用户视频流和场景视频流，生成合成视频流。S450: Fusion of the user video stream and the scene video stream to generate a composite video stream.

本公开实施例提供的上述视频合成方法，通过在目标虚拟场景中执行用户终端生成的用户操作指令对应的虚拟对象动作响应，使得目标虚拟场景中也包含虚拟对象动作响应，并利用目标视角相机对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流；进一步提高了场景视频流与用户所观看到的应用程序的运行结果之间的一致性，从而进一步提高合成视频流与目标虚拟场景的内容一致性。The above video synthesis method provided by the embodiment of the present disclosure executes the virtual object action response corresponding to the user operation instruction generated by the user terminal in the target virtual scene, so that the target virtual scene also contains the virtual object action response, and uses the target perspective camera to The target virtual scene is recorded to generate a scene video stream containing the action response of the virtual object; this further improves the consistency between the scene video stream and the running results of the application viewed by the user, thus further improving the consistency between the synthesized video stream and the target virtual scene. content consistency.

在一些实施例中，用户视频流中携带第一时间戳，且用户操作指令中携带第二时间戳。这里的第一时间戳和第二时间戳均是产生用户操作指令的时刻(也称为指令时间戳)，但是第一时间戳是记录在用户视频流中的指令时间戳，第二时间戳是记录在用户操作指令中的指令时间戳。这是因为用户视频流和用户操作指令的数据量不同，使得用户操作指令先于用户视频流到达服务端。如果信息到达服务端后便被响应，会使得目标虚拟场景中恢复的虚拟对象动作响应与用户视频流不匹配，导致合成视频流中的内容混乱。所以，本实施例中，用户视频流和用户操作指令中均携带指令时间戳，以便后续根据时间戳进行虚拟对象动作响应的执行。 In some embodiments, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp. The first timestamp and the second timestamp here are both the moment when the user operation instruction is generated (also called the instruction timestamp), but the first timestamp is the instruction timestamp recorded in the user video stream, and the second timestamp is The instruction timestamp recorded in the user operation instruction. This is because the data amounts of the user video stream and the user operation instructions are different, so the user operation instructions arrive at the server before the user video stream. If the information is responded to after it reaches the server, the virtual object action response restored in the target virtual scene will not match the user video stream, resulting in confusing content in the synthesized video stream. Therefore, in this embodiment, both the user video stream and the user operation instructions carry instruction timestamps, so that subsequent virtual object action responses can be executed based on the timestamps.

相应地，在S420之后，该视频合成方法还包括：缓存用户操作指令。基于上述说明，用户操作指令到达服务端后，不能直接进行响应，所以，服务端会先缓存该用户操作指令。Correspondingly, after S420, the video synthesis method also includes: caching user operation instructions. Based on the above description, after the user operation instruction reaches the server, it cannot respond directly. Therefore, the server will cache the user operation instruction first.

相应地，S430包括：从各用户操作指令中筛选出第二时间戳小于或等于第一时间戳的目标操作指令；在目标虚拟场景中执行目标操作指令对应的虚拟对象动作响应。Correspondingly, S430 includes: filtering out target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction; and executing a virtual object action response corresponding to the target operation instruction in the target virtual scene.

具体地，服务端在接收到用户视频流之后，便提取其中的第一时间戳。然后，从缓存空间中获得各用户操作指令的第二时间戳，并将第一时间戳与各第二时间戳进行比较，筛选出小于或等于第一时间戳的至少一个第二时间戳。再然后，服务端将筛选出的各第二时间戳对应的用户操作指令作为目标操作指令，并在目标虚拟场景中执行目标操作指令对应的虚拟对象动作响应，以在目标虚拟场景中恢复用户视频流及其之前时刻的虚拟对象动作响应。这样，不仅可确保后续录制的场景视频流与用户观看到的运行结果中包含相同的虚拟对象动作响应，更进一步确保场景视频流中的虚拟对象动作响应与用户观看到的运行结果中的虚拟对象动作响应之间的时间一致性，从而进一步提高场景视频流和用户视频流之间的同步性。Specifically, after receiving the user's video stream, the server extracts the first timestamp therein. Then, obtain the second timestamp of each user operation instruction from the cache space, compare the first timestamp with each second timestamp, and filter out at least one second timestamp that is less than or equal to the first timestamp. Then, the server uses the filtered user operation instructions corresponding to each second timestamp as the target operation instruction, and executes the virtual object action response corresponding to the target operation instruction in the target virtual scene to restore the user video in the target virtual scene. Stream and its virtual object action responses at previous moments. In this way, it is not only ensured that the subsequently recorded scene video stream contains the same virtual object action response as the running result viewed by the user, but also further ensures that the virtual object action response in the scene video stream is consistent with the virtual object action response in the running result viewed by the user. Temporal consistency between action responses, thereby further improving synchronization between the scene video stream and the user video stream.

图5是本公开实施例提供的又一种视频合成方法的流程图。该视频合成方法增加了根据视频模板来生成合成视频流的相关步骤。其中与上述各实施例相同或相应的术语的解释在此不再赘述。参见图5，该视频合成方法包括：Figure 5 is a flow chart of yet another video synthesis method provided by an embodiment of the present disclosure. The video synthesis method adds relevant steps to generate a synthesized video stream based on a video template. The explanations of terms that are the same as or corresponding to the above embodiments will not be repeated here. Referring to Figure 5, the video synthesis method includes:

S510、接收用户视频流。S510. Receive user video stream.

具体地，服务端可根据应用需求(如视频合成速度、视频合成精度等)来继续执行S520～S530，或者执行S540～S550。Specifically, the server may continue to execute S520-S530, or execute S540-S550 according to application requirements (such as video synthesis speed, video synthesis accuracy, etc.).

S520、利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流。S520: Use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective.

S530、融合用户视频流和场景视频流，生成合成视频流。S530: Fusion of the user video stream and the scene video stream to generate a composite video stream.

S540、基于模板筛选条件，从各预设视频模板中确定与目标虚拟场景对应的目标视频模板。S540. Based on the template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template.

其中，模板筛选条件是预先设置的、对各预设视频模板进行筛选的维度。预设视频模板是预先设置的视频模板，其中包含可以融合外部视频的空白部分以及不可变化的视频部分，并且不可变化的视频部分中可以包含预设的人物形象、预设的特效组件等等。本公开实施例中，模板筛选条件包括用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个。用户信息是与用户相关的信息，例如该用户信息包括用户情绪和/或用户年龄，且用户信息用于匹配预设视频模板中的人物形象。用户操作指令用于匹配预设视频模板中的录制视角。播放音频用于匹配预设视频模板中的特效组件。Among them, the template filtering conditions are preset dimensions for filtering each preset video template. The preset video template is a preset video template, which contains a blank part that can be integrated with external videos and an immutable video part, and the immutable video part can contain preset characters, preset special effects components, etc. In the embodiment of the present disclosure, the template filtering conditions include at least one of the video duration of the user's video stream, user information, user operation instructions, and playing audio. User information is information related to the user. For example, the user information includes the user's emotion and/or the user's age, and the user information is used to match the character image in the preset video template. User operation instructions are used to match the recording angle in the preset video template. Playing audio is used to match the special effects components in the preset video template.

具体地，服务端中预先存储了多个预设视频模板。在接收到用户视频流之后，服务端可以根据模板筛选条件，从多个预设视频模板中筛选出适配的预设视频模板，作为目标视频模板。Specifically, multiple preset video templates are pre-stored in the server. After receiving the user's video stream, the server can select an adapted preset video template from multiple preset video templates according to the template filtering conditions as the target video template.

例如，模板筛选条件中包含了用户视频流的视频时长，那么可以根据该视频时长匹配预设视频模板中的空白部分的时长，以确保筛选出的目标视频模板中可融合入该用户视频流。For example, if the template filter condition includes the video duration of the user's video stream, then you can match the video based on the video duration. Set the duration of the blank part in the preset video template to ensure that the user's video stream can be integrated into the filtered target video template.

再如，模板筛选条件中包含了用户信息，那么服务端可以根据用户信息中的用户情绪和/或用户年龄，从各预设视频模板中筛选出视频风格与用户情绪相适配的目标视频模板，和/或，从各预设视频模板中筛选出与视频中的人物形象与用户年龄相适配的目标视频模板。For another example, if the template filtering conditions include user information, then the server can filter out the target video templates whose video style matches the user's emotions from each preset video template based on the user's emotion and/or user's age in the user information. , and/or, select a target video template that matches the characters in the video and the user's age from each preset video template.

又如，模板筛选条件中包含了用户操作指令，那么服务端可以根据该用户操作指令对应的用户视角来确定录制视角，并从各预设视频模板中筛选出与录制视角一致的目标视频模板。比如，对于上述三维虚拟讲堂场景的示例，搜集录制过程的用户操作指令，当用户操作指令为用户走到特定区域时，切换到该特定区域对应的录制视角，并切换选择该录制视角对应的预设视频模板，完成视频中的转场。For another example, if the template filtering condition includes a user operation instruction, the server can determine the recording perspective based on the user perspective corresponding to the user operation instruction, and filter out the target video template consistent with the recording perspective from each preset video template. For example, for the above example of the three-dimensional virtual lecture scene, collect user operation instructions for the recording process. When the user operation instruction is that the user walks to a specific area, switch to the recording perspective corresponding to the specific area, and switch to select the preset corresponding to the recording perspective. Set up a video template to complete the transition in the video.

又如，模板筛选条件中包含了播放音频，那么服务端根据该播放音频的音频停顿位置和停顿时长等音频特性，选择具有相同或相似的音频特性的目标视频模板，并且可以在该目标视频模板的相应位置处加入诸如礼花、掌声等特效组件，以优化目标视频模板。For another example, if the template filtering conditions include playing audio, then the server selects a target video template with the same or similar audio characteristics based on the audio pause position and pause duration of the played audio, and can select the target video template in the target video template. Add special effect components such as fireworks and applause at the corresponding positions to optimize the target video template.

S550、融合用户视频流和目标视频模板，生成合成视频流。S550: Fusion of the user video stream and the target video template to generate a synthetic video stream.

具体地，将用户视频流添加至目标视频模板的空白部分，或者将用户视频流嵌入目标视频模板的某一位置处，生成合成视频流。Specifically, the user video stream is added to a blank part of the target video template, or the user video stream is embedded in a certain position of the target video template to generate a synthesized video stream.

在一些实施例中，S550可通过以下步骤A和/或步骤B来实现。In some embodiments, S550 may be implemented through the following step A and/or step B.

步骤A、将用户视频流融合至目标视频模板中的绿幕位置处，生成合成视频流。Step A. Fuse the user video stream to the green screen position in the target video template to generate a synthetic video stream.

具体地，目标视频模板中预先设置了绿幕位置。那么服务端可将该用户视频流嵌入目标视频模板中的绿幕位置处，生成合成视频流。Specifically, the green screen position is preset in the target video template. Then the server can embed the user video stream at the green screen position in the target video template to generate a synthetic video stream.

步骤B、基于目标视频模板中的至少一个预设时间点，确定目标视频模板中的视频合成位置，并将用户视频流融合至目标视频模板中的视频合成位置处，生成合成视频流。Step B: Determine the video synthesis position in the target video template based on at least one preset time point in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

具体地，目标视频模板中可预置至少一个预设时间点，例如片头时间点、片中时间点和片尾时间点，并且每个预设时间点可对应设置一个用于嵌入视频流的位置(即视频合成位置)，例如，片头时间点对应左上角的视频合成位置、片中时间点对应中间的视频合成位置、片尾时间点对应右下角的视频合成位置。服务端在各个时段中，将用户视频流嵌入到相应预设时间点对应的视频合成位置处，生成合成视频流。Specifically, at least one preset time point can be preset in the target video template, such as the beginning time point, the mid-title time point and the end time point, and each preset time point can be set with a corresponding position for embedding the video stream ( That is, the video synthesis position), for example, the beginning time point corresponds to the video synthesis position in the upper left corner, the time point in the film corresponds to the video synthesis position in the middle, and the end time point corresponds to the video synthesis position in the lower right corner. In each time period, the server embeds the user video stream into the video synthesis position corresponding to the corresponding preset time point to generate a synthesized video stream.

本公开实施例提供的上述视频合成方法，根据模板筛选条件从各预设视频模板中确定与目标虚拟场景对应的目标视频模板，并融合用户视频流和目标视频模板，生成合成视频流；实现了通过预置的视频模板来合成用户真实画面和虚拟场景画面，降低了服务端的资源消耗，进一步提高了合成视频流的生成效率。The above video synthesis method provided by the embodiments of the present disclosure determines the target video template corresponding to the target virtual scene from each preset video template according to the template filtering conditions, and fuses the user video stream and the target video template to generate a synthesized video stream; achieved The preset video template is used to synthesize the user's real picture and the virtual scene picture, which reduces the resource consumption of the server and further improves the efficiency of generating the synthesized video stream.

图6是本公开实施例提供的一种视频合成装置的结构示意图。该视频合成装置配置于服务端中。参见图6，该视频合成装置600具体包括：FIG. 6 is a schematic structural diagram of a video synthesis device provided by an embodiment of the present disclosure. The video synthesis device is configured in the server. Referring to Figure 6, the video synthesis device 600 specifically includes:

用户视频流接收模块610，用于接收用户视频流；其中，用户视频流为通过用户终端的摄像头拍摄所得的视频流； The user video stream receiving module 610 is used to receive the user video stream; wherein the user video stream is a video stream captured by the camera of the user terminal;

场景视频流生成模块620，用于利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，目标虚拟场景为用户终端中显示的主题虚拟空间对应的虚拟场景；The scene video stream generation module 620 is used to record the target virtual scene using a target perspective camera that is independent of the user perspective camera, and generate a scene video stream from the target perspective; wherein the target virtual scene is a theme virtual space displayed in the user terminal. Corresponding virtual scene;

第一合成视频流生成模块630，用于融合用户视频流和场景视频流，生成合成视频流。The first synthetic video stream generation module 630 is used to fuse the user video stream and the scene video stream to generate a synthetic video stream.

本公开实施例提供的上述视频合成装置，能够接收通过用户终端的摄像头拍摄所得的用户视频流，以及利用独立于用户视角相机的目标视角相机，对用户终端中显示的主题虚拟空间对应的目标虚拟场景进行录制，生成目标视角下的场景视频流；并且融合用户视频流和场景视频流，生成合成视频流；一方面，实现了在服务端中自动生成合成视频流，避免了人工合成视频存在的费时费力的问题；另一方面，通过服务端录制场景视频流，避免了在用户终端录制场景视频流并上传至服务端的过程中因设备性能和网络等原因而造成的合成视频卡顿的问题，既降低了对用户终端的设备性能和网络的要求，又解决了场景视频流上传慢和丢帧等问题，提高了视频合成的效率以及合成视频流的流畅性；又一方面，通过对目标虚拟场景进行录制而得到场景视频流，提高了合成视频流与目标虚拟场景的内容一致性。The above-mentioned video synthesis device provided by the embodiment of the present disclosure can receive the user video stream captured by the camera of the user terminal, and use a target perspective camera that is independent of the user perspective camera to perform target virtualization corresponding to the theme virtual space displayed in the user terminal. Record the scene and generate the scene video stream from the target perspective; and integrate the user video stream and the scene video stream to generate a synthetic video stream; on the one hand, it realizes the automatic generation of synthetic video streams in the server, avoiding the problem of artificially synthesized videos. Time-consuming and labor-intensive problem; on the other hand, recording the scene video stream through the server avoids the problem of synthetic video lagging caused by device performance and network reasons during the process of recording the scene video stream on the user terminal and uploading it to the server. It not only reduces the requirements for the equipment performance and network of the user terminal, but also solves the problems of slow scene video stream upload and frame loss, and improves the efficiency of video synthesis and the smoothness of the synthesized video stream; on the other hand, by virtualizing the target The scene is recorded to obtain a scene video stream, which improves the content consistency between the synthesized video stream and the target virtual scene.

在一些实施例中，视频合成装置600还包括用户操作指令接收模块，用于：In some embodiments, the video synthesis device 600 further includes a user operation instruction receiving module for:

在融合用户视频流和场景视频流，生成合成视频流之前，接收用户操作指令；Receive user operation instructions before merging the user video stream and the scene video stream to generate a composite video stream;

相应地，场景视频流生成模块620包括：Correspondingly, the scene video stream generation module 620 includes:

动作响应执行子模块，用于在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应；The action response execution submodule is used to execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

场景视频流生成子模块，用于利用目标视角相机，对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流。The scene video stream generation submodule is used to record the target virtual scene using the target perspective camera and generate a scene video stream containing the virtual object's action response.

在一些实施例中，用户视频流中携带第一时间戳，且用户操作指令中携带第二时间戳；In some embodiments, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp;

相应地，视频合成装置600还包括用户操作指令缓存模块，用于：Correspondingly, the video synthesis device 600 also includes a user operation instruction cache module for:

在接收用户操作指令之后，缓存用户操作指令；After receiving the user operation instructions, cache the user operation instructions;

相应地，动作响应执行子模块具体用于：Correspondingly, the action response execution sub-module is specifically used to:

从各用户操作指令中筛选出第二时间戳小于或等于第一时间戳的目标操作指令；Filter out target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction;

在目标虚拟场景中执行目标操作指令对应的虚拟对象动作响应。Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.

在一些实施例中，用户操作指令接收模块具体用于：In some embodiments, the user operation instruction receiving module is specifically used to:

创建虚拟用户，并将虚拟用户关联至主题虚拟空间；Create virtual users and associate the virtual users to the theme virtual space;

从主题虚拟空间中共享用户操作指令。Share user operation instructions from the topic virtual space.

在一些实施例中，目标虚拟场景中包括预置视图；In some embodiments, the target virtual scene includes a preset view;

相应地，第一合成视频流生成模块630具体用于：Correspondingly, the first synthesized video stream generating module 630 is specifically used to:

将用户视频流融合至场景视频流中的预置视图处，生成合成视频流。 Fusion of the user video stream to the preset view in the scene video stream to generate a composite video stream.

在一些实施例中，视频合成装置600还包括：In some embodiments, the video synthesis device 600 further includes:

目标视频模板确定模块，用于在接收用户视频流之后，基于模板筛选条件，从各预设视频模板中确定与目标虚拟场景对应的目标视频模板；其中，模板筛选条件包括用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个，用户信息包括用户情绪和/或用户年龄，且用户信息用于匹配预设视频模板中的人物形象；用户操作指令用于匹配预设视频模板中的录制视角；播放音频用于匹配预设视频模板中的特效组件；The target video template determination module is used to determine the target video template corresponding to the target virtual scene from each preset video template based on template filtering conditions after receiving the user video stream; wherein the template filtering conditions include the video duration of the user video stream , at least one of user information, user operation instructions and playing audio. The user information includes user emotions and/or user age, and the user information is used to match the characters in the preset video template; the user operation instructions are used to match the preset video. The recording perspective in the template; playing audio is used to match the special effects components in the preset video template;

第二合成视频流生成模块，用于融合用户视频流和目标视频模板，生成合成视频流。The second synthetic video stream generation module is used to fuse the user video stream and the target video template to generate a synthetic video stream.

进一步地，第二合成视频流生成模块具体用于：Further, the second synthetic video stream generation module is specifically used to:

将用户视频流融合至目标视频模板中的绿幕位置处，生成合成视频流；Fusion of the user video stream to the green screen position in the target video template to generate a synthetic video stream;

和/或，基于目标视频模板中的至少一个预设时间点，确定目标视频模板中的视频合成位置，并将用户视频流融合至目标视频模板中的视频合成位置处，生成合成视频流。And/or, based on at least one preset time point in the target video template, determine the video synthesis position in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

在一些实施例中，用户视频流接收模块610具体用于：In some embodiments, the user video stream receiving module 610 is specifically used to:

通过实时通信传输协议，从用户终端接收用户视频流。Receive user video streams from user terminals through real-time communication transmission protocols.

在一些实施例中，主题虚拟空间包括线上直播间、虚拟游戏房间或虚拟教育空间。In some embodiments, the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.

本公开实施例所提供的视频合成装置可执行本公开任意实施例所提供的视频合成方法，具备执行方法相应的功能模块和有益效果。The video synthesis device provided by the embodiments of the present disclosure can execute the video synthesis method provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.

值得注意的是，上述视频合成装置的实施例中，所包括的各个模块和子模块只是按照功能逻辑进行划分的，但并不局限于上述的划分，只要能够实现相应的功能即可；另外，各功能模块/子模块的具体名称也只是为了便于相互区分，并不用于限制本公开的保护范围。It is worth noting that in the above embodiments of the video synthesis device, the various modules and sub-modules included are only divided according to functional logic, but are not limited to the above divisions, as long as the corresponding functions can be realized; in addition, each module The specific names of functional modules/sub-modules are only for the convenience of distinguishing each other and are not used to limit the scope of protection of the present disclosure.

本公开示例性实施例还提供一种电子设备，包括：至少一个处理器；以及与至少一个处理器通信连接的存储器。存储器存储有能够被至少一个处理器执行的计算机程序，计算机程序在被至少一个处理器执行时用于使电子设备执行一种视频合成方法，包括：Exemplary embodiments of the present disclosure also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores a computer program that can be executed by at least one processor. When executed by at least one processor, the computer program is used to cause the electronic device to perform a video synthesis method, including:

接收用户视频流；其中，用户视频流为通过用户终端的摄像头拍摄所得的视频流；利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，目标虚拟场景为用户终端中显示的主题虚拟空间对应的虚拟场景；融合用户视频流和场景视频流，生成合成视频流。Receive a user video stream; wherein the user video stream is a video stream captured by a camera of the user terminal; use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective; where , the target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal; the user video stream and the scene video stream are merged to generate a composite video stream.

在本公开的一些实施例中，计算机程序在被至少一个处理器执行时，还用于使电子设备执行：接收用户操作指令；利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流包括：在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应；利用目标视角相机，对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流。In some embodiments of the present disclosure, the computer program, when executed by at least one processor, is also used to cause the electronic device to: receive user operation instructions; record the target virtual scene using a target perspective camera that is independent of the user perspective camera. , Generating a scene video stream from the target perspective includes: executing the virtual object action response corresponding to the user operation instruction in the target virtual scene; using the target perspective camera to record the target virtual scene, and generating a scene video stream containing the virtual object action response.

在本公开的一些实施例中，用户视频流中携带第一时间戳，且用户操作指令中携带第二时间戳；计算机程序在被至少一个处理器执行时，还用于使电子设备执行：缓存用户操作指令；在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应包括：从各用户操作指令中筛选出第二时间戳小于或等于第一时间戳的目标操作指令；在目标虚拟场景中执行目标操作指令对应的虚拟对象动作响应。In some embodiments of the present disclosure, the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp; when the computer program is executed by at least one processor, it is also used to cause the electronic device to execute: caching User operation instructions; the virtual object action response corresponding to the user operation instructions executed in the target virtual scene includes: from each Target operation instructions whose second timestamp is less than or equal to the first timestamp are screened out from the user operation instructions; and a virtual object action response corresponding to the target operation instruction is executed in the target virtual scene.

在本公开的一些实施例中，接收用户操作指令包括：创建虚拟用户，并将虚拟用户关联至主题虚拟空间；从主题虚拟空间中共享用户操作指令。In some embodiments of the present disclosure, receiving user operation instructions includes: creating a virtual user and associating the virtual user to the theme virtual space; and sharing the user operation instructions from the theme virtual space.

在本公开的一些实施例中，目标虚拟场景中包括预置视图；融合用户视频流和场景视频流，生成合成视频流包括：将用户视频流融合至场景视频流中的预置视图处，生成合成视频流。In some embodiments of the present disclosure, the target virtual scene includes a preset view; fusing the user video stream and the scene video stream to generate a synthetic video stream includes: fusing the user video stream to the preset view in the scene video stream to generate Synthetic video stream.

在本公开的一些实施例中，计算机程序在被至少一个处理器执行时，还用于使电子设备执行：基于模板筛选条件，从各预设视频模板中确定与目标虚拟场景对应的目标视频模板；其中，模板筛选条件包括用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个，用户信息包括用户情绪和/或用户年龄，且用户信息用于匹配预设视频模板中的人物形象；用户操作指令用于匹配预设视频模板中的录制视角；播放音频用于匹配预设视频模板中的特效组件；融合用户视频流和目标视频模板，生成合成视频流。In some embodiments of the present disclosure, the computer program, when executed by at least one processor, is also used to cause the electronic device to execute: based on the template filtering conditions, determine a target video template corresponding to the target virtual scene from each preset video template. ; Wherein, the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playback audio, the user information includes user emotions and/or user age, and the user information is used to match the preset video template character image; the user operation instructions are used to match the recording perspective in the preset video template; the audio playback is used to match the special effects components in the preset video template; the user video stream and the target video template are merged to generate a synthetic video stream.

在本公开的一些实施例中，融合用户视频流和目标视频模板，生成合成视频流包括：将用户视频流融合至目标视频模板中的绿幕位置处，生成合成视频流；和/或，基于目标视频模板中的至少一个预设时间点，确定目标视频模板中的视频合成位置，并将用户视频流融合至目标视频模板中的视频合成位置处，生成合成视频流。In some embodiments of the present disclosure, fusing the user video stream and the target video template to generate a synthetic video stream includes: fusing the user video stream to a green screen position in the target video template to generate a synthetic video stream; and/or based on Determine at least one preset time point in the target video template, determine the video synthesis position in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

在本公开的一些实施例中，接收用户视频流包括：通过实时通信传输协议，从用户终端接收用户视频流。In some embodiments of the present disclosure, receiving the user video stream includes: receiving the user video stream from the user terminal through a real-time communication transmission protocol.

在本公开的一些实施例中，主题虚拟空间包括线上直播间、虚拟游戏房间或虚拟教育空间。In some embodiments of the present disclosure, the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.

本公开示例性实施例还提供一种存储有计算机程序的非瞬时计算机可读存储介质，其中，计算机程序在被计算机的处理器执行时用于使计算机执行一种视频合成方法，包括：Exemplary embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of the computer, is used to cause the computer to perform a video synthesis method, including:

在本公开的一些实施例中，计算机程序在被计算机的处理器执行时，还用于使计算机执行：接收用户操作指令；利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流包括：在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应；利用目标视角相机，对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流。In some embodiments of the present disclosure, the computer program, when executed by the processor of the computer, is also used to cause the computer to: receive user operation instructions; record the target virtual scene using a target perspective camera that is independent of the user perspective camera, Generating a scene video stream from the target perspective includes: executing the virtual object action response corresponding to the user operation instruction in the target virtual scene; using the target perspective camera to record the target virtual scene and generating a scene video stream containing the virtual object action response.

在本公开的一些实施例中，用户视频流中携带第一时间戳，且用户操作指令中携带第二时间戳；计算机程序在被计算机的处理器执行时，还用于使计算机执行：缓存用户操作指令；在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应包括：从各用户操作指令中筛选出第二时间戳小于或等于第一时间戳的目标操作指令；在目标虚拟场景中执行目标操作指令对应的虚拟对象动作响应。In some embodiments of the present disclosure, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp; when the computer program is executed by the processor of the computer, it is also used to cause the computer to execute: cache the user Operation instructions; executing the virtual object action response corresponding to the user operation instructions in the target virtual scene includes: filtering out the target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction; executing in the target virtual scene The virtual object action response corresponding to the target operation instruction.

在本公开的一些实施例中，计算机程序在被计算机的处理器执行时，还用于使计算机执行：基于模板筛选条件，从各预设视频模板中确定与目标虚拟场景对应的目标视频模板；其中，模板筛选条件包括用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个，用户信息包括用户情绪和/或用户年龄，且用户信息用于匹配预设视频模板中的人物形象；用户操作指令用于匹配预设视频模板中的录制视角；播放音频用于匹配预设视频模板中的特效组件；融合用户视频流和目标视频模板，生成合成视频流。In some embodiments of the present disclosure, the computer program, when executed by the processor of the computer, is also used to cause the computer to execute: based on the template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template; Wherein, the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playback audio. The user information includes user emotions and/or user age, and the user information is used to match the preset video template. Character image; user operation instructions are used to match the recording perspective in the preset video template; audio playback is used to match the special effects components in the preset video template; the user video stream and the target video template are merged to generate a composite video stream.

在本公开的一些实施例中，融合用户视频流和目标视频模板，生成合成视频流包括：将用户视频流融合至目标视频模板中的绿幕位置处，生成合成视频流；和/或，基于目标视频模板中的至少一个预设时间点，确定目标视频模板中的视频合成位置，并将用户视频流融合至目标视频模板中的视频合成位置处，生成合成视频流。In some embodiments of the present disclosure, fusing the user video stream and the target video template to generate a synthetic video stream includes: fusing the user video stream to a green screen position in the target video template to generate a synthetic video stream; and/or, based on Determine at least one preset time point in the target video template, determine the video synthesis position in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

本公开示例性实施例还提供一种计算机程序产品，包括计算机程序，其中，计算机程序在被计算机的处理器执行时用于使计算机执行本公开任意实施例所说明的视频合成方法。Exemplary embodiments of the present disclosure also provide a computer program product, including a computer program, wherein the computer program, when executed by a processor of the computer, is used to cause the computer to execute the video synthesis method described in any embodiment of the present disclosure.

参考图7，现将描述可以作为本公开的服务器或客户端的电子设备700的结构框图，其是可以应用于本公开的各方面的硬件设备的示例。电子设备旨在表示各种形式的数字电子的计算机设备，诸如，膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置，诸如，个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例，并且不意在限制本文中描述的和/或者要求的本公开的实现。Referring to FIG. 7 , a structural block diagram of an electronic device 700 that may serve as a server or client of the present disclosure will now be described, which is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to refer to various forms of digital electronic computing equipment, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

如图7所示，电子设备700包括计算单元701，其可以根据存储在只读存储器(ROM)702中的计算机程序或者从存储单元708加载到随机访问存储器(RAM)703中的计算机程序，来执行各种适当的动作和处理。在RAM 703中，还可存储设备700操作所需的各种程序和数据。计算单元701、ROM 702以及RAM 703通过总线704彼此相连。输入/输出(I/O)接口705也连接至总线704。As shown in FIG. 7 , the electronic device 700 includes a computing unit 701 that can perform calculations according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703 . Perform various appropriate actions and processing. In the RAM 703, various data required for the operation of the device 700 can also be stored. programs and data. Computing unit 701, ROM 702 and RAM 703 are connected to each other via bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

电子设备700中的多个部件连接至I/O接口705，包括：输入单元706、输出单元707、存储单元708以及通信单元709。输入单元706可以是能向电子设备700输入信息的任何类型的设备，输入单元706可以接收输入的数字或字符信息，以及产生与电子设备的用户设置和/或功能控制有关的键信号输入。输出单元707可以是能呈现信息的任何类型的设备，并且可以包括但不限于显示器、扬声器、视频/音频输出终端、振动器和/或打印机。存储单元708可以包括但不限于磁盘、光盘。通信单元709允许电子设备700通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据，并且可以包括但不限于调制解调器、网卡、红外通信设备、无线通信收发机和/或芯片组，例如蓝牙TM设备、WiFi设备、WiMax设备、蜂窝通信设备和/或类似物。Multiple components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700. The input unit 706 may receive input numeric or character information and generate key signal input related to user settings and/or function control of the electronic device. Output unit 707 may be any type of device capable of presenting information, and may include, but is not limited to, a display, speakers, video/audio output terminal, vibrator, and/or printer. The storage unit 708 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver and/or a chip Groups such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communications devices and/or the like.

计算单元701可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元701的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的计算单元、数字信号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。计算单元701执行上文所描述的各个方法和处理。例如，在一些实施例中，本公开任意实施例所说明的视频合成方法可被实现为计算机软件程序，其被有形地包含于机器可读介质，例如存储单元707。在一些实施例中，计算机程序的部分或者全部可以经由ROM 702和/或通信单元709而被载入和/或安装到电子设备700上。在一些实施例中，计算单元701可以通过其他任何适当的方式(例如，借助于固件)而被配置为执行本公开任意实施例所说明的视频合成方法。Computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processing processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above. For example, in some embodiments, the video synthesis method described in any embodiment of the present disclosure may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 707. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. In some embodiments, the computing unit 701 may be configured in any other suitable manner (eg, by means of firmware) to perform the video synthesis method described in any embodiment of the present disclosure.

用于实施本公开的方法的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器，使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可以完全在机器上执行、部分地在机器上执行，作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions specified in the flowcharts and/or block diagrams/ The operation is implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

在本公开的上下文中，机器可读介质可以是有形的介质，其可以包含或存储以供指令执行***、装置或设备使用或与指令执行***、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体***、装置或设备，或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, laptop disks, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.

如本公开使用的，术语“机器可读介质”和“计算机可读介质”指的是用于将机器指令和/或数据提供给可编程处理器的任何计算机程序产品、设备、和/或装置(例如，磁盘、光盘、存储器、可编程逻辑装置(PLD))，包括，接收作为机器可读信号的机器指令的机器可读介质。术语“机器可读信号”指的是用于将机器指令和/或数据提供给可编程处理器的任何信号。As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or means for providing machine instructions and/or data to a programmable processor (For example, magnetic Disk, optical disk, memory, programmable logic device (PLD)), including, machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

为了提供与用户的交互，可以在计算机上实施此处描述的***和技术，该计算机具有：用于向用户显示信息的显示装置(例如，CRT(阴极射线管)或者LCD(液晶显示器)监视器)；以及键盘和指向装置(例如，鼠标或者轨迹球)，用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈(例如，视觉反馈、听觉反馈、或者触觉反馈)；并且可以用任何形式(包括声输入、语音输入或者、触觉输入)来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (eg, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user ); and a keyboard and pointing device (eg, a mouse or a trackball) through which a user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and may be provided in any form, including Acoustic input, voice input or tactile input) to receive input from the user.

可以将此处描述的***和技术实施在包括后台部件的计算***(例如，作为数据服务器)、或者包括中间件部件的计算***(例如，应用服务器)、或者包括前端部件的计算***(例如，具有图形用户界面或者网络浏览器的用户计算机，用户可以通过该图形用户界面或者该网络浏览器来与此处描述的***和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算***中。可以通过任何形式或者介质的数字数据通信(例如，通信网络)来将***的部件相互连接。通信网络的示例包括：局域网(LAN)、广域网(WAN)和互联网。The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., A user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and technologies described herein), or including such backend components, middleware components, or any combination of front-end components in a computing system. The components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.

计算机***可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。 Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.

Claims

一种视频合成方法，应用于服务端，包括：A video synthesis method, applied to the server, including:

接收用户视频流；其中，所述用户视频流为通过用户终端的摄像头拍摄所得的视频流；Receive a user video stream; wherein the user video stream is a video stream captured by a camera of a user terminal;

利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，所述目标虚拟场景为所述用户终端中显示的主题虚拟空间对应的虚拟场景；Using a target perspective camera that is independent of the user perspective camera, record the target virtual scene and generate a scene video stream from the target perspective; wherein the target virtual scene is a virtual scene corresponding to the theme virtual space displayed in the user terminal;

融合所述用户视频流和所述场景视频流，生成合成视频流。The user video stream and the scene video stream are fused to generate a composite video stream.
根据权利要求1所述的方法，其中，在所述融合所述用户视频流和所述场景视频流，生成合成视频流之前，所述方法还包括：The method according to claim 1, wherein before fusing the user video stream and the scene video stream to generate a composite video stream, the method further includes:

接收用户操作指令；Receive user operation instructions;

所述利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流包括：The use of a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate the scene video stream from the target perspective includes:

在所述目标虚拟场景中执行所述用户操作指令对应的虚拟对象动作响应；Execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

利用所述目标视角相机，对所述目标虚拟场景进行录制，生成包含所述虚拟对象动作响应的所述场景视频流。The target virtual scene is recorded using the target perspective camera, and the scene video stream containing the action response of the virtual object is generated.
根据权利要求2所述的方法，其中，所述用户视频流中携带第一时间戳，且所述用户操作指令中携带第二时间戳；The method according to claim 2, wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp;

在所述接收用户操作指令之后，所述方法还包括：After receiving the user operation instruction, the method further includes:

缓存所述用户操作指令；Cache the user operation instructions;

所述在所述目标虚拟场景中执行所述用户操作指令对应的虚拟对象动作响应包括：The virtual object action response corresponding to the execution of the user operation instruction in the target virtual scene includes:

从各所述用户操作指令中筛选出所述第二时间戳小于或等于所述第一时间戳的目标操作指令；Filter out target operation instructions whose second timestamp is less than or equal to the first timestamp from each of the user operation instructions;

在所述目标虚拟场景中执行所述目标操作指令对应的所述虚拟对象动作响应。Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.
根据权利要求2或3所述的方法，其中，所述接收用户操作指令包括：The method according to claim 2 or 3, wherein receiving user operation instructions includes:

创建虚拟用户，并将所述虚拟用户关联至所述主题虚拟空间；Create a virtual user and associate the virtual user to the theme virtual space;

从所述主题虚拟空间中共享所述用户操作指令。The user operation instructions are shared from the theme virtual space.
根据权利要求1-4中任一项所述的方法，其中，所述目标虚拟场景中包括预置视图；The method according to any one of claims 1-4, wherein the target virtual scene includes a preset view;

所述融合所述用户视频流和所述场景视频流，生成合成视频流包括：The fusing of the user video stream and the scene video stream to generate a composite video stream includes:

将所述用户视频流融合至所述场景视频流中的所述预置视图处，生成所述合成视频流。The user video stream is merged into the preset view in the scene video stream to generate the composite video stream.
根据权利要求1-5中任一项所述的方法，其中，在所述接收用户视频流之后，所述方法还包括：The method according to any one of claims 1-5, wherein after receiving the user video stream, the method further includes:

基于模板筛选条件，从各预设视频模板中确定与所述目标虚拟场景对应的目标视频模板；其中，所述模板筛选条件包括所述用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个，所述用户信息包括用户情绪和/或用户年龄，且所述用户信息用于匹配预设视频模板中的人物形象；所述用户操作指令用于匹配所述预设视频模板中的录制视角；所述播放音频用于匹配所述预设视频模板中的特效组件；Based on the template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template; wherein the template filtering conditions include the video duration of the user video stream, user information, and user operations At least one of instructions and playing audio, the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instruction is used to match the preset video template. Assume the recording angle in the video template; the playback audio is used to match the special effects components in the preset video template;

融合所述用户视频流和所述目标视频模板，生成所述合成视频流。The user video stream and the target video template are fused to generate the synthesized video stream.
根据权利要求6所述的方法，其中，所述融合所述用户视频流和所述目标视频模板，生成所述合成视频流包括：The method according to claim 6, wherein said fusing the user video stream and the target video template to generate the synthetic video stream includes:

将所述用户视频流融合至所述目标视频模板中的绿幕位置处，生成所述合成视频流；Fusion of the user video stream to the green screen position in the target video template to generate the synthetic video stream;

和/或，and / or,

基于所述目标视频模板中的至少一个预设时间点，确定所述目标视频模板中的视频合成位置，并将所述用户视频流融合至所述目标视频模板中的所述视频合成位置处，生成所述合成视频流。determining a video synthesis position in the target video template based on at least one preset time point in the target video template, and fusing the user video stream to the video synthesis position in the target video template, Generate the composite video stream.
根据权利要求1-7中任一项所述的方法，其中，所述接收用户视频流包括：The method according to any one of claims 1-7, wherein receiving the user video stream includes:

通过实时通信传输协议，从所述用户终端接收所述用户视频流。The user video stream is received from the user terminal through a real-time communication transmission protocol.
根据权利要求1-8中任一项所述的方法，其中，所述主题虚拟空间包括线上直播间、虚拟游戏房间或虚拟教育空间。The method according to any one of claims 1 to 8, wherein the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.
一种视频合成装置，配置于服务端，包括：A video synthesis device, configured on the server side, including:

用户视频流接收模块，用于接收用户视频流；其中，所述用户视频流为通过用户终端的摄像头拍摄所得的视频流；A user video stream receiving module, configured to receive a user video stream; wherein the user video stream is a video stream captured by a camera of a user terminal;

场景视频流生成模块，用于利用独立于用户视角相机的目标视角相机，对目标虚拟场景进行录制，生成目标视角下的场景视频流；其中，所述目标虚拟场景为所述用户终端中显示的主题虚拟空间对应的虚拟场景；A scene video stream generation module, configured to use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective; wherein the target virtual scene is the one displayed in the user terminal. The virtual scene corresponding to the theme virtual space;

第一合成视频流生成模块，用于融合所述用户视频流和所述场景视频流，生成合成视频流。The first synthetic video stream generation module is used to fuse the user video stream and the scene video stream to generate a synthetic video stream.
根据权利要求10所述的装置，其中，所述装置还包括：The device of claim 10, wherein the device further comprises:

用户操作指令接收模块，用于接收用户操作指令；A user operation instruction receiving module is used to receive user operation instructions;

其中，所述场景视频流生成模块包括：Wherein, the scene video stream generation module includes:

动作响应执行子模块，用于在目标虚拟场景中执行用户操作指令对应的虚拟对象动作响应；The action response execution submodule is used to execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

场景视频流生成子模块，用于利用目标视角相机，对目标虚拟场景进行录制，生成包含虚拟对象动作响应的场景视频流。The scene video stream generation submodule is used to record the target virtual scene using the target perspective camera and generate a scene video stream containing the virtual object's action response.
根据权利要求11所述的装置，其中，所述用户视频流中携带第一时间戳，且所述用户操作指令中携带第二时间戳，所述装置还包括：The device according to claim 11, wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp, and the device further includes:

用户操作指令缓存模块，用于在缓存用户操作指令；The user operation instruction cache module is used to cache user operation instructions;

所述动作响应执行子模块还用于：The action response execution sub-module is also used to:

从各所述用户操作指令中筛选出所述第二时间戳小于或等于所述第一时间戳的目标操作指令； Filter out target operation instructions whose second timestamp is less than or equal to the first timestamp from each of the user operation instructions;

在所述目标虚拟场景中执行所述目标操作指令对应的所述虚拟对象动作响应。Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.
根据权利要求11或12所述的装置，其中，所述用户操作指令接收模块用于：The device according to claim 11 or 12, wherein the user operation instruction receiving module is used for:

创建虚拟用户，并将所述虚拟用户关联至所述主题虚拟空间；Create a virtual user and associate the virtual user to the theme virtual space;

从所述主题虚拟空间中共享所述用户操作指令。The user operation instructions are shared from the theme virtual space.
根据权利要求10-13中任一项所述的装置，其中，所述目标虚拟场景中包括预置视图，所述第一合成视频流生成模块还用于：The device according to any one of claims 10-13, wherein the target virtual scene includes a preset view, and the first synthetic video stream generation module is also used to:

将所述用户视频流融合至所述场景视频流中的所述预置视图处，生成所述合成视频流。The user video stream is merged into the preset view in the scene video stream to generate the composite video stream.
根据权利要求10-14中任一项所述的装置，其中，所述装置还包括：The device according to any one of claims 10-14, wherein the device further includes:

目标视频模板确定模块，用于基于模板筛选条件，从各预设视频模板中确定与所述目标虚拟场景对应的目标视频模板；其中，所述模板筛选条件包括所述用户视频流的视频时长、用户信息、用户操作指令和播放音频中的至少一个，所述用户信息包括用户情绪和/或用户年龄，且所述用户信息用于匹配预设视频模板中的人物形象；所述用户操作指令用于匹配所述预设视频模板中的录制视角；所述播放音频用于匹配所述预设视频模板中的特效组件；A target video template determination module, configured to determine a target video template corresponding to the target virtual scene from each preset video template based on template filtering conditions; wherein the template filtering conditions include the video duration of the user video stream, At least one of user information, user operation instructions and playing audio, the user information includes user emotions and/or user age, and the user information is used to match characters in preset video templates; the user operation instructions are To match the recording angle in the preset video template; the playback audio is used to match the special effects components in the preset video template;

第二合成视频流生成模块，用于融合所述用户视频流和所述目标视频模板，生成所述合成视频流。The second synthetic video stream generation module is used to fuse the user video stream and the target video template to generate the synthetic video stream.
根据权利要求15所述的装置，其中，所述第二合成视频流生成模块还用于：The device according to claim 15, wherein the second synthesized video stream generating module is further configured to:

将所述用户视频流融合至所述目标视频模板中的绿幕位置处，生成所述合成视频流；Fusion of the user video stream to the green screen position in the target video template to generate the synthetic video stream;

和/或，and / or,

基于所述目标视频模板中的至少一个预设时间点，确定所述目标视频模板中的视频合成位置，并将所述用户视频流融合至所述目标视频模板中的所述视频合成位置处，生成所述合成视频流。determining a video synthesis position in the target video template based on at least one preset time point in the target video template, and fusing the user video stream to the video synthesis position in the target video template, Generate the composite video stream.
根据权利要求10-16中任一项所述的装置，其中，所述用户视频流接收模块还用于：The device according to any one of claims 10-16, wherein the user video stream receiving module is also used for:

通过实时通信传输协议，从所述用户终端接收所述用户视频流。The user video stream is received from the user terminal through a real-time communication transmission protocol.
根据权利要求10-17中任一项所述的装置，其中，所述主题虚拟空间包括线上直播间、虚拟游戏房间或虚拟教育空间。The device according to any one of claims 10 to 17, wherein the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.
一种电子设备，包括：An electronic device including:

处理器；以及processor; and

存储程序的存储器，memory for storing programs,

其中，所述程序包括指令，所述指令在由所述处理器执行时使所述处理器执行根据权利要求1-9中任一项所述的视频合成方法。Wherein, the program includes instructions that, when executed by the processor, cause the processor to execute the video synthesis method according to any one of claims 1-9.
一种存储有计算机指令的非瞬时计算机可读存储介质，所述计算机指令用于使所述计算机执行根据权利要求1-9中任一项所述的视频合成方法。 A non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the video synthesis method according to any one of claims 1-9.