TW202230291A - Direct clothing modeling for a drivable full-body avatar - Google Patents

Direct clothing modeling for a drivable full-body avatar

Info

Publication number
TW202230291A
Authority
TW
Taiwan
Prior art keywords
dimensional
mesh
garment
individual
clothing
Application number
TW111103481A
Other languages
Chinese (zh)
Inventor
Fabian Andres Prada Nino
Chenglei Wu
Timur Bagautdinov
Weipeng Xu
Jessica Hodgins
Donglai Xiang
Original Assignee
Facebook Technologies, LLC
Priority claimed from US Application No. 17/576,787 (published as US 2022/0237879 A1)
Application filed by Facebook Technologies, LLC
Publication of TW202230291A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/04 Texture mapping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method for training a real-time, direct clothing model for animating an avatar of a subject is provided. The method includes collecting multiple images of a subject, forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject, and aligning the three-dimensional clothing mesh to the three-dimensional body mesh to form a skin-clothing boundary and a garment texture. The method also includes determining a loss factor based on a predicted cloth position and garment texture and an interpolated position and garment texture from the images of the subject, and updating a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor. A system and a non-transitory, computer-readable medium storing instructions to cause the system to execute the above method are also provided.

Description

Direct clothing modeling for drivable full-body avatars

The present disclosure generally relates to the field of generating three-dimensional computer models of subjects captured on video. More specifically, the present disclosure relates to accurately rendering a person in three dimensions, including the person's clothing, from a video sequence in real time.

This disclosure is related to, and claims priority under 35 U.S.C. §119(e) from, U.S. Provisional Application No. 63/142,460, entitled "EXPLICIT CLOTHING MODELING FOR A DRIVABLE FULL-BODY AVATAR," filed on January 27, 2021 by Xiang et al., and U.S. Non-Provisional Application No. 17/576,787, filed on January 14, 2022, the contents of which are hereby incorporated by reference in their entirety for all purposes.

Animatable, photorealistic digital humans are a key component for enabling social telepresence, with the potential to open up new ways for people to connect that are unconstrained by space and time. Taking driving signals from commodity sensors as input, a model needs to generate high-fidelity deformed geometry and realistic texture, not only for the body but also for the clothing that moves in response to the body's motion. Techniques for modeling the body and for modeling clothing have largely been developed separately. Body modeling has focused mainly on geometry; it can produce convincing geometric surfaces but not realistic rendered results. Clothing modeling is an even more challenging topic, even for geometry alone. Much of the progress there has been in simulation for physical plausibility, without any constraint of faithfulness to real data. This gap is due, at least in part, to the challenge of capturing three-dimensional (3D) clothing from real-world data. Even with recent data-driven methods based on neural networks, it is difficult to animate clothing so that it renders realistically.

In a first embodiment, a computer-implemented method includes collecting multiple images of a subject, the images including one or more different views of the subject. The computer-implemented method also includes forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject, aligning the three-dimensional clothing mesh with the three-dimensional body mesh to form a skin-clothing boundary and a garment texture, determining a loss factor based on a predicted clothing position and garment texture and an interpolated position and garment texture from the images of the subject, and updating a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor.

In a second embodiment, a system includes a memory storing multiple instructions and one or more processors configured to execute the instructions to cause the system to perform operations. The operations include collecting multiple images of a subject, the images including one or more views of different profiles of the subject, forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject, and aligning the three-dimensional clothing mesh with the three-dimensional body mesh to form a skin-clothing boundary and a garment texture. The operations also include determining a loss factor based on a predicted clothing position and texture and an interpolated position and texture from the images of the subject, and updating a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor, wherein collecting the multiple images of the subject includes capturing the images with a synchronized, multi-camera system.

In a third embodiment, a computer-implemented method includes collecting an image from a subject and selecting multiple two-dimensional keypoints from the image. The computer-implemented method also includes identifying, from the image, a three-dimensional keypoint associated with each two-dimensional keypoint, and determining, with a three-dimensional model, a three-dimensional clothing mesh and a three-dimensional body mesh, the three-dimensional clothing mesh and the three-dimensional body mesh being anchored in one or more three-dimensional skeletal poses. The computer-implemented method also includes generating a three-dimensional representation of the subject that includes the three-dimensional clothing mesh, the three-dimensional body mesh, and a texture, and embedding the three-dimensional representation of the subject in a virtual reality environment in real time.

In another embodiment, a non-transitory, computer-readable medium stores instructions which, when executed by a processor, cause a computer to perform a method. The method includes collecting multiple images of a subject, the images including one or more different views of the subject, forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject, and aligning the three-dimensional clothing mesh with the three-dimensional body mesh to form a skin-clothing boundary and a garment texture. The method also includes determining a loss factor based on a predicted clothing position and garment texture and an interpolated position and garment texture from the images of the subject, and updating a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor.

In yet another embodiment, a system includes a means for storing instructions and a means for executing the instructions to perform a method, the method including collecting multiple images of a subject, the images including one or more different views of the subject, forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject, and aligning the three-dimensional clothing mesh with the three-dimensional body mesh to form a skin-clothing boundary and a garment texture. The method also includes determining a loss factor based on a predicted clothing position and garment texture and an interpolated position and garment texture from the images of the subject, and updating a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor.

In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one ordinarily skilled in the art that embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail so as not to obscure the disclosure.

General Overview

A real-time system is provided for high-fidelity, three-dimensional animation, including clothing, from binocular video. The system can track the motion and reshaping of the clothing (e.g., under different lighting conditions) as it adapts to the subject's body motion. Using a deep generative model to jointly model both geometry and texture is an effective way to achieve high-fidelity facial avatars. However, rendering a clothed body with a deep generative model presents challenges. Because of larger deformations, more occlusions, and a constantly changing boundary between clothing and body, it is challenging to obtain temporally coherent body meshes and temporally coherent clothing meshes from multi-view body data. Furthermore, because of large variations in body pose and the dynamically changing state of the clothing, network structures used for faces cannot be directly applied to modeling a clothed body.

Thus, direct clothing modeling means that embodiments as disclosed herein generate the three-dimensional mesh associated with the subject's clothing, including shape and garment texture, separately from the three-dimensional body mesh. Accordingly, the model can adjust, change, and modify the avatar's clothing and garments as desired for any immersive-reality environment, without losing a lifelike rendition of the subject.

To address these technical problems arising in the fields of computer networks, computer simulation, and immersive-reality applications, embodiments as disclosed herein represent the body and the clothing as separate meshes and include a new framework, from capture to modeling, for producing a deep generative model. This deep generative model is fully animatable and editable with respect to the direct body and clothing representations.

In some embodiments, a geometry-based registration method aligns the body and clothing surfaces with a template, with direct constraints between body and clothing. In addition, some embodiments include a photometric tracking method with inverse rendering to align the garment texture to a reference and to produce accurate, temporally coherent meshes for learning. With the two-layer meshes as input, some embodiments include a variational autoencoder that models the body and the clothing separately in a canonical pose. The model learns the interaction between pose and clothing through a temporal model, for example a temporal convolutional network (TCN), to infer the clothing state from a sequence of body poses used as the driving signal. The temporal model acts as a data-driven simulation machine that evolves the clothing state consistently with the movement of the body state. Direct modeling of the clothing enables editing of the clothed body model, for example by changing the garment texture, which opens up the possibility of changing the clothing on the avatar and, therefore, of virtual try-on.

More specifically, embodiments as disclosed herein include a two-layer codec avatar model for photorealistic full-body telepresence that renders clothing appearance more expressively in a three-dimensional reproduction of a video subject. The avatar has a sharper skin-clothing boundary, a sharper garment texture, and more robust occlusion handling. In addition, an avatar model as disclosed herein includes a photometric tracking algorithm that aligns salient clothing textures, enabling direct editing and manipulation of the avatar's clothing, independently of body movements, poses, and gestures. A two-layer codec avatar model as disclosed herein can be used for high-quality, photorealistic, pose-driven animation of the avatar and for editing of the garment texture.

Example System Architecture

FIG. 1 illustrates an example architecture 100 suitable for accessing a model training engine, according to some embodiments. Architecture 100 includes servers 130 communicatively coupled with client devices 110 and at least one database 152 over a network 150. One of the servers 130 is configured to host a memory including instructions which, when executed by a processor, cause the server 130 to perform at least some of the steps in methods as disclosed herein. In some embodiments, the processor is configured to control a graphical user interface (GUI) for a user of one of the client devices 110 accessing the model training engine. The model training engine may be configured to train a machine learning model for solving a specific application. Accordingly, the processor may include a dashboard tool configured to display components and graphical results to the user via the GUI. For load-balancing purposes, multiple servers 130 can host memories including instructions to one or more processors, and multiple servers 130 can host a history log and the database 152 including multiple training archives used for the model training engine. Moreover, in some embodiments, multiple users of client devices 110 may access the same model training engine to run one or more machine learning models. In some embodiments, a single user with a single client device 110 may train multiple machine learning models running in parallel in one or more servers 130. Accordingly, client devices 110 may communicate with each other via network 150 and through access to one or more servers 130 and resources located therein.

Servers 130 may include any device having an appropriate processor, memory, and communications capability for hosting the model training engine, including the multiple tools associated with it. The model training engine may be accessible by various clients 110 over network 150. A client 110 may be, for example, a desktop computer, a mobile computer, a tablet computer (e.g., including an e-book reader), a mobile device (e.g., a smartphone or PDA), or any other device having appropriate processor, memory, and communications capabilities for accessing the model training engine on one or more of servers 130. Network 150 can include, for example, any one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, network 150 can include, but is not limited to, any one or more of the following network topologies: a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like.

FIG. 2 is a block diagram 200 illustrating an example server 130 and client device 110 from architecture 100, according to certain aspects of the present disclosure. Client device 110 and server 130 are communicatively coupled over network 150 via respective communications modules 218-1 and 218-2 (hereinafter, collectively referred to as "communications modules 218"). Communications modules 218 are configured to interface with network 150 to send and receive information, such as data, requests, responses, and commands, to other devices via network 150. Communications modules 218 may be, for example, modems or Ethernet cards. A user may interact with client device 110 via an input device 214 and an output device 216. Input device 214 may include a mouse, a keyboard, a pointer, a touchscreen, a microphone, and the like. Output device 216 may be a screen display, a touchscreen, a speaker, and the like. Client device 110 may include a memory 220-1 and a processor 212-1. Memory 220-1 may include an application 222 and a GUI 225, configured to run in client device 110 and coupled with input device 214 and output device 216. Application 222 may be downloaded by the user from server 130 and may be hosted by server 130.

Server 130 includes a memory 220-2, a processor 212-2, and a communications module 218-2. Hereinafter, processors 212-1 and 212-2, and memories 220-1 and 220-2, will be collectively referred to as "processors 212" and "memories 220," respectively. Processors 212 are configured to execute instructions stored in memories 220. In some embodiments, memory 220-2 includes a model training engine 232. Model training engine 232 may share or provide features and resources to GUI 225, including multiple tools associated with training and using a three-dimensional avatar rendering model for immersive-reality applications. The user may access model training engine 232 through GUI 225 installed in memory 220-1 of client device 110. Accordingly, GUI 225 may be installed by server 130 and may perform scripts and other routines provided by server 130 through any one of multiple tools. Execution of GUI 225 may be controlled by processor 212-1.

In that regard, model training engine 232 may be configured to create, store, update, and maintain a real-time, direct clothing animation model 240, as disclosed herein. Clothing animation model 240 may include encoders, decoders, and tools such as a body decoder 242, a clothing decoder 244, a segmentation tool 246, and a temporal convolution tool 248. In some embodiments, model training engine 232 may access one or more machine learning models stored in a training database 252. Training database 252 includes training archives and other data files that may be used by model training engine 232 in training of a machine learning model, according to the input of the user through GUI 225. Moreover, in some embodiments, at least one or more training archives or machine learning models may be stored in either one of memories 220, and the user may have access to them through GUI 225.

Body decoder 242 determines a skeletal pose based on input images of the subject and adds a skinned mesh with surface deformations to the skeletal pose, according to a classification scheme learned through training. Clothing decoder 244 determines the three-dimensional clothing mesh through a geometry branch, to define shape. In some embodiments, clothing decoder 244 may also determine the garment texture using a texture branch in the decoder. Segmentation tool 246 includes a clothing segmentation layer and a body segmentation layer. Segmentation tool 246 provides clothing segments and body segments to enable alignment of the three-dimensional clothing mesh with the three-dimensional body mesh. Temporal convolution tool 248 performs the temporal modeling for pose-driven animation of the real-time avatar model, as disclosed herein. Accordingly, temporal convolution tool 248 includes a temporal encoder that makes multiple skeletal poses of the subject (e.g., concatenated within a pre-selected time window) coherent with the three-dimensional clothing mesh.

Model training engine 232 may include algorithms trained for the specific purposes of the engines and tools included therein. The algorithms may include machine learning or artificial intelligence algorithms making use of any linear or non-linear algorithm, such as a neural network algorithm or a multivariate regression algorithm. In some embodiments, the machine learning model may include a neural network (NN), a convolutional neural network (CNN), a generative adversarial network (GAN), a deep reinforcement learning (DRL) algorithm, a deep recurrent neural network (DRNN), a classic machine learning algorithm such as a random forest, a k-nearest neighbor (KNN) algorithm, a k-means clustering algorithm, or any combination thereof. More generally, the machine learning model may include any machine learning model involving a training step and an optimization step. In some embodiments, training database 252 may include a training archive to modify coefficients according to a desired outcome of the machine learning model. Accordingly, in some embodiments, model training engine 232 is configured to access training database 252 to retrieve documents and archives as inputs for the machine learning model. In some embodiments, model training engine 232, the tools contained therein, and at least part of training database 252 may be hosted in a different server that is accessible by server 130.

FIG. 3 illustrates a clothed-body pipeline 300, according to some embodiments. Raw images 301 are collected (e.g., via a camera or video device), and a data pre-processing step 302 renders a 3D reconstruction 342, including keypoints 344 and a segmentation rendition 346. Images 301 may include multiple images or frames in a video sequence, or from multiple video sequences collected from one or more cameras oriented to form a multi-directional view ("multi-view") of the subject 303.

A single-layer surface tracking (SLST) operation 304 identifies a mesh 354. The SLST operation 304 non-rigidly registers the reconstructed mesh 354 using a kinematic body model. In some embodiments, the kinematic body model includes $N_J = 159$ joints, $N_V = 614{,}118$ vertices, and predefined linear blend skinning (LBS) weights for all vertices. The LBS function $W(\cdot,\cdot)$ is a transformation that deforms the mesh 354 consistently with the skeletal structure: it takes rest-pose vertices and joint angles as input and outputs the vertices in the target pose. The SLST operation 304 estimates a personalized model by computing a rest-state shape $\bar{V}$ that best fits a set of manually selected peak poses. Then, for each frame $i$, a set of joint angles $\theta_i$ is estimated such that the skinned model $V^{\mathrm{lbs}}_i = W(\bar{V}, \theta_i)$ has minimum distance to the mesh 354 and the keypoints 344. The SLST operation 304 uses $V^{\mathrm{lbs}}_i$ as an initialization and minimizes a geometric correspondence error together with a Laplacian regularization to compute per-frame vertex offsets that register the mesh 354. The mesh 354 is combined with the segmentation rendition 346 to form a segmented mesh 356 in a mesh segmentation 306. An inner-layer shape estimation (ILSE) operation 308 produces the body mesh 321-1.
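For illustration, the following is a minimal NumPy sketch of the LBS transformation $W(\cdot,\cdot)$ described above; it assumes the per-joint rigid transforms for the target pose have already been computed from the joint angles by forward kinematics, and the function and argument names are illustrative rather than taken from the disclosure:

```python
import numpy as np

def linear_blend_skinning(rest_vertices, joint_transforms, skin_weights):
    """W(rest vertices, pose): deform rest-pose vertices with the skeleton.

    rest_vertices:    (V, 3) rest-pose vertex positions
    joint_transforms: (J, 4, 4) rigid transform of each joint for the target pose
    skin_weights:     (V, J) predefined LBS weights; each row sums to 1
    returns:          (V, 3) posed vertex positions
    """
    num_verts = rest_vertices.shape[0]
    homo = np.concatenate([rest_vertices, np.ones((num_verts, 1))], axis=1)   # (V, 4)
    # Transform every vertex by every joint, then blend with the skinning weights.
    per_joint = np.einsum('jab,vb->vja', joint_transforms, homo)              # (V, J, 4)
    blended = np.einsum('vj,vja->va', skin_weights, per_joint)                # (V, 4)
    return blended[:, :3]
```

Per-frame vertex offsets on top of this skinned prediction then account for the residual, non-rigid deformation that the registration recovers.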

For each image 301 in the sequence, the pipeline 300 uses the segmented mesh 356 to identify the target region of the upper clothing. In some embodiments, the segmented mesh 356 is combined with a clothing template 364 (e.g., including a specific garment texture, color, pattern, and the like) to form the clothing mesh 321-2 in a clothing registration 310. Hereinafter, the body mesh 321-1 and the clothing mesh 321-2 will be collectively referred to as "meshes 321." The clothing registration 310 deforms the clothing template 364 to match the target clothing mesh. In some embodiments, to generate the clothing template 364, the pipeline 300 selects (e.g., manually or automatically) one frame in the SLST operation 304 and uses the upper-clothing region identified in the mesh segmentation 306 to produce the clothing template 364. The pipeline 300 generates a map for the clothing template 364 in 2D UV coordinates. Accordingly, each vertex in the clothing template 364 is associated with a vertex from the body mesh 321-1 and can be skinned using the model $V^{\mathrm{lbs}}$. The pipeline 300 reuses the triangulation of the body mesh 321-1 to generate the topology of the clothing template 364.

To provide a better initialization of the deformation, the clothing registration 310 may apply a biharmonic deformation field to find a per-vertex deformation that aligns the boundary of the clothing template 364 with the boundary of the target clothing mesh, while keeping the interior distortion as low as possible. This allows the shape of the clothing template 364 to converge to a better local minimum.

ILSE 308 includes estimating the invisible body regions covered by the upper clothing, as well as any other visible body regions (e.g., not covered by clothing) that can be obtained directly from the body mesh 321-1. In some embodiments, ILSE 308 estimates the underlying body shape from a series of 3D scans of clothed humans.

ILSE 308 produces a cross-frame, inner-layer body template $V_t$ of the subject based on a sample of 30 images 301 from the captured sequence, and fuses the full-body tracked surfaces of these frames, un-posed to the rest pose, into a single shape $V_{Fu}$. In some embodiments, ILSE 308 uses the following properties of the fused shape $V_{Fu}$: (1) all upper-clothing vertices in $V_{Fu}$ should lie outside the inner-layer body shape $V_t$; and (2) vertices in $V_{Fu}$ that do not belong to the upper-clothing region should be close to $V_t$. ILSE 308 solves for $V_t$ through the following optimization:

$V_t = \arg\min \; \lambda_{\mathrm{out}} E^{t}_{\mathrm{out}} + \lambda_{\mathrm{fit}} E^{t}_{\mathrm{fit}} + \lambda_{\mathrm{vis}} E^{t}_{\mathrm{vis}} + \lambda_{\mathrm{cpl}} E^{t}_{\mathrm{cpl}} + \lambda_{\mathrm{lpl}} E^{t}_{\mathrm{lpl}}$   (1)

Specifically, $E^{t}_{\mathrm{out}}$ penalizes any upper-clothing vertex of $V_{Fu}$ that lies inside $V_t$, with an amount determined from:

$E^{t}_{\mathrm{out}} = \sum_{k} s_k \,\max\!\big(-d(v_k, V_t),\, 0\big)^{2}$   (2)

where $d(\cdot,\cdot)$ is the signed distance from a vertex $v_k$ to the surface of $V_t$: it takes a positive value if $v_k$ lies outside $V_t$ and a negative value if $v_k$ lies inside. The coefficient $s_k$ is provided by the mesh segmentation 306: $s_k$ takes the value 1 if $v_k$ is labeled as upper clothing, and the value 0 otherwise. To avoid an excessively thin inner layer, $E^{t}_{\mathrm{fit}}$ penalizes an excessive distance between $V_{Fu}$ and $V_t$, as in:

$E^{t}_{\mathrm{fit}} = \sum_{k} s_k \, d(v_k, V_t)^{2}$   (3)

where the weight of this term is smaller than that of the "out" term. In some embodiments, the vertices of $V_{Fu}$ with $s_k = 0$ should lie very close to the visible regions of $V_t$. This constraint is enforced by $E^{t}_{\mathrm{vis}}$:

$E^{t}_{\mathrm{vis}} = \sum_{k} (1 - s_k) \, d(v_k, V_t)^{2}$   (4)

Additionally, to regularize the inner-layer template, ILSE 308 imposes a coupling term and a Laplacian term. The topology of our inner-layer template is not compatible with the SMPL model topology, so we cannot use the SMPL body shape space for regularization. Instead, our coupling term $E^{t}_{\mathrm{cpl}}$ enforces similarity between $V_t$ and the body mesh 321-1, and the Laplacian term $E^{t}_{\mathrm{lpl}}$ penalizes large Laplacian values in the estimated inner-layer template $V_t$. In some embodiments, ILSE 308 may use the following loss weights: $\lambda_{\mathrm{out}} = 1.0$, $\lambda_{\mathrm{fit}} = 0.03$, $\lambda_{\mathrm{vis}} = 1.0$, $\lambda_{\mathrm{cpl}} = 500.0$, and $\lambda_{\mathrm{lpl}} = 10000.0$.
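For illustration, the following is a minimal sketch of how the data terms of equations (1)-(4) could be evaluated. It assumes a signed-distance query to the inner-layer surface (positive outside, negative inside) and the squared-hinge form given above; both the query and the exact functional form are assumptions, not a verbatim implementation of the disclosure:

```python
import numpy as np

def ilse_data_terms(fused_vertices, clothing_labels, signed_dist_to_inner):
    """Per-vertex penalties behind equations (2)-(4).

    fused_vertices:       (V, 3) vertices of the fused shape V_Fu
    clothing_labels:      (V,) 1 for vertices labeled upper clothing, 0 otherwise
    signed_dist_to_inner: callable mapping (V, 3) -> (V,) signed distance to the
                          inner-layer surface V_t (positive outside, negative inside)
    """
    d = signed_dist_to_inner(fused_vertices)                 # (V,)
    s = clothing_labels.astype(np.float64)
    e_out = np.sum(s * np.maximum(-d, 0.0) ** 2)             # clothing vertex inside the body
    e_fit = np.sum(s * d ** 2)                               # clothing too far from the body
    e_vis = np.sum((1.0 - s) * d ** 2)                       # exposed skin should lie on the body
    return e_out, e_fit, e_vis

# Weighted as in the text: 1.0 * e_out + 0.03 * e_fit + 1.0 * e_vis, plus the
# coupling (500.0) and Laplacian (10000.0) regularizers on the template itself.
```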

ILSE 308 obtains the body model (e.g., body mesh 321-1) in the rest pose. This template represents the average body shape under the upper clothing, together with the lower-body shape under the pants and the various exposed skin regions, such as the face, arms, and hands. The rest pose serves as a strong prior for estimating the frame-specific, inner-layer body shape. ILSE 308 then produces individual pose estimates for the other frames in the sequence of images 301. For each frame, the rest-pose shape is combined with the clothing mesh 356 to form the body mesh 321-1, which allows us to render the full-body appearance of the person. For this purpose, the body mesh 321-1 needs to lie completely under the clothing in the segmented mesh 356, with no intersection between the two layers. For each frame $i$ in the sequence of images 301, ILSE 308 estimates an inner-layer shape $V_i$ in the rest pose. ILSE 308 uses the LBS function $W(V_i, \theta_i)$ to transform $V_i$ into the target pose. ILSE 308 then solves the following optimization:

$V_i = \arg\min \; \lambda_{\mathrm{cloth}} E^{i}_{\mathrm{cloth}} + \lambda_{\mathrm{vis}} E^{i}_{\mathrm{vis}} + \lambda_{\mathrm{cpl}} E^{i}_{\mathrm{cpl}}$   (5)

The two-layer arrangement favors keeping the mesh 354 inside the upper clothing. Accordingly, ILSE 308 introduces a minimum distance $\varepsilon$ (e.g., about 1 cm) that any vertex in the upper clothing should keep from the inner-layer shape:

$E^{i}_{\mathrm{cloth}} = \sum_{k} s_k \,\max\!\big(\varepsilon - d(v_k, W(V_i, \theta_i)),\, 0\big)^{2}$   (6)

where $s_k$ denotes the segmentation result for vertex $v_k$ of the posed mesh $V^{\mathrm{lbs}}_i$, which has the value 1 for vertices in the upper clothing and the value 0 otherwise. Similarly, for the directly visible regions of the inner layer (not covered by clothing):

$E^{i}_{\mathrm{vis}} = \sum_{k} (1 - s_k) \, d(v_k, W(V_i, \theta_i))^{2}$   (7)

ILSE 308 also couples the frame-specific, rest-pose shape with the body mesh 321-1 to exploit the strong prior encoded in the template:

$E^{i}_{\mathrm{cpl}} = \sum_{e} \big\lVert e(V_i) - e(V_t) \big\rVert^{2}$   (8)

where the subscript $e$ indicates that the coupling is performed over the corresponding edges of the two meshes 321-1 and 321-2. In some embodiments, equation (5) may be implemented with the following loss weights: $\lambda_{\mathrm{cloth}} = 1.0$, $\lambda_{\mathrm{vis}} = 1.0$, and $\lambda_{\mathrm{cpl}} = 500.0$. The solution of equation (5) provides, for every frame in the sequence, an estimate of the body mesh 321-1 in the registered topology. The inner-layer mesh 321-1 and the outer-layer mesh 321-2 serve as the avatar model of the subject. Additionally, for each frame in the sequence, the pipeline 300 extracts frame-specific UV textures for the meshes 321 from the multi-view images 301 captured by the camera system. The geometry and texture of the two meshes 321 are used to train the two-layer codec avatar as disclosed herein.

FIG. 4 illustrates network elements and operational blocks 400A, 400B, and 400C (hereinafter, collectively referred to as "blocks 400") used in architecture 100 and pipeline 300, according to some embodiments. A data tensor 402 has tensor dimensions n × H × W, where "n" is the number of input images or frames (e.g., images 301), and H and W are the height and width of the frames. Convolution operations 404, 408, and 410 are two-dimensional operations, which typically act on the 2D dimensions (H and W) of the image frames. Leaky ReLU (LReLU) operations 406 and 412 are applied between each of the convolution operations 404, 408, and 410.

Block 400A is a down-conversion block, in which an input tensor 402 of size n × H × W becomes an output tensor 414A of size out × H/2 × W/2.

Block 400B is an up-conversion block, in which, after an up-sampling operation 403C, an input tensor 402 of size n × H × W becomes an output tensor 414B of size out × 2H × 2W.

Block 400C is a convolution block that maintains the 2D dimensions of the input tensor 402 but may change the number of frames (and their content). The output tensor 414C has size out × H × W.
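For illustration, the following is a minimal PyTorch-style sketch of the three block types. The patent does not name a framework, and the kernel sizes, strides, and exact number of convolutions per block are assumptions; only the input/output tensor shapes and the use of 2D convolutions interleaved with leaky ReLU follow the description above:

```python
import torch.nn as nn

def down_block(n_in, n_out):
    # Block 400A: n x H x W -> n_out x H/2 x W/2
    return nn.Sequential(
        nn.Conv2d(n_in, n_out, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(n_out, n_out, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(n_out, n_out, kernel_size=4, stride=2, padding=1),  # halves H and W
    )

def up_block(n_in, n_out):
    # Block 400B: n x H x W -> n_out x 2H x 2W (up-sample, then convolve)
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(n_in, n_out, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(n_out, n_out, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
    )

def conv_block(n_in, n_out):
    # Block 400C: keeps H x W, changes only the number of channels
    return nn.Conv2d(n_in, n_out, kernel_size=3, padding=1)
```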

FIGS. 5A-5D illustrate an encoder 500A, decoders 500B and 500C, and a shadow network 500D architecture (hereinafter, collectively referred to as "architectures 500") for a real-time, clothed-subject animation model, according to some embodiments.

Encoder 500A includes an input tensor 501A-1 and down-conversion blocks 503A-1, 503A-2, 503A-3, 503A-4, 503A-5, 503A-6, and 503A-7 (hereinafter, collectively referred to as "down-conversion blocks 503A") acting on tensors 502A-1, 504A-1, 504A-2, 504A-3, 504A-4, 504A-5, 504A-6, and 504A-7, respectively. Convolution blocks 505A-1 and 505A-2 (hereinafter, collectively referred to as "convolution blocks 505A") convert tensor 504A-7 into a tensor 506A-1 and a tensor 506A-2 (hereinafter, collectively referred to as "tensors 506A"). Tensors 506A are combined into a latent code 507A-1 and a noise block 507A-2 (hereinafter, collectively referred to as "encoder outputs 507A"). Note that, in the specific example illustrated, encoder 500A takes an input tensor 501A-1 including, for example, 8 image frames with a pixel size of 1024 × 1024, and produces encoder outputs 507A having 128 frames of size 8 × 8.

Decoder 500B includes convolution blocks 502B-1 and 502B-2 (hereinafter, collectively referred to as "convolution blocks 502B") acting on an input tensor 501B to form a tensor 502B-3. Up-conversion blocks 503B-1, 503B-2, 503B-3, 503B-4, 503B-5, and 503B-6 (hereinafter, collectively referred to as "up-conversion blocks 503B") act on tensors 504B-1, 504B-2, 504B-3, 504B-4, 504B-5, and 504B-6 (hereinafter, collectively referred to as "tensors 504B"). A convolution 505B acting on tensor 504B-6 produces a texture tensor 506B and a geometry tensor 507B.
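For illustration, the following is a minimal sketch of how a decoder of this kind could be composed from the up-conversion block sketched above, mapping an 8 × 8 latent grid to UV-space texture and geometry maps through two convolution blocks and six up-conversion blocks. Channel widths are assumptions, and the view-conditioning inputs used by decoders 500B and 500C are omitted for brevity:

```python
import torch
import torch.nn as nn

def up_block(n_in, n_out):
    # Same up-conversion block as sketched above (Block 400B).
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(n_in, n_out, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2),
    )

class TwoBranchDecoder(nn.Module):
    """Latent code -> UV-space texture map and geometry (position) map."""
    def __init__(self, z_channels=128, base=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(z_channels, base, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base, 3, padding=1), nn.LeakyReLU(0.2),
        )
        chans = [base, 256, 128, 64, 32, 16, 8]
        self.ups = nn.Sequential(*[up_block(chans[i], chans[i + 1]) for i in range(6)])
        self.to_texture = nn.Conv2d(chans[-1], 3, 3, padding=1)   # RGB texture map
        self.to_geometry = nn.Conv2d(chans[-1], 3, 3, padding=1)  # xyz position map

    def forward(self, z):
        h = self.ups(self.stem(z))   # six doublings: 8x8 latent grid -> 512x512 feature map
        return self.to_texture(h), self.to_geometry(h)

# e.g. a latent tensor of shape (1, 128, 8, 8) decodes to 512x512 texture and position maps
tex, geo = TwoBranchDecoder()(torch.zeros(1, 128, 8, 8))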

Decoder 500C includes a convolution block 502C-1 acting on an input tensor 501C to form a tensor 502C-2. Up-conversion blocks 503C-1, 503C-2, 503C-3, 503C-4, 503C-5, and 503C-6 (hereinafter, collectively referred to as "up-conversion blocks 503C") act on tensors 502C-2, 504C-1, 504C-2, 504C-3, 504C-4, 504C-5, and 504C-6 (hereinafter, collectively referred to as "tensors 504C"). A convolution 505C acting on tensors 504C produces a texture tensor 506C.

Shadow network 500D includes convolution blocks 504D-1 through 504D-9 (hereinafter, collectively referred to as "convolution blocks 504D"), which act on tensors 503D-1 through 503D-9 (hereinafter, collectively referred to as "tensors 503D") after down-sampling operations 502D-1 and 502D-2 and up-sampling operations 502D-3, 502D-4, 502D-5, 502D-6, and 502D-7 (hereinafter, collectively referred to as "up/down-sampling operations 502D"), and after LReLU operations 505D-1 through 505D-6 (hereinafter, collectively referred to as "LReLU operations 505D"). At different stages along shadow network 500D, skip (concatenation) connections 510-1, 510-2, and 510-3 (hereinafter, collectively referred to as "skip connections 510") join tensor 503D-2 to tensor 503D-8, tensor 503D-3 to tensor 503D-7, and tensor 503D-4 to tensor 503D-6. The output of shadow network 500D is a shadow map 511.

FIGS. 6A-6B illustrate the architecture of a body network 600A and a clothing network 600B (hereinafter, collectively referred to as "networks 600") for a real-time, clothed-subject animation model, according to some embodiments. Once the clothing is decoupled from the body, the skeletal pose and the facial keypoints contain sufficient information to describe the body state (including relatively tight pants).

Body network 600A takes a skeletal pose 601A-1, facial keypoints 601A-2, and a view conditioning 601A-3 (hereinafter, collectively referred to as "inputs 601A") as inputs to up-conversion blocks 603A-1 (view-independent) and 603A-2 (view-dependent), hereinafter collectively referred to as "decoders 603A," which produce a 2D UV-coordinate map 604A-1 of the un-posed geometry, a body mean-view texture 604A-2, a body residual texture 604A-3, and a body ambient occlusion 604A-4. The body mean-view texture 604A-2 is blended with the body residual texture 604A-3 to produce, as output, a body texture 607A-1. The LBS transformation is then applied to the un-posed mesh recovered from the UV map, together with a shadow network 605A (cf. shadow network 500D), to produce the final output mesh 607A-2. The loss function for training the body network is defined as:

$\mathcal{L}_{\mathrm{body}} = \big\lVert \hat{V} - V \big\rVert^{2} + \big\lVert L(\hat{V}) - L(V) \big\rVert^{2} + \big\lVert M \odot (\hat{T} - T) \big\rVert^{2}$   (9)

where $\hat{V}$ is the set of vertex positions interpolated in UV coordinates from the predicted position map, and $V$ is the set of vertices from the inner-layer registration; $L(\cdot)$ is the Laplacian operator; $\hat{T}$ is the predicted texture; $T$ is the per-view reconstructed texture; and $M$ is a mask indicating the valid UV region.
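For illustration, the following is a minimal sketch of how the loss of equation (9) could be computed. It assumes a uniform mesh Laplacian and equal weighting of the three terms; the Laplacian discretization and any relative weights are not specified in the text:

```python
import torch
import torch.nn.functional as F

def mesh_laplacian(verts, neighbor_idx):
    # Uniform Laplacian: vertex minus the mean of its (padded) one-ring neighbors.
    return verts - verts[neighbor_idx].mean(dim=1)

def body_loss(pred_verts, reg_verts, pred_tex, view_tex, uv_mask, neighbor_idx):
    """Equation (9): position term + Laplacian term + masked texture term."""
    geo = F.mse_loss(pred_verts, reg_verts)
    lap = F.mse_loss(mesh_laplacian(pred_verts, neighbor_idx),
                     mesh_laplacian(reg_verts, neighbor_idx))
    tex = F.mse_loss(pred_tex * uv_mask, view_tex * uv_mask)
    return geo + lap + tex
```

The clothing loss of equation (10) below has the same three reconstruction terms plus the KL divergence term of the variational autoencoder.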

Clothing network 600B includes a conditional variational autoencoder (cVAE) 603B-1, which takes an un-posed clothing geometry 601B-1 and a mean-view texture 601B-2 (hereinafter, collectively referred to as "clothing inputs 601B") as input and produces the parameters of a Gaussian distribution, from which a latent code 604B-1 ($z$) is sampled in a block 604B-2 to form a latent conditioning tensor 604B-3. In addition to the latent conditioning tensor 604B-3, cVAE 603B-1 produces a spatially varying view-conditioning tensor 604B-4 as input to a view-independent decoder 605B-1 and a view-dependent decoder 605B-2, and predicts a clothing geometry 606B-1, a clothing texture 606B-2, and a clothing residual texture 606B-3. The training loss can be described as:

$\mathcal{L}_{\mathrm{cloth}} = \big\lVert \hat{V}_c - V_c \big\rVert^{2} + \big\lVert L(\hat{V}_c) - L(V_c) \big\rVert^{2} + \big\lVert M_c \odot (\hat{T}_c - T_c) \big\rVert^{2} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}$   (10)

where $\hat{V}_c$ is the set of vertex positions of the clothing geometry 606B-1 interpolated in UV coordinates from the predicted position map, and $V_c$ is the set of corresponding registered vertices; $L(\cdot)$ is the Laplacian operator; $\hat{T}_c$ is the predicted texture 606B-2; $T_c$ is the per-view reconstructed texture 608B-1; $M_c$ is a mask indicating the valid UV region; and $\mathcal{L}_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence loss. A shadow network 605B (cf. shadow networks 500D and 605A) uses a clothing template 606B-4 to form a clothing shadow map 608B-2.

FIG. 7 illustrates texture-editing results of a two-layer model for providing real-time animation of a clothed subject, according to some embodiments. Avatars 721A-1, 721A-2, and 721A-3 (hereinafter, collectively referred to as "avatars 721A") correspond to three different poses of subject 303 and use a first set of garments 764A. Avatars 721B-1, 721B-2, and 721B-3 (hereinafter, collectively referred to as "avatars 721B") correspond to three different poses of subject 303 and use a second set of garments 764B. Avatars 721C-1, 721C-2, and 721C-3 (hereinafter, collectively referred to as "avatars 721C") correspond to three different poses of subject 303 and use a third set of garments 764C. Avatars 721D-1, 721D-2, and 721D-3 (hereinafter, collectively referred to as "avatars 721D") correspond to three different poses of subject 303 and use a fourth set of garments 764D.

FIG. 8 illustrates an inverse-rendering-based photometric alignment method 800, according to some embodiments. Method 800 corrects correspondence errors in the registered body and clothing meshes (e.g., meshes 321), which significantly improves decoder quality, especially for dynamic clothing. Method 800 is a network training stage that links, in a differentiable manner, the predicted geometry (e.g., body geometry 604A-1 and clothing geometry 606B-1) and texture (e.g., body texture 604A-2 and clothing texture 606B-2) to the input multi-view images (e.g., images 301). To this end, method 800 jointly trains the body and clothing networks (e.g., networks 600), including a VAE 803A and, after an initialization 815, a VAE 803B (hereinafter, collectively referred to as "VAEs 803"). VAEs 803 render their outputs with a differentiable renderer. In some embodiments, method 800 uses the following loss function:

$\mathcal{L} = \big\lVert I_R - I_C \big\rVert + \big\lVert M_R - M_C \big\rVert + \mathcal{L}_{\mathrm{lpl}} + \mathcal{L}_{\mathrm{vis}}$   (11)

where $I_R$ and $I_C$ are the rendered and captured images, $M_R$ and $M_C$ are the rendered and captured foreground masks, and $\mathcal{L}_{\mathrm{lpl}}$ is the Laplacian geometry loss (cf. equations 9 and 10). $\mathcal{L}_{\mathrm{vis}}$ is a soft visibility loss that handles the depth reasoning between the body and the clothing, so that gradients can be back-propagated to correct the depth order. In detail, we define the soft visibility of a given pixel as:

$v = \sigma\!\left(\frac{d_{b} - d_{c}}{\gamma}\right)$   (12)

where $\sigma(\cdot)$ is the sigmoid function, $d_{c}$ and $d_{b}$ are the depths rendered from the current viewpoint for the clothing layer and the body layer, respectively, and $\gamma$ is a scaling constant. The soft visibility loss is then defined as:

$\mathcal{L}_{\mathrm{vis}} = \sum_{p\,:\,S_p > 0.5} (1 - v_p)^{2}$   (13)

where, when the 2D clothing segmentation value $S_p$ exceeds 0.5, the current pixel $p$ is assigned to the clothing; otherwise, its contribution to $\mathcal{L}_{\mathrm{vis}}$ is set to 0.
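For illustration, the following is a minimal sketch of the soft visibility term of equations (12)-(13). The squared penalty and the per-pixel assignment rule follow the reconstruction above and are assumptions rather than a verbatim implementation:

```python
import torch

def soft_visibility_loss(depth_cloth, depth_body, cloth_seg, gamma=1.0):
    """Soft visibility loss, equations (12)-(13).

    depth_cloth, depth_body: (H, W) depths rendered from the current viewpoint
                             for the clothing and body layers
    cloth_seg:               (H, W) 2D clothing segmentation values in [0, 1]
    gamma:                   scaling constant inside the sigmoid
    """
    # v -> 1 when the clothing layer lies in front of the body layer at this pixel
    v = torch.sigmoid((depth_body - depth_cloth) / gamma)          # Eq. (12)
    cloth_pixels = (cloth_seg > 0.5).float()
    # Penalize clothing-labeled pixels where the body is rendered in front; the
    # back-propagated gradient pushes the surfaces into the correct depth order.
    return (cloth_pixels * (1.0 - v) ** 2).sum()                   # Eq. (13)
```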

In some embodiments, method 800 may improve the photometric correspondence by predicting textures with less variation across frames, together with deformed geometry, so that the rendered output is aligned with the ground-truth images. In some embodiments, method 800 uses the inverse-rendering loss (cf. equations 11-13) to train VAEs 803 and, at the same time, to correct the correspondences while producing the generative model used to drive the real-time animation. To find a good minimum, method 800 ideally avoids large changes in the photometric correspondence of the initial meshes 821. In addition, method 800 ideally avoids having VAEs 803 adjust the view-dependent texture to compensate for geometric deviations, an adjustment that can produce artifacts.

To address the above challenges, method 800 divides the input anchor frames (A) 811A-1 through 811A-n (hereinafter, collectively referred to as "input anchor frames 811A") into chunks (B) of 50 adjacent frames: input chunk frames 811B-1 through 811B-n (hereinafter, collectively referred to as "input chunk frames 811B"). Method 800 uses input anchor frames 811A to train VAE 803A and obtain aligned anchor frames 813A-1 through 813A-n (hereinafter, collectively referred to as "aligned anchor frames 813A"), and uses chunk frames 811B to train VAE 803B and obtain aligned chunk frames 813B-1 through 813B-n (hereinafter, collectively referred to as "aligned chunk frames 813B"). In some embodiments, method 800 selects the first chunk 811B-1 as the anchor frames 811A-1 and trains VAEs 803 on this chunk. After convergence, the trained network parameters initialize the training of the other chunks (B). To avoid drift of the alignment between a chunk B and the anchor frames A, method 800 may set a small learning rate (e.g., 0.0001 for the optimizer) and mix the anchor frames A into every other chunk B during training. In some embodiments, method 800 uses a single texture prediction for inverse rendering in one or more, or all, of the multiple views of the subject. Compared to input anchor frames 811A and input chunk frames 811B, aligned anchor frames 813A and aligned chunk frames 813B (hereinafter, collectively referred to as "aligned frames 813") have more consistent correspondences across frames. In some embodiments, the aligned meshes 825 can be used to train the body and clothing networks (cf. networks 600).
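For illustration, the following is a minimal sketch of this chunked training schedule. The chunk size, learning rate, warm start, and anchor mixing follow the text; `init_vae` and `train_vae` are placeholder callables standing in for the model construction and inverse-rendering optimization described elsewhere:

```python
import random

CHUNK_SIZE = 50          # adjacent frames per chunk, from the text
LEARNING_RATE = 1e-4     # small rate to limit drift away from the anchor alignment

def make_chunks(frames):
    return [frames[i:i + CHUNK_SIZE] for i in range(0, len(frames), CHUNK_SIZE)]

def train_with_anchor(frames, init_vae, train_vae):
    """Train one VAE per chunk, each warm-started from the anchor chunk's VAE.

    init_vae(init_from=None) builds a VAE, optionally copying trained parameters;
    train_vae(vae, batch, lr) runs the inverse-rendering optimization on a batch.
    """
    chunks = make_chunks(frames)
    anchor = chunks[0]                                   # first chunk doubles as the anchor frames

    anchor_vae = init_vae()
    train_vae(anchor_vae, anchor, lr=LEARNING_RATE)      # train to convergence on the anchor

    aligned = [anchor_vae]
    for chunk in chunks[1:]:
        vae = init_vae(init_from=anchor_vae)             # initialize from the trained parameters
        mixed = chunk + random.sample(anchor, k=min(len(anchor), len(chunk)))
        train_vae(vae, mixed, lr=LEARNING_RATE)          # mix anchor frames into the chunk
        aligned.append(vae)
    return aligned
```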

方法800將光度損耗（參見方程式11-13）應用於可微分呈現器820A以分別自初始網格821A-1至821A-n（在下文中，被集體地稱作「初始網格821A」）獲得經對準網格825A-1至825A-n（在下文中，被集體地稱作「經對準網格825A」）。獨立於VAE 803A初始化單獨VAE 803B。方法800使用輸入組塊圖框811B來訓練VAE 803B以獲得經對準組塊圖框813B。方法800將相同損耗函數（參見方程式11-13）應用於可微分呈現器820B以分別自初始網格821B-1至821B-n（在下文中，被集體地稱作「初始網格821B」）獲得經對準網格825B-1至825B-n（在下文中，被集體地稱作「經對準網格825B」）。Method 800 applies the photometric loss (see Equations 11-13) to the differentiable renderer 820A to obtain aligned meshes 825A-1 through 825A-n (hereinafter collectively referred to as "aligned meshes 825A") from the initial meshes 821A-1 through 821A-n (hereinafter collectively referred to as "initial meshes 821A"), respectively. A separate VAE 803B is initialized independently of VAE 803A. Method 800 uses the input chunk frames 811B to train VAE 803B to obtain the aligned chunk frames 813B. Method 800 applies the same loss function (see Equations 11-13) to the differentiable renderer 820B to obtain aligned meshes 825B-1 through 825B-n (hereinafter collectively referred to as "aligned meshes 825B") from the initial meshes 821B-1 through 821B-n (hereinafter collectively referred to as "initial meshes 821B"), respectively.
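The per-frame inverse-rendering update may be sketched, under similar assumptions, as follows; `render` stands in for the differentiable renderer 820, and the frame object is assumed to expose multi-view RGB images and foreground masks. None of these names are taken from the present disclosure.

```python
import torch

def photometric_alignment_step(vae, frame, cameras, render, optimizer):
    """One inverse-rendering step: pull the decoded mesh/texture toward the captured images."""
    mesh, texture = vae(frame)                       # initial mesh 821 -> predicted mesh and texture
    loss = torch.zeros(())
    for camera, rgb, mask in zip(cameras, frame.images, frame.masks):
        rendered = render(mesh, texture, camera)     # differentiable renderer 820
        loss = loss + ((rendered - rgb).abs() * mask).mean()   # photometric term
    optimizer.zero_grad()
    loss.backward()                                  # gradients update geometry and texture
    optimizer.step()
    return float(loss)
```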

在像素標記為「服裝」，但自此視點看身體層在服裝層之上時，軟可視性損耗將反向傳播資訊以更新表面，直至實現校正深度次序。在此逆呈現階段中，吾人亦使用在給定環境遮擋圖之情況下計算身體及服裝之準陰影圖的陰影網路。在一些實施例中，方法800可在LBS轉型之後用身體模板近似環境遮擋。在一些實施例中，方法800可使用來自身體及服裝解碼器之輸出幾何形狀來計算準確環境遮擋，以對比可自身體變形上之LBS函數搜集之更詳細的服裝變形進行模型化。在應用可微分呈現器820之前，接著將準陰影圖與視圖相關紋理相乘。When a pixel is labeled "clothing" but, from the current viewpoint, the body layer lies above the clothing layer, the soft visibility loss back-propagates information to update the surfaces until the correct depth order is achieved. In this inverse-rendering stage, we also use a shadow network that computes quasi-shadow maps of the body and clothing given the ambient occlusion map. In some embodiments, method 800 may approximate the ambient occlusion with the body template after the LBS transformation. In some embodiments, method 800 may use the output geometry from the body and clothing decoders to compute an accurate ambient occlusion, so as to model clothing deformation in more detail than can be captured by the LBS function on the body deformation. The quasi-shadow maps are then multiplied with the view-dependent textures before the differentiable renderer 820 is applied.
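A hedged sketch of a soft visibility term consistent with the description above is given below; the sigmoid form and the scaling constant k are assumptions, since the exact Equation (13) is available only as an image in the source.

```python
import torch

def soft_visibility_loss(cloth_depth, body_depth, cloth_seg, k=50.0):
    """cloth_depth, body_depth: rendered per-pixel depths; cloth_seg: 2D clothing segmentation."""
    target = (cloth_seg > 0.5).float()               # 1 where the pixel is assigned to the garment
    # soft indicator that the clothing layer is in front of the body layer at this pixel
    soft_visible = torch.sigmoid(k * (body_depth - cloth_depth))
    # pixels labeled "clothing" but occluded by the body receive gradients that
    # push the surfaces back into the correct depth order
    return ((soft_visible - target) ** 2).mean()
```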

圖9說明根據一些實施例的在不同姿態A、B及C（例如，姿態之時間序列）下在單層神經網路模型921A-1、921B-1及921C-1（在下文中，被集體地稱作「單層模型921-1」）與雙層神經網路模型921A-2、921B-2及921C-2（在下文中，被集體地稱作「雙層模型921-2」）之間的個體之即時三維穿著衣服之模型900之比較。網路模型921包括身體輸出942A-1、942B-1及942C-1（在下文中，被集體地稱作「單層身體輸出942-1」）以及身體輸出942A-2、942B-2及942C-2（在下文中，被集體地稱作「身體輸出942-2」）。網路模型921亦分別包括服裝輸出944A-1、944B-1及944C-1（在下文中，被集體地稱作「單層服裝輸出944-1」）以及服裝輸出944A-2、944B-2及944C-2（在下文中，被集體地稱作「雙層服裝輸出944-2」）。FIG. 9 illustrates a comparison of real-time three-dimensional clothed models 900 of an individual between single-layer neural network models 921A-1, 921B-1, and 921C-1 (hereinafter collectively referred to as "single-layer models 921-1") and two-layer neural network models 921A-2, 921B-2, and 921C-2 (hereinafter collectively referred to as "two-layer models 921-2") under different poses A, B, and C (e.g., a time series of poses), according to some embodiments. The network models 921 include body outputs 942A-1, 942B-1, and 942C-1 (hereinafter collectively referred to as "single-layer body outputs 942-1") and body outputs 942A-2, 942B-2, and 942C-2 (hereinafter collectively referred to as "body outputs 942-2"). The network models 921 also include clothing outputs 944A-1, 944B-1, and 944C-1 (hereinafter collectively referred to as "single-layer clothing outputs 944-1") and clothing outputs 944A-2, 944B-2, and 944C-2 (hereinafter collectively referred to as "two-layer clothing outputs 944-2"), respectively.

The two-layer body output 942-2 is conditioned on a single frame of skeletal pose and face keypoints, and the two-layer clothing output 944-2 is determined by an implicit code. To animate the clothing across frames A, B, and C, model 900 includes a temporal convolutional network (TCN) to learn the correlation between body dynamics and clothing deformation. The TCN accepts a time series of skeletal poses (e.g., A, B, and C) and infers the implicit clothing state. The TCN takes as input the joint angles over a window of L frames leading up to the target frame and passes them through several one-dimensional (1D) temporal convolution layers to predict the clothing implicit code for the current frame C (e.g., two-layer clothing output 944C-2). To train the TCN, model 900 minimizes the loss function of Equation (14) (which appears only as an equation image in the source), penalizing the deviation of the predicted clothing implicit code from z_c,

其中zc為自經訓練服裝VAE（例如，cVAE 603B-1）獲得之地面實況隱性程式碼。在一些實施例中，模型900調節對不僅先前身體狀態而且還先前服裝狀態之預測。因此，需要先前圖框（例如，姿態A及B）中之服裝頂點位置及速度來計算當前服裝狀態（姿態C）。在一些實施例中，至TCN之輸入為不包括先前服裝狀態之骨架姿態之時間窗。在一些實施例中，模型900包括TCN之訓練損耗，以確保經預測服裝不與身體相交。在一些實施例中，模型900解析雙層身體輸出942-2與雙層服裝輸出944-2之間的相交，作為後處理步驟。在一些實施例中，模型900將相交雙層服裝輸出944-2投影回至雙層身體輸出942-2之表面上，其中在正常身體方向上具有額外容限。此操作將解決大部分相交假影，並且確保雙層服裝輸出944-2及雙層身體輸出942-2處於正確深度次序以用於呈現。相交解析問題之實例可見於姿態B之部分944B-2及946B-2以及姿態C中之部分944C-2及946C-2中。藉由比較，姿態B之部分944B-1及946B-1以及姿態C中之部分944C-1及946C-1展示身體輸出942B-1（942C-1）與服裝輸出944B-1（944C-1）之間的相交及混合假影。where z_c is the ground-truth implicit code obtained from the trained clothing VAE (e.g., cVAE 603B-1). In some embodiments, model 900 conditions the prediction not only on the previous body state but also on the previous clothing state. Therefore, the clothing vertex positions and velocities in previous frames (e.g., poses A and B) are needed to compute the current clothing state (pose C). In some embodiments, the input to the TCN is a time window of skeletal poses that does not include previous clothing states. In some embodiments, model 900 includes a training loss for the TCN to ensure that the predicted clothing does not intersect the body. In some embodiments, model 900 resolves the intersection between the two-layer body output 942-2 and the two-layer clothing output 944-2 as a post-processing step. In some embodiments, model 900 projects the intersecting two-layer clothing output 944-2 back onto the surface of the two-layer body output 942-2 with an additional margin along the body normal direction. This operation resolves most of the intersection artifacts and ensures that the two-layer clothing output 944-2 and the two-layer body output 942-2 are in the correct depth order for rendering. Examples of the resolved intersections can be seen in portions 944B-2 and 946B-2 in pose B and portions 944C-2 and 946C-2 in pose C. By comparison, portions 944B-1 and 946B-1 in pose B and portions 944C-1 and 946C-1 in pose C show intersection and blending artifacts between the body outputs 942B-1 (942C-1) and the clothing outputs 944B-1 (944C-1).
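The following sketch illustrates, under stated assumptions, (i) a TCN that maps a window of joint angles to a clothing implicit code, trained with a squared distance to the ground-truth code z_c in the spirit of Equation (14), and (ii) the post-processing that projects intersecting clothing vertices back outside the body with a margin along the body normal. Layer sizes, the window length, and the margin value are illustrative only and are not taken from the present disclosure.

```python
import torch
import torch.nn as nn

class ClothingTCN(nn.Module):
    def __init__(self, joint_dim, code_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(joint_dim, 256, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(256, code_dim)

    def forward(self, joint_angles):                 # joint_angles: (batch, joint_dim, L)
        feat = self.net(joint_angles).squeeze(-1)
        return self.head(feat)                       # predicted clothing implicit code

def tcn_loss(pred_code, gt_code):
    # in the spirit of Eq. (14): distance between the predicted and ground-truth codes z_c
    return ((pred_code - gt_code) ** 2).mean()

def resolve_intersections(cloth_verts, body_points, body_normals, margin=0.005):
    """Project intersecting clothing vertices back onto the body surface with a normal margin."""
    d = torch.cdist(cloth_verts, body_points)                    # nearest body point, brute force
    idx = d.argmin(dim=1)
    nearest, normal = body_points[idx], body_normals[idx]
    inside = ((cloth_verts - nearest) * normal).sum(-1) < 0      # vertex lies behind the body surface
    pushed = nearest + margin * normal                           # push it back outside, along the normal
    return torch.where(inside.unsqueeze(-1), pushed, cloth_verts)
```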

圖10說明根據一些實施例的即時三維穿著衣服之個體再現模型1000的動畫虛擬化身1021A-1（單層，非隱性，姿態A）、1021A-2（單層，隱性，姿態A）、1021A-3（雙層，姿態A）、1021B-1（單層，非隱性，姿態B）、1021B-2（單層，隱性，姿態B）及1021B-3（雙層，姿態B）。FIG. 10 illustrates animated avatars 1021A-1 (single-layer, non-implicit, pose A), 1021A-2 (single-layer, implicit, pose A), 1021A-3 (two-layer, pose A), 1021B-1 (single-layer, non-implicit, pose B), 1021B-2 (single-layer, implicit, pose B), and 1021B-3 (two-layer, pose B) of a real-time three-dimensional clothed individual rendering model 1000, according to some embodiments.

雙層虛擬化身1021A-3及1021B-3（在下文中，被集體地稱作「雙層虛擬化身1021-3」）係由3D骨架姿態及人臉關鍵點駕駛。模型1000將當前圖框（例如，姿態A或B）之骨架姿態及人臉關鍵點饋送至身體解碼器（例如，身體解碼器603A）。服裝解碼器（例如，服裝解碼器603B）係由隱性服裝程式碼（例如，隱性程式碼604B-1）經由TCN駕駛，該TCN採用歷史及當前姿態之時間窗作為輸入。模型1000經由單元高斯分佈（例如，服裝輸入604B）之隨機取樣而使單層虛擬化身1021A-1、1021A-2、1021B-1及1021B-2（在下文中，被集體地稱作「單層虛擬化身1021-1及1021-2」）動畫化，並且在可用之情況下使用所得噪聲值來插補隱性程式碼。對於虛擬化身1021A-2及1021B-2中之所取樣隱性程式碼，模型1000將骨架姿態及人臉關鍵點一起饋送至解碼器網路（例如，網路600）中。模型1000在雙層虛擬化身1021-3中移除動畫輸出中之服裝區域中，尤其服裝邊界周圍之嚴重假影。實情為，由於身體及服裝一起模型化，單層虛擬化身1021-1及1021-2依賴於隱性程式碼以描述對應於相同身體姿態之許多可能服裝狀態。在動畫期間，儘管努力將隱性空間與駕駛信號分離，但地面實況隱性程式碼之不存在會導致輸出之降級。The two-layer avatars 1021A-3 and 1021B-3 (hereinafter collectively referred to as "two-layer avatars 1021-3") are driven by 3D skeletal poses and face keypoints. Model 1000 feeds the skeletal pose and face keypoints of the current frame (e.g., pose A or B) to the body decoder (e.g., body decoder 603A). The clothing decoder (e.g., clothing decoder 603B) is driven by an implicit clothing code (e.g., implicit code 604B-1) via the TCN, which takes a time window of historical and current poses as input. Model 1000 animates the single-layer avatars 1021A-1, 1021A-2, 1021B-1, and 1021B-2 (hereinafter collectively referred to as "single-layer avatars 1021-1 and 1021-2") via random sampling from a unit Gaussian distribution (e.g., clothing input 604B), and, where available, uses the resulting noise values to interpolate the implicit code. For the sampled implicit codes in avatars 1021A-2 and 1021B-2, model 1000 feeds the skeletal pose and face keypoints together into the decoder network (e.g., network 600). In the two-layer avatars 1021-3, model 1000 removes severe artifacts in the clothing regions of the animation output, especially around the clothing boundaries. Indeed, because the body and clothing are modeled together, the single-layer avatars 1021-1 and 1021-2 rely on the implicit code to describe the many possible clothing states corresponding to the same body pose. During animation, despite efforts to separate the implicit space from the driving signal, the absence of a ground-truth implicit code leads to degraded output.
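A possible driving loop matching the above description is sketched below; the decoder and TCN interfaces are hypothetical placeholders standing in for body decoder 603A, clothing decoder 603B, and the TCN, and the window length is an assumption.

```python
import collections
import torch

def drive_two_layer_avatar(body_decoder, clothing_decoder, tcn, pose_stream, window=16):
    """Yield per-frame body and clothing meshes/textures from a stream of driving signals."""
    pose_history = collections.deque(maxlen=window)
    for skeletal_pose, face_keypoints in pose_stream:          # driving signal per frame
        pose_history.append(skeletal_pose)
        poses = torch.stack(list(pose_history), dim=-1)        # (joint_dim, <=window)
        cloth_code = tcn(poses.unsqueeze(0))                   # clothing implicit code from the TCN
        body_mesh, body_tex = body_decoder(skeletal_pose, face_keypoints)
        cloth_mesh, cloth_tex = clothing_decoder(cloth_code)
        yield body_mesh, body_tex, cloth_mesh, cloth_tex       # ready for layered rendering
```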

雙層虛擬化身1021-3藉由將身體及服裝分離成不同模組來實現較佳動畫品質，如藉由將單層虛擬化身1021-1及1021-2中之邊界區域1044A-1、1044A-2、1044B-1、1044B-2、1046A-1、1046A-2、1046B-1及1046B-2與雙層虛擬化身1021-3中之邊界區域1044A-3、1046A-3、1044B-3及1046B-3（例如，包括穿著衣服之部分及裸露身體部分之區域，在下文中，被集體地稱作邊界區域1044及1046）進行比較可見。因此，身體解碼器（例如，身體解碼器603A）可在給定當前圖框之駕駛信號之情況下判定身體狀態，TCN學習在更長週期內自身體動態推斷最合理之服裝狀態，並且服裝解碼器（例如，服裝解碼器605B）在給定其習知平滑隱性流形之情況下確保合理服裝輸出。另外，雙層虛擬化身1021-3展示這些定性影像中具有更清晰服裝邊界及更清晰起皺圖案之結果。對動畫輸出之定量分析包括相對於所捕捉地面實況影像評估輸出影像。模型1000可報告關於前景像素上之均方誤差（MSE）及結構類似性指數量測（SSIM）的評估度量。雙層虛擬化身1021-3典型地在所有三個序列及兩個評估度量上工作性能優於單層虛擬化身1021-1及1021-2。The two-layer avatar 1021-3 achieves better animation quality by separating the body and clothing into different modules, as can be seen by comparing the border regions 1044A-1, 1044A-2, 1044B-1, 1044B-2, 1046A-1, 1046A-2, 1046B-1, and 1046B-2 in the single-layer avatars 1021-1 and 1021-2 with the border regions 1044A-3, 1046A-3, 1044B-3, and 1046B-3 in the two-layer avatar 1021-3 (e.g., regions that include both clothed portions and exposed body portions; hereinafter collectively referred to as border regions 1044 and 1046). Thus, the body decoder (e.g., body decoder 603A) can determine the body state given the driving signal of the current frame, the TCN learns to infer the most plausible clothing state from the body dynamics over a longer period, and the clothing decoder (e.g., clothing decoder 605B) ensures a plausible clothing output given its learned smooth implicit manifold. In addition, the two-layer avatar 1021-3 shows sharper clothing boundaries and sharper wrinkle patterns in these qualitative images. The quantitative analysis of the animation output includes evaluating the output images against the captured ground-truth images. Model 1000 may report evaluation metrics of the mean squared error (MSE) and the structural similarity index measure (SSIM) over the foreground pixels. The two-layer avatar 1021-3 typically outperforms the single-layer avatars 1021-1 and 1021-2 on all three sequences and both evaluation metrics.
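The quantitative evaluation described above may be sketched as follows; the use of scikit-image's SSIM and the foreground masking strategy are assumptions for illustration, not the evaluation code of the present disclosure.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def foreground_mse(pred, gt, mask):
    """pred, gt: HxWx3 float images in [0, 1]; mask: HxW boolean foreground mask."""
    return ((pred - gt) ** 2)[mask].mean()

def foreground_ssim(pred, gt, mask):
    # compute a full SSIM map, then average it over the foreground pixels only
    _, ssim_map = ssim(gt, pred, channel_axis=-1, data_range=1.0, full=True)
    return ssim_map[mask].mean()
```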

圖11說明根據一些實施例的在第一姿態中個體303的不同即時三維穿著衣服之虛擬化身1121A-1、1121B-1、1121C-1、1121D-1、1121E-1及1121F-1（在下文中，被集體地稱作「虛擬化身1121-1」）與在第二姿態中個體303的穿著衣服之虛擬化身1121A-2、1121B-2、1121C-2、1121D-2、1121E-2及1121F-2（在下文中，被集體地稱作「虛擬化身1121-2」）之間的機率相關性之比較1100。FIG. 11 illustrates a comparison 1100 of probabilistic correlations between different real-time three-dimensional clothed avatars 1121A-1, 1121B-1, 1121C-1, 1121D-1, 1121E-1, and 1121F-1 of individual 303 in a first pose (hereinafter collectively referred to as "avatars 1121-1") and clothed avatars 1121A-2, 1121B-2, 1121C-2, 1121D-2, 1121E-2, and 1121F-2 of individual 303 in a second pose (hereinafter collectively referred to as "avatars 1121-2"), according to some embodiments.

在無隱性編碼之單層模型中獲得虛擬化身1121A-1、1121D-1及1121A-2、1121D-2。在使用隱性編碼之單層模型中獲得虛擬化身1121B-1、1121E-1及1121B-2、1121E-2。並且在雙層模型中獲得虛擬化身1121C-1、1121F-1及1121C-2、1121F-2。Avatars 1121A-1, 1121D-1 and 1121A-2, 1121D-2 are obtained in a single-layer model without implicit coding. Avatars 1121B-1, 1121E-1 and 1121B-2, 1121E-2 are obtained in a single-layer model using implicit coding. And in the two-layer model, avatars 1121C-1, 1121F-1 and 1121C-2, 1121F-2 are obtained.

虛線1110A-1、1110A-2及1110A-3（在下文中，被集體地稱作「虛線1110A」）指示區域1146A、1146B、1146C、1146D、1146E及1146F（在下文中，被集體地稱作「邊界區域1146」）周圍個體303中之服裝區域之改變。Dashed lines 1110A-1, 1110A-2, and 1110A-3 (hereinafter collectively referred to as "dashed lines 1110A") indicate changes in the clothing region of individual 303 around regions 1146A, 1146B, 1146C, 1146D, 1146E, and 1146F (hereinafter collectively referred to as "border regions 1146").

圖12說明根據一些實施例的用於直接服裝模型化1200之消融分析。圖框1210A說明藉由模型1200在無隱性空間之情況下獲得的虛擬化身1221A、藉由包括雙層網路之模型1200獲得的虛擬化身1221-1，以及對應地面實況影像1201-1。自作為輸入之一系列骨架姿態直接回歸服裝幾何形狀及紋理而獲得虛擬化身1221A。圖框1210B說明相較於包括雙層網路之模型1200中之虛擬化身1221-2，藉由模型1200在無紋理對準步驟之情況下獲得的虛擬化身1221B具有對應地面實況影像1201-2。虛擬化身1221-1及1221-2展示更清晰紋理圖案。圖框1210C說明藉由模型1200在無視圖調節效應之情況下獲得的虛擬化身1221C。應注意藉由包括視圖調節步驟之模型1200獲得的虛擬化身1221-3中個體之輪廓附近的照明之強反射率。FIG. 12 illustrates an ablation analysis for direct clothing modeling 1200, according to some embodiments. Frame 1210A illustrates the avatar 1221A obtained by the model 1200 without an implicit space, the avatar 1221-1 obtained by the model 1200 including the two-layer network, and the corresponding ground-truth image 1201-1. The avatar 1221A is obtained by directly regressing the clothing geometry and texture from a series of skeletal poses as input. Frame 1210B illustrates the avatar 1221B obtained by the model 1200 without the texture alignment step, as compared to the avatar 1221-2 from the model 1200 including the two-layer network, together with the corresponding ground-truth image 1201-2. The avatars 1221-1 and 1221-2 show sharper texture patterns. Frame 1210C illustrates the avatar 1221C obtained by the model 1200 without view-conditioning effects. Note the strong reflectance of the illumination near the silhouette of the individual in the avatar 1221-3 obtained by the model 1200 including the view-conditioning step.

此設計之一個替代方案為將身體及服裝網路（例如，網路600）之功能性組合為一個：訓練採用一系列骨架姿態作為輸入且預測服裝幾何形狀及紋理作為輸出（例如，虛擬化身1221-1）的解碼器。虛擬化身1221A在個體之胸部附近之標誌區域周圍模糊。實情為，甚至一系列骨架姿態不含有足夠資訊以完全地判定服裝狀態。因此，將回歸自變數自資訊不足輸入（例如，無隱性空間）直接訓練至最終服裝輸出會導致藉由模型對資料的擬合不足。相比之下，包括雙層網路之模型1200可藉由產生隱性空間對不同服裝狀態進行詳細模型化，而時間模型化網路推斷最可能服裝狀態。以此方式，雙層網路可產生具有清晰細節之高品質動畫輸出。An alternative to this design is to combine the functionality of the body and clothing networks (e.g., network 600) into one: training a decoder that takes a series of skeletal poses as input and predicts the clothing geometry and texture as output (e.g., avatar 1221-1). The avatar 1221A is blurry around the logo region near the individual's chest. Indeed, even a series of skeletal poses does not contain enough information to fully determine the clothing state. Therefore, training the regression directly from an information-deficient input (e.g., without an implicit space) to the final clothing output leads the model to underfit the data. In contrast, the model 1200 including the two-layer network can model different clothing states in detail by producing an implicit space, while the temporal modeling network infers the most likely clothing state. In this way, the two-layer network can produce high-quality animation output with sharp details.

相對於在無紋理對準之情況下對資料進行訓練的基線模型（虛擬化身1221B），模型1200藉由在有紋理對準之情況下對經配準身體及服裝資料進行訓練來產生虛擬化身1221-2。因此，光度紋理對準有助於在動畫輸出中產生更清晰細節，此係因為較佳紋理對準使資料對於網路而言更易於消化。另外，來自包括雙層網路之模型1200的虛擬化身1221-3包括視圖相關效應且與無視圖調節之虛擬化身1221C相比在視覺上更類似於地面實況1201-3。在入射角接近90時，觀測到個體之輪廓附近的差異，其中虛擬化身1221-3由於菲涅爾反射率而更亮，此因素使視圖相關輸出更逼真。在一些實施例中，時間模型傾向於產生具有小時間窗之抖動的輸出。TCN中之較長時間窗實現視覺時間一致性與模型效率之間的所要取捨。Relative to a baseline model (avatar 1221B) trained on data without texture alignment, model 1200 generates the avatar 1221-2 by training on registered body and clothing data with texture alignment. The photometric texture alignment therefore helps produce sharper details in the animation output, because better texture alignment makes the data easier for the network to digest. In addition, the avatar 1221-3 from the model 1200 including the two-layer network includes view-dependent effects and is visually more similar to the ground truth 1201-3 than the avatar 1221C obtained without view conditioning. As the angle of incidence approaches 90 degrees, a difference is observed near the individual's silhouette, where the avatar 1221-3 is brighter due to Fresnel reflectance, a factor that makes the view-dependent output more realistic. In some embodiments, a temporal model with a small time window tends to produce jittery output. A longer time window in the TCN achieves the desired trade-off between visual temporal consistency and model efficiency.

圖13為說明根據一些實施例的用於訓練直接服裝模型以自雙目視訊產生即時個體動畫之方法1300中之步驟的流程圖。在一些實施例中,方法1300可至少部分地藉由執行如本文中所揭示之用戶端裝置或伺服器中之指令的處理器執行(參見處理器212及記憶體220、用戶端裝置110及伺服器130)。在一些實施例中,方法1300中之步驟中之至少一或多者可藉由安裝於用戶端裝置中之應用程式或包括服裝動畫模型之模型訓練引擎(例如,應用程式222、模型訓練引擎232及服裝動畫模型240)來執行。使用者可經由如本文中所揭示之輸入及輸出元件以及GUI(參見輸入裝置214、輸出裝置216及GUI 225)與用戶端裝置中之應用程式互動。服裝動畫模型可包括如本文中所揭示之身體解碼器、服裝解碼器、分段工具及時間卷積工具(例如,身體解碼器242、服裝解碼器244、分段工具246及時間卷積工具248)。在一些實施例中,與本揭示內容一致之方法可包括方法1300中之至少一或多個步驟,該至少一或多個步驟按不同次序、同時、幾乎同時或時間上重疊地執行。13 is a flowchart illustrating steps in a method 1300 for training a direct clothing model to generate real-time individual animation from binocular video, according to some embodiments. In some embodiments, method 1300 may be performed, at least in part, by a processor executing instructions in a client device or server as disclosed herein (see processor 212 and memory 220, client device 110 and server device 130). In some embodiments, at least one or more of the steps in method 1300 may be performed by an application installed in the client device or a model training engine (eg, application 222, model training engine 232) that includes an animation model of clothing and clothing animation model 240) to perform. A user may interact with applications in the client device via input and output elements and GUIs (see input device 214, output device 216, and GUI 225) as disclosed herein. The clothing animation model may include body decoders, clothing decoders, segmentation tools, and temporal convolution tools (eg, body decoder 242 , clothing decoder 244 , segmentation tools 246 , and temporal convolution tools 248 ) as disclosed herein ). In some embodiments, methods consistent with the present disclosure may include at least one or more steps of method 1300 performed in a different order, simultaneously, nearly simultaneously, or overlapping in time.

步驟1302包括收集個體之多個影像,來自個體之影像包括個體之一或多個不同視角。Step 1302 includes collecting a plurality of images of the individual, the images from the individual including one or more different viewpoints of the individual.

步驟1304包括基於個體之影像而形成三維服裝網格及三維身體網格。Step 1304 includes forming a 3D garment mesh and a 3D body mesh based on the image of the individual.

步驟1306包括將三維服裝網格與三維身體網格對準以形成皮膚-服裝邊界及衣物紋理。Step 1306 includes aligning the three-dimensional garment mesh with the three-dimensional body mesh to form the skin-garment boundary and garment texture.

步驟1308包括基於經預測服裝位置及衣物紋理以及來自個體之影像的經內插位置及衣物紋理而判定損耗因數。Step 1308 includes determining a loss factor based on the predicted clothing position and clothing texture and the interpolated position and clothing texture from the imagery of the individual.

步驟1310包括根據損耗因數更新包括三維服裝網格及三維身體網格之三維模型。Step 1310 includes updating the 3D model including the 3D garment mesh and the 3D body mesh according to the loss factor.
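By way of a non-limiting illustration, steps 1302 through 1310 may be composed into a training loop such as the following sketch. All helper names (capture_rig, reconstruct_meshes, align_clothing_to_body, interpolate_from_images) are hypothetical placeholders, and the loss is shown as a simple squared error for concreteness.

```python
import torch

def train_direct_clothing_model(model, capture_rig, optimizer, num_iterations=10_000):
    for _ in range(num_iterations):
        images = capture_rig.capture_images()                      # step 1302: multi-view images
        cloth_mesh, body_mesh = reconstruct_meshes(images)         # step 1304: 3D clothing + body meshes
        boundary, garment_texture = align_clothing_to_body(        # step 1306: skin-clothing boundary
            cloth_mesh, body_mesh)                                 #            and garment texture
        pred_cloth, pred_texture = model(body_mesh, boundary)      # predicted clothing position/texture
        interp_cloth, interp_texture = interpolate_from_images(    # interpolated targets from the images
            images, cloth_mesh, garment_texture)
        loss = (((pred_cloth - interp_cloth) ** 2).mean()          # step 1308: loss factor
                + ((pred_texture - interp_texture) ** 2).mean())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                           # step 1310: update the 3D model
    return model
```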

圖14為說明根據一些實施例的用於將即時穿著衣服之個體動畫嵌入於虛擬實境環境中之方法1400中之步驟的流程圖。在一些實施例中,方法1400可至少部分地藉由執行如本文中所揭示之用戶端裝置或伺服器中之指令的處理器執行(參見處理器212及記憶體220、用戶端裝置110及伺服器130)。在一些實施例中,方法1400中之步驟中之至少一或多者可藉由安裝於用戶端裝置中之應用程式或包括服裝動畫模型之模型訓練引擎(例如,應用程式222、模型訓練引擎232及服裝動畫模型240)來執行。使用者可經由如本文中所揭示之輸入及輸出元件以及GUI(參見輸入裝置214、輸出裝置216及GUI 225)與用戶端裝置中之應用程式互動。服裝動畫模型可包括如本文中所揭示之身體解碼器、服裝解碼器、分段工具及時間卷積工具(例如,身體解碼器242、服裝解碼器244、分段工具246及時間卷積工具248)。在一些實施例中,與本揭示內容一致之方法可包括方法1400中之至少一或多個步驟,該至少一或多個步驟按不同次序、同時、幾乎同時或時間上重疊地執行。FIG. 14 is a flowchart illustrating steps in a method 1400 for embedding an animation of a real-time clothed individual in a virtual reality environment, according to some embodiments. In some embodiments, method 1400 may be performed, at least in part, by a processor executing instructions in a client device or server as disclosed herein (see processor 212 and memory 220, client device 110 and server device 130). In some embodiments, at least one or more of the steps in method 1400 may be performed by an application installed in the client device or a model training engine (eg, application 222, model training engine 232) that includes an animation model of a garment and clothing animation model 240) to perform. A user may interact with applications in the client device via input and output elements and GUIs (see input device 214, output device 216, and GUI 225) as disclosed herein. The clothing animation model may include body decoders, clothing decoders, segmentation tools, and temporal convolution tools (eg, body decoder 242 , clothing decoder 244 , segmentation tools 246 , and temporal convolution tools 248 ) as disclosed herein ). In some embodiments, methods consistent with the present disclosure may include at least one or more steps of method 1400 performed in a different order, simultaneously, nearly simultaneously, or overlapping in time.

步驟1402包括自個體收集影像。在一些實施例中,步驟1402包括自個體收集立體或雙目影像。在一些實施例中,步驟1402包括同時或幾乎同時自個體之不同視圖收集多個影像。Step 1402 includes collecting images from the individual. In some embodiments, step 1402 includes collecting stereoscopic or binocular imagery from the individual. In some embodiments, step 1402 includes collecting multiple images simultaneously or nearly simultaneously from different views of the individual.

步驟1404包括自影像選擇多個二維關鍵點。Step 1404 includes selecting a plurality of two-dimensional keypoints from the image.

步驟1406包括識別與影像中之每一二維關鍵點相關聯的三維骨架姿態。Step 1406 includes identifying a 3D skeletal pose associated with each 2D keypoint in the imagery.

步驟1408包括藉由三維模型判定三維服裝網格及三維身體網格,三維服裝網格及三維身體網格錨定於一或多個三維骨架姿態中。Step 1408 includes determining a 3D garment mesh and a 3D body mesh from the 3D model, the 3D garment mesh and the 3D body mesh being anchored in one or more 3D skeletal poses.

步驟1410包括產生包括三維服裝網格、三維身體網格及紋理的個體之三維表示。Step 1410 includes generating a three-dimensional representation of an individual including a three-dimensional garment mesh, a three-dimensional body mesh, and textures.

步驟1412包括將個體之三維表示即時地嵌入於虛擬實境環境中。Step 1412 includes embedding the three-dimensional representation of the individual in the virtual reality environment in real time.

硬體綜述 Hardware Overview

圖15為說明例示性電腦系統1500之方塊圖,藉由該電腦系統可實施圖1及圖2之用戶端及伺服器以及圖13及圖14之方法。在某些態樣中,電腦系統1500可使用在專屬伺服器中或整合至另一實體中或跨越多個實體而分佈的硬體或軟體與硬體之組合來實施。15 is a block diagram illustrating an exemplary computer system 1500 by which the clients and servers of FIGS. 1 and 2 and the methods of FIGS. 13 and 14 may be implemented. In some aspects, computer system 1500 may be implemented using hardware or a combination of software and hardware, either on a dedicated server or integrated into another entity or distributed across multiple entities.

電腦系統1500(例如,用戶端110及伺服器130)包括用於傳達資訊之匯流排1508或其他通信機構,以及與匯流排1508耦接以用於處理資訊之處理器1502(例如,處理器212)。藉助於實例,電腦系統1500可藉由一或多個處理器1502實施。處理器1502可為通用微處理器、微控制器、數位信號處理器(DSP)、特殊應用積體電路(ASIC)、場可程式化閘陣列(FPGA)、可程式化邏輯裝置(PLD)、控制器、狀態機、閘控邏輯、離散硬體組件或可執行資訊之計算或其他操控的任何其他合適實體。Computer system 1500 (eg, client 110 and server 130 ) includes a bus 1508 or other communication mechanism for communicating information, and a processor 1502 (eg, processor 212 ) coupled to bus 1508 for processing information ). By way of example, computer system 1500 may be implemented with one or more processors 1502 . The processor 1502 may be a general purpose microprocessor, microcontroller, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA), programmable logic device (PLD), A controller, state machine, gating logic, discrete hardware component, or any other suitable entity that can perform computation or other manipulation of information.

除硬體之外,電腦系統1500亦可包括產生用於所討論之電腦程式之執行環境的程式碼,例如構成以下的程式碼:處理器韌體、協定堆疊、資料庫管理系統、作業系統或其在以下中儲存中之一或多者的組合:所包括記憶體1504(例如,記憶體220)(諸如隨機存取記憶體(RAM)、快閃記憶體、唯讀記憶體(ROM)、可程式化唯讀記憶體(PROM)、可抹除可程式化唯讀記憶體(EPROM))、暫存器、硬碟、可移磁碟、CD-ROM、DVD或與匯流排1508耦接以用於儲存待藉由處理器1502執行之資訊及指令的任何其他合適儲存裝置。處理器1502及記憶體1504可由專用邏輯電路補充或併入於專用邏輯電路中。In addition to hardware, computer system 1500 may also include code that generates an execution environment for the computer program in question, such as code that constitutes the following: processor firmware, protocol stack, database management system, operating system, or It is stored in a combination of one or more of: included memory 1504 (eg, memory 220) (such as random access memory (RAM), flash memory, read only memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Scratchpad, Hard Disk, Removable Disk, CD-ROM, DVD, or coupled to bus 1508 to any other suitable storage device for storing information and instructions to be executed by the processor 1502. Processor 1502 and memory 1504 may be supplemented by or incorporated in special purpose logic circuitry.

指令可儲存於記憶體1504中且在一或多個電腦程式產品中實施,例如在電腦可讀取媒體上編碼以供電腦系統1500執行或控制該電腦系統之操作的電腦程式指令之一或多個模組,並且根據所屬技術領域中具有通常知識者熟知之任何方法,該些指令包括但不限於諸如以下之電腦語言:資料導向語言(例如,SQL、dBase)、系統語言(例如,C、Objective-C、C++、組合程式)、架構語言(例如,Java、.NET)及應用程式語言(例如,PHP、Ruby、Perl、Python)。指令亦可以電腦語言實施,諸如陣列語言、特性導向語言、組合程式語言、製作語言、命令行介面語言、編譯語言、並行語言、波形括號語言、資料流語言、資料結構式語言、宣告式語言、深奧語言、擴展語言、***語言、函數語言、互動模式語言、解譯語言、反覆語言、串列為基的語言、小語言、以邏輯為基的語言、機器語言、巨集語言、元程式設計語言、多重範型語言(multiparadigm language)、數值分析、非英語語言、物件導向分類式語言、物件導向基於原型的語言、場外規則語言、程序語言、反射語言、基於規則的語言、指令碼處理語言、基於堆疊的語言、同步語言、語法處置語言、視覺語言、wirth語言及基於xml的語言。記憶體1504亦可用於在待由處理器1502執行之指令之執行期間儲存暫時變數或其他中間資訊。Instructions may be stored in memory 1504 and implemented in one or more computer program products, such as one or more computer program instructions encoded on a computer-readable medium for execution by computer system 1500 or to control the operation of the computer system. The instructions include, but are not limited to, computer languages such as: data-oriented languages (eg, SQL, dBase), system languages (eg, C, Objective-C, C++, Composer), architecture languages (eg, Java, .NET), and application languages (eg, PHP, Ruby, Perl, Python). Instructions can also be implemented in computer languages such as array languages, feature-oriented languages, combinatorial programming languages, authoring languages, command-line interface languages, compiled languages, parallel languages, curly bracket languages, data stream languages, data-structured languages, declarative languages, Esoteric languages, extended languages, fourth-generation languages, functional languages, interactive pattern languages, interpreted languages, iterative languages, string-based languages, small languages, logic-based languages, machine languages, macro languages, meta-languages programming languages, multiparadigm languages, numerical analysis, non-English languages, object-oriented categorical languages, object-oriented prototype-based languages, off-site rule languages, programming languages, reflective languages, rule-based languages, scripting Processing languages, stack-based languages, synchronous languages, grammar-processing languages, visual languages, wirth languages, and xml-based languages. Memory 1504 may also be used to store temporary variables or other intermediate information during execution of instructions to be executed by processor 1502.

如本文中所論述之電腦程式未必對應於檔案系統中之檔案。程式可儲存於保持其他程式或資料(例如,儲存於標記語言文件中之一或多個指令碼)的檔案的部分中、儲存於專用於所討論之程式的單個檔案中,或儲存於多個經協調檔案(例如,儲存一或多個模組、子程式或程式碼之部分的檔案)中。電腦程式可經部署以在一台電腦上或在位於一個地點或分佈在多個地點且由通信網路互連的多台電腦上執行。本說明書中所描述之程序及邏輯流程可由一或多個可程式化處理器執行,該一或多個可程式化處理器執行一或多個電腦程式以藉由對輸入資料進行操作且產生輸出來執行功能。Computer programs as discussed herein do not necessarily correspond to files in a file system. Programs may be stored in sections of files that hold other programs or data (eg, one or more scripts stored in a markup language file), in a single file dedicated to the program in question, or in multiple In coordinated files (eg, files that store portions of one or more modules, subprograms, or code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communications network. The procedures and logic flows described in this specification can be executed by one or more programmable processors executing one or more computer programs to operate on input data and generate output to perform the function.

電腦系統1500進一步包括與匯流排1508耦接以用於儲存資訊及指令的資料儲存裝置1506,諸如磁碟或光碟。電腦系統1500可經由輸入/輸出模組1510耦接至各種裝置。輸入/輸出模組1510可為任何輸入/輸出模組。例示性輸入/輸出模組1510包括諸如USB埠之資料埠。輸入/輸出模組1510經組態以連接至通信模組1512。例示性通信模組1512(例如,通信模組218)包括網路連接介面卡,諸如乙太網路卡及數據機。在某些態樣中,輸入/輸出模組1510經組態以連接至複數個裝置,諸如輸入裝置1514(例如,輸入裝置214)及/或輸出裝置1516(例如,輸出裝置216)。例示性輸入裝置1514包括鍵盤及指標裝置,例如滑鼠或軌跡球,使用者可藉由該指標裝置將輸入提供至電腦系統1500。其他種類之輸入裝置1514亦可用於提供與使用者的互動,諸如觸覺輸入裝置、視覺輸入裝置、音訊輸入裝置或腦機介面裝置。舉例言之,提供給使用者之回饋可為任何形式之感測回饋,例如視覺回饋、聽覺回饋或觸覺回饋;並且可自使用者接收任何形式之輸入,包括聲輸入、語音輸入、觸覺輸入或腦波輸入。例示性輸出裝置1516包括用於向使用者顯示資訊之顯示裝置,諸如液晶顯示器(LCD)監視器。Computer system 1500 further includes a data storage device 1506, such as a magnetic or optical disk, coupled to bus 1508 for storing information and instructions. Computer system 1500 may be coupled to various devices via input/output module 1510 . The input/output module 1510 can be any input/output module. Exemplary input/output modules 1510 include data ports such as USB ports. Input/output module 1510 is configured to connect to communication module 1512. Exemplary communication modules 1512 (eg, communication module 218) include network connection interface cards, such as Ethernet cards and modems. In some aspects, input/output module 1510 is configured to connect to a plurality of devices, such as input device 1514 (eg, input device 214) and/or output device 1516 (eg, output device 216). Exemplary input devices 1514 include keyboards and pointing devices, such as a mouse or trackball, by which a user may provide input to computer system 1500 . Other types of input devices 1514 can also be used to provide interaction with the user, such as tactile input devices, visual input devices, audio input devices, or brain-computer interface devices. For example, the feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and can receive any form of input from the user, including acoustic input, voice input, tactile input, or Brainwave input. Exemplary output device 1516 includes a display device, such as a liquid crystal display (LCD) monitor, for displaying information to a user.

根據本揭示內容之一個態樣,可回應於處理器1502執行記憶體1504中所含有之一或多個指令之一或多個序列而使用電腦系統1500來實施用戶端110及伺服器130。此類指令可自諸如資料儲存裝置1506等另一機器可讀取媒體讀取至記憶體1504中。主記憶體1504中含有之指令序列的執行使處理器1502執行本文中所描述之程序步驟。呈多處理配置之一或多個處理器亦可用於執行記憶體1504中含有之指令序列。在替代態樣中,硬佈線電路可代替軟體指令使用或與軟體指令組合使用,以實施本揭示內容之各種態樣。因此,本揭示內容之態樣不限於硬體電路及軟體之任何具體組合。According to one aspect of the present disclosure, client 110 and server 130 may be implemented using computer system 1500 in response to processor 1502 executing one or more sequences of one or more instructions contained in memory 1504 . Such instructions may be read into memory 1504 from another machine-readable medium, such as data storage 1506. Execution of the sequences of instructions contained in main memory 1504 causes processor 1502 to perform the program steps described herein. One or more processors in a multiprocessing configuration may also be used to execute sequences of instructions contained in memory 1504 . In alternative aspects, hard-wired circuitry may be used in place of or in combination with software instructions to implement various aspects of the present disclosure. Accordingly, aspects of the present disclosure are not limited to any specific combination of hardware circuitry and software.

本說明書中所描述之主題的各種態樣可於計算系統中實施,該計算系統包括後端組件,例如資料伺服器,或包括中間軟體組件,例如應用程式伺服器,或包括前端組件,例如具有使用者可與本說明書中所描述之主題之實施互動所經由的圖形使用者介面或網路瀏覽器的用戶端電腦,或一或多個這些後端組件、中間軟體組件或前端組件的任何組合。系統之組件可藉由數位資料通信之任何形式或媒體(例如,通信網路)互連。通信網路(例如,網路150)可包括例如LAN、WAN、網際網路及其類似者中之任一或多者。此外,通信網路可包括但不限於例如以下工具拓樸中之任一或多者,包括:匯流排網路、星形網路、環形網路、網格網路、星形匯流排網路、樹或階層式網路或其類似者。通信模組可例如為數據機或乙太網路卡。Various aspects of the subject matter described in this specification can be implemented in computing systems that include back-end components, such as data servers, or intermediate software components, such as application servers, or front-end components, such as A client computer through which a user may interact with implementations of the subject matter described in this specification through a graphical user interface or web browser, or any combination of one or more of these back-end components, intermediate software components, or front-end components . The components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). A communication network (eg, network 150) may include, for example, any or more of a LAN, WAN, the Internet, and the like. Additionally, communication networks may include, but are not limited to, for example, any or more of the following tool topologies, including: bus networks, star networks, ring networks, mesh networks, star bus networks , tree or hierarchical network or the like. The communication module can be, for example, a modem or an Ethernet card.

電腦系統1500可包括用戶端及伺服器。用戶端與伺服器通常彼此遠離且典型地經由通信網路互動。用戶端與伺服器之關係藉助於在各別電腦上運行且彼此具有主從式關係的電腦程式產生。電腦系統1500可為例如但不限於桌上型電腦、膝上型電腦或平板電腦。電腦系統1500亦可嵌入於另一裝置中,例如但不限於行動電話、PDA、行動音訊播放器、全球定位系統(GPS)接收器、視訊遊戲控制台及/或電視機上盒。The computer system 1500 may include a client and a server. Clients and servers are usually remote from each other and typically interact via a communication network. The relationship between the client and the server is created by means of computer programs running on the respective computers and having a master-slave relationship with each other. Computer system 1500 may be, for example, but not limited to, a desktop computer, laptop computer, or tablet computer. Computer system 1500 may also be embedded in another device, such as, but not limited to, a mobile phone, PDA, mobile audio player, global positioning system (GPS) receiver, video game console, and/or television set-top box.

如本文中所使用之術語「機器可讀取儲存媒體」或「電腦可讀取媒體」係指參與將指令提供至處理器1502以供執行之任一或多個媒體。此類媒體可呈許多形式,包括但不限於非揮發性媒體、揮發性媒體及傳輸媒體。非揮發性媒體包括例如光碟或磁碟,諸如資料儲存裝置1506。揮發性媒體包括動態記憶體,諸如記憶體1504。傳輸媒體包括同軸纜線、銅線及光纖,包括形成匯流排1508之電線。機器可讀取媒體之常見形式包括例如軟碟、軟性磁碟、硬碟、磁帶、任何其他磁性媒體、CD-ROM、DVD、任何其他光學媒體、打孔卡、紙帶、具有孔圖案之任何其他實體媒體、RAM、PROM、EPROM、FLASH EPROM、任何其他記憶體晶片或卡匣,或可供電腦讀取之任何其他媒體。機器可讀取儲存媒體可為機器可讀取儲存裝置、機器可讀取儲存基板、記憶體裝置、影響機器可讀取傳播信號之物質的組合物,或其中之一或多者的組合。The term "machine-readable storage medium" or "computer-readable medium" as used herein refers to any medium or mediums that participate in providing instructions to processor 1502 for execution. Such media can take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as data storage device 1506 . Volatile media includes dynamic memory, such as memory 1504 . Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that form the bus bar 1508. Common forms of machine-readable media include, for example, floppy disks, floppy disks, hard disks, magnetic tapes, any other magnetic media, CD-ROMs, DVDs, any other optical media, punched cards, paper tape, any Other physical media, RAM, PROM, EPROM, FLASH EPROM, any other memory chips or cartridges, or any other media that can be read by a computer. A machine-readable storage medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter that affects a machine-readable propagated signal, or a combination of one or more thereof.

為了說明硬體與軟體之互換性,諸如各種說明性區塊、模組、組件、方法、操作、指令及演算法之項目已大體按其功能性加以了描述。將此類功能性實施為硬體、軟體抑或硬體與軟體之組合取決於外加於整個系統上之特定應用及設計約束。所屬技術領域中具有通常知識者可針對每一特定應用以不同方式實施所描述功能性。To illustrate this interchangeability of hardware and software, items such as various illustrative blocks, modules, components, methods, operations, instructions, and algorithms have been described generally in terms of their functionality. Implementing such functionality as hardware, software, or a combination of hardware and software depends on the particular application and design constraints imposed on the overall system. One of ordinary skill in the art may implement the described functionality in different ways for each particular application.

如本文中所使用,在一系列項目之前的藉由術語「及」或「或」分離該些項目中之任一者的片語「…中之至少一者」修改清單整體,而非清單中之每一成員(即,每一項目)。片語「…中之至少一者」不需要選擇至少一個項目;實情為,該片語允許包括該些項目中之任一者中的至少一者及/或該些項目之任何組合中的至少一者及/或該些項目中之每一者中的至少一者的涵義。藉助於實例,片語「A、B及C中之至少一者」或「A、B或C中之至少一者」各自指僅A、僅B或僅C;A、B及C之任何組合;及/或A、B及C中之每一者中的至少一者。As used herein, the phrase "at least one of" preceding a list of items by the terms "and" or "or" to separate any of those items modifies the list as a whole, rather than the list of each member (ie, each item). The phrase "at least one of" does not require selection of at least one item; instead, the phrase allows the inclusion of at least one of any of the items and/or at least one of any combination of the items Meaning of at least one of one and/or each of these items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each mean only A, only B, or only C; any combination of A, B, and C ; and/or at least one of each of A, B, and C.

就術語「包括」、「具有」或其類似者用於實施方式或申請專利範圍中而言,此術語意欲以類似於術語「包含」在「包含」作為過渡詞用於技術方案中時所解譯之方式而為包括性的。詞語「例示性」在本文中用於意謂「充當一實例、例子或說明」。本文中描述為「例示性」之任何實施例未必解釋為比其他實施例更佳或更有利。Insofar as the terms "include", "have" or the like are used in the implementation or the scope of the patent application, this term is intended to be understood similarly to the term "comprising" when "comprising" is used as a transition word in the technical solution The way it is translated is inclusive. The word "exemplary" is used herein to mean "serving as an instance, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

除非具體陳述,否則以單數形式對元件的提及並不意欲意謂「一個且僅一個」,而是意謂「一或多個」。所屬技術領域中具有通常知識者已知或稍後將知曉的貫穿本揭示內容而描述的各種組態之元件的所有結構及功能等效物以引用方式明確地併入本文中,且意欲由本主題技術涵蓋。此外,本文中所揭示之任何內容均不意欲專用於公眾,無論在以上描述中是否直接地敍述此揭示內容。不應依據35 U.S.C. §112第六段的規定解釋任何條項要素,除非使用片語「用於…之構件」來明確地敍述該要素或者在方法條項之狀況下使用片語「用於…之步驟」來敍述該要素。Reference to an element in the singular is not intended to mean "one and only one" unless specifically stated, but rather "one or more." All structural and functional equivalents to elements of the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be taken by the subject matter technology covered. Furthermore, nothing disclosed herein is intended to be dedicated to the public, whether or not such disclosure is directly recited in the above description. No element of a clause shall be construed in accordance with the sixth paragraph of 35 U.S.C. §112 unless the element is expressly recited using the phrase "a member for" or, in the case of a method clause, the phrase "for... steps" to describe this element.

雖然本說明書含有許多特性,但這些特性不應被解釋為限制可能主張之內容的範圍,而是應解釋為對主題之特定實施方式的描述。在分離實施例之上下文中描述於本說明書中的某些特徵亦可在單個實施例中以組合形式實施。相反,在單個實施例之上下文中描述的各種特徵亦可分別或以任何適合子組合於多個實施例中實施。此外,雖然上文可將特徵描述為以某些組合起作用且甚至最初按此來主張,但來自所主張組合之一或多個特徵在一些情況下可自該組合刪除,並且所主張之組合可針對子組合或子組合之變化。While this specification contains many features, these should not be construed as limiting the scope of what may be claimed, but rather as descriptions of particular embodiments of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as functioning in certain combinations and even originally claimed as such, one or more features from a claimed combination may in some cases be deleted from the combination and the claimed combination Variations of sub-combinations or sub-combinations may be targeted.

本說明書之主題已關於特定態樣加以描述,但其他態樣可經實施且在以下申請專利範圍之範圍內。舉例言之,雖然在圖式中以特定次序來描繪操作,但不應將此理解為需要以所展示之特定次序或以依序次序執行此類操作,或執行所有所說明操作以實現合乎需要之結果。可以不同次序執行請求項中所列舉之動作且仍實現合乎需要之結果。作為一個實例,附圖中描繪之程序未必需要展示之特定次序,或依序次序,以實現合乎需要之結果。在某些情形中,多任務及並行處理可為有利的。此外,不應將上文所描述之態樣中之各種系統組件的分離理解為在所有態樣中皆要求此分離,並且應理解,所描述之程式組件及系統可大體上一起整合於單個軟體產品中或封裝至多個軟體產品中。其他變化係在以下申請專利範圍之範圍內。The subject matter of this specification has been described in terms of specific aspects, but other aspects can be implemented and are within the scope of the following claims. For example, although operations are depicted in the figures in a particular order, this should not be construed as requiring performance of such operations, or performance of all illustrated operations, in the particular order shown or in a sequential order to achieve desirable the result. The actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain situations, multitasking and parallel processing may be advantageous. Furthermore, the separation of the various system components in the aspects described above should not be construed as requiring such separation in all aspects, and it should be understood that the described program components and systems may be generally integrated together in a single software product or packaged into multiple software products. Other changes are within the scope of the following patent application.

100:架構 110:用戶端裝置/用戶端 130:伺服器 150:網路 152:資料庫 200:方塊圖 212-1:處理器 212-2:處理器 214:輸入裝置 216:輸出裝置 218-1:通信模組 218-2:通信模組 220-1:記憶體 220-2:記憶體 222:應用程式 225:圖形使用者介面 232:模型訓練引擎 240:即時直接服裝動畫模型/服裝動畫模型 242:身體解碼器 244:服裝解碼器 246:分段工具 248:時間卷積工具 252:訓練資料庫 301:原始影像/影像/多視圖影像 302:資料預處理步驟 303:個體 304:單層表面追蹤操作 306:網格分段 308:內部層形狀估計操作/內部層形狀估計 310:服裝配準 321-1:身體網格/內部層網格 321-2:服裝網格/外部層網格 342:3D重建 344:關鍵點 346:分段呈現 354:網格/經重建網格 356:經分段網格/服裝網格 364: 服裝模板 400A:可操作塊/塊 400B:可操作塊/塊 400C:可操作塊/塊 402:資料張量/輸入張量/輸入塊 403C:上取樣操作 404:卷積運算 406:漏泄ReLU運算 408:卷積運算 410:卷積運算 412:漏泄ReLU運算 414A:輸出張量 414B:輸出張量 414C:輸出張量 500A:編碼器 500B:解碼器 500C:解碼器 500D:陰影網路 501A-1:輸入張量 501B:輸入張量 501C:輸入張量 502A-1:張量 502B-1:卷積塊 502B-2:卷積塊 502B-3:張量 502C-1:卷積塊 502C-2:張量 502D-1:下取樣 502D-2:下取樣 502D-3:上取樣 502D-4:上取樣 502D-5:上取樣 502D-6:上取樣 502D-7:上取樣 503A-1:降頻轉換塊 503A-2:降頻轉換塊 503A-3:降頻轉換塊 503A-4:降頻轉換塊 503A-5:降頻轉換塊 503A-6:降頻轉換塊 503A-7:降頻轉換塊 503B-1:升頻轉換塊 503B-2:升頻轉換塊 503B-3:升頻轉換塊 503B-4:升頻轉換塊 503B-5:升頻轉換塊 503B-6:升頻轉換塊 503C-1:升頻轉換塊 503C-2:升頻轉換塊 503C-3:升頻轉換塊 503C-4:升頻轉換塊 503C-5:升頻轉換塊 503C-6:升頻轉換塊 503D-1:張量 503D-2:張量 503D-3:張量 503D-4:張量 503D-6:張量 503D-7:張量 503D-8:張量 503D-9:張量 504A-1:張量 504A-2:張量 504A-3:張量 504A-4:張量 504A-5:張量 504A-6:張量 504A-7:張量 504B-1:張量 504B-2:張量 504B-3:張量 504B-4:張量 504B-5:張量 504B-6:張量 504C-1:張量 504C-2:張量 504C-3:張量 504C-4:張量 504C-5:張量 504C-6:張量 504D-1:卷積塊 504D-3:卷積塊 504D-4:卷積塊 504D-5:卷積塊 504D-7:卷積塊 504D-8:卷積塊 504D-9:卷積塊 505A-1:卷積塊 505A-2:卷積塊 505B:卷積 505C:卷積 505D-1:漏泄ReLU運算 505D-2:漏泄ReLU運算 505D-3:漏泄ReLU運算 505D-4:漏泄ReLU運算 505D-5:漏泄ReLU運算 505D-6:漏泄ReLU運算 506A-1:張量 506A-2:張量 506B:紋理張量 506C:紋理張量 507A-1:隱性程式碼 507A-2:噪聲塊 507B:幾何形狀張量 510-1:序連連接 510-2:序連連接 510-3:序連連接 511:陰影圖 600A:身體網路 600B:服裝網路 601A-1:骨架姿態 601A-2:人臉關鍵點 601A-3:視圖調節 601B-1:未擺姿態之服裝幾何形狀 601B-2:平均視圖紋理 603A-1:升頻轉換塊 603A-2:升頻轉換塊 603B-1:條件變分自動編碼器 604A-1:2D UV座標圖/身體幾何形狀 604A-2:身體平均視圖紋理/身體紋理 604A-3:身體殘餘紋理 604A-4:身體環境遮擋 604B-1:隱性程式碼 604B-2:塊 604B-3:隱性調節張量 604B-4:空間變化視圖調節張量 605A:陰影網路 605B:陰影網路/服裝解碼器 605B-1:視圖無關之解碼器 605B-2:視圖相關之解碼器 606B-1:服裝幾何形狀 606B-2:服裝紋理/經預測紋理 606B-3:服裝殘餘紋理 606B-4:服裝模板 607A-1:身體紋理 607A-2:最終輸出網格 608B-1:經重建紋理 608B-2:服裝陰影圖 721A-1:虛擬化身 721A-2:虛擬化身 721A-3:虛擬化身 721B-1:虛擬化身 721B-2:虛擬化身 721B-3:虛擬化身 721C-1:虛擬化身 721C-2:虛擬化身 721C-3:虛擬化身 721D-1:虛擬化身 721D-2:虛擬化身 721D-3:虛擬化身 764A:第一組服裝 764B:第二組服裝 764C:第一組服裝 764D:第一組服裝 800:方法 803A:變分自動編碼器 803B:變分自動編碼器 811A-1:輸入錨圖框/錨圖框 811A-n:輸入錨圖框 811B-1:輸入組塊圖框/第一組塊 811B-n:輸入組塊圖框 813A-1:經對準錨圖框 813A-n:經對準錨圖框 813B-1:經對準組塊圖框 813B-n:經對準組塊圖框 815:初始化 820A:可微分呈現器 820B:可微分呈現器 821A-1:初始網格 821A-n:初始網格 821B-1:初始網格 821B-n:初始網格 825A-1:經對準網格 825A-n:經對準網格 825B-1:經對準網格 825B-n:經對準網格 900:模型 921A-1:單層神經網路模型 921A-2:雙層神經網路模型 921B-1:單層神經網路模型 921B-2:雙層神經網路模型 921C-1:單層神經網路模型 921C-2:雙層神經網路模型 942A-1:身體輸出 942A-2:身體輸出 942B-1:身體輸出 942B-2:身體輸出 942C-1:身體輸出 942C-2:身體輸出 944A-1:服裝輸出 944A-2:服裝輸出 944B-1:服裝輸出/部分 944B-2:服裝輸出/部分 944C-1:服裝輸出/部分 944C-2:服裝輸出/雙層服裝輸出/部分 946B-1:部分 946B-2:部分 946C-1:部分 946C-2:部分 1000:即時三維穿著衣服之個體再現模型/模型 1021A-1:動畫虛擬化身/單層虛擬化身 1021A-2:動畫虛擬化身/單層虛擬化身 1021A-3:動畫虛擬化身/雙層虛擬化身 1021B-1:動畫虛擬化身/單層虛擬化身 1021B-2:動畫虛擬化身/單層虛擬化身 1021B-3:動畫虛擬化身/雙層虛擬化身 1044A-1:邊界區域 1044A-2:邊界區域 1044A-3:邊界區域 1044B-1:邊界區域 1044B-2:邊界區域 1044B-3:邊界區域 1046A-1:邊界區域 1046A-2:邊界區域 1046A-3:邊界區域 1046B-1:邊界區域 1046B-2:邊界區域 1046B-3:邊界區域 1100:比較 1110A-1:虛線 1110A-2:虛線 1110A-3:虛線 1121A-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121A-2:穿著衣服之虛擬化身/虛擬化身 1121B-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121B-2:穿著衣服之虛擬化身/虛擬化身 1121C-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121C-2:穿著衣服之虛擬化身/虛擬化身 1121D-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121D-2:穿著衣服之虛擬化身/虛擬化身 1121E-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121E-2:穿著衣服之虛擬化身/虛擬化身 1121F-1:即時三維穿著衣服之虛擬化身/虛擬化身 1121F-2:穿著衣服之虛擬化身/虛擬化身 1146A:區域 1146B:區域 1146C:區域 1146D:區域 1146E:區域 1146F:區域 
1200:直接服裝模型化/模型 1201-1:地面實況影像 1201-2:地面實況影像 1201-3:地面實況 1210A:圖框 1210B:圖框 1210C:圖框 1221-1:虛擬化身 1221-2:虛擬化身 1221-3:虛擬化身 1221A:虛擬化身 1221B:虛擬化身 1221C:虛擬化身 1300:方法 1302:步驟 1304:步驟 1306:步驟 1308:步驟 1310:步驟 1400:方法 1402:步驟 1404:步驟 1406:步驟 1408:步驟 1410:步驟 1412:步驟 1500:電腦系統 1502:處理器 1504:記憶體 1506:資料儲存裝置 1508:匯流排 1510:輸入/輸出模組 1512:通信模組 1514:輸入裝置 1516:輸出裝置 A:錨圖框 B:組塊 100: Architecture 110: Client Device/Client 130: Server 150: Internet 152:Database 200: Block Diagram 212-1: Processor 212-2: Processor 214: Input device 216: Output device 218-1: Communication module 218-2: Communication module 220-1: Memory 220-2: Memory 222: Apps 225: Graphical User Interface 232: Model training engine 240: Instant Direct Garment Animation Models/Clothing Animation Models 242: Body Decoder 244: Clothing Decoder 246: Segmentation Tool 248: Time Convolution Tool 252: Training database 301: Original image/image/multiview image 302: Data preprocessing steps 303: Individual 304: Single layer surface tracking operation 306: Grid Segmentation 308: Internal Layer Shape Estimation Operation / Internal Layer Shape Estimation 310: Apparel Registration 321-1: Body Mesh/Inner Layer Mesh 321-2: Clothing Mesh/External Layer Mesh 342: 3D Reconstruction 344: Key Points 346: Segmented rendering 354: Mesh/Rebuilt mesh 356: Segmented Mesh / Garment Mesh 364: Apparel Template 400A: Operable Block/Block 400B: Operable Block/Block 400C: Operable block/block 402: data tensor/input tensor/input block 403C: Upsampling operation 404: Convolution operation 406: Leaky ReLU operation 408: Convolution operation 410: Convolution operation 412: Leaky ReLU operation 414A: Output Tensor 414B: output tensor 414C: Output Tensor 500A: Encoder 500B: Decoder 500C: Decoder 500D: Shadow Network 501A-1: Input Tensor 501B: Input Tensor 501C: Input Tensor 502A-1: Tensor 502B-1: Convolutional Block 502B-2: Convolutional Block 502B-3: Tensor 502C-1: Convolution Block 502C-2: Tensor 502D-1: Downsampling 502D-2: Down Sampling 502D-3: Upsampling 502D-4: Upsampling 502D-5: Upsampling 502D-6: Upsampling 502D-7: Upsampling 503A-1: Down Converter Block 503A-2: Down Converter Block 503A-3: Down Converter Block 503A-4: Downconverter Block 503A-5: Down Converter Block 503A-6: Down Converter Block 503A-7: Down Converter Block 503B-1: Upconverter Block 503B-2: Upconverter Block 503B-3: Upconverter Block 503B-4: Upconverter Block 503B-5: Upconverter Block 503B-6: Upconverter Block 503C-1: Upconverter Block 503C-2: Upconverter Block 503C-3: Upconverter Block 503C-4: Upconverter Block 503C-5: Upconverter Block 503C-6: Upconverter Block 503D-1: Tensor 503D-2: Tensor 503D-3: Tensor 503D-4: Tensor 503D-6: Tensor 503D-7: Tensor 503D-8: Tensor 503D-9: Tensor 504A-1: Tensor 504A-2: Tensor 504A-3: Tensor 504A-4: Tensor 504A-5: Tensor 504A-6: Tensor 504A-7: Tensor 504B-1: Tensor 504B-2: Tensor 504B-3: Tensor 504B-4: Tensor 504B-5: Tensor 504B-6: Tensor 504C-1: Tensor 504C-2: Tensor 504C-3: Tensor 504C-4: Tensor 504C-5: Tensor 504C-6: Tensor 504D-1: Convolution Block 504D-3: Convolution Block 504D-4: Convolution Block 504D-5: Convolution Block 504D-7: Convolution Block 504D-8: Convolution Block 504D-9: Convolution Block 505A-1: Convolution Block 505A-2: Convolution Block 505B: Convolution 505C: Convolution 505D-1: Leaky ReLU operation 505D-2: Leaky ReLU operation 505D-3: Leaky ReLU operation 505D-4: Leaky ReLU operation 505D-5: Leaky ReLU operation 505D-6: Leaky ReLU operation 506A-1: Tensor 506A-2: Tensor 506B: Texture Tensor 506C: Texture Tensor 507A-1: Implicit Code 507A-2: Noise 
Block 507B: Geometry Tensors 510-1: Sequence connection 510-2: Sequence connection 510-3: Sequence connection 511: Shadow Map 600A: Body Network 600B: Apparel Network 601A-1: Skeleton Pose 601A-2: Face Key Points 601A-3: View Adjustment 601B-1: Unposed Garment Geometry 601B-2: Average View Texture 603A-1: Upconverter Block 603A-2: Upconverter Block 603B-1: Conditional Variational Autoencoder 604A-1: 2D UV Plot/Body Geometry 604A-2: Body Average View Texture/Body Texture 604A-3: Body Residual Texture 604A-4: Body Ambient Occlusion 604B-1: Implicit Code 604B-2: Block 604B-3: Implicit conditioning tensors 604B-4: Spatially varying view conditioning tensors 605A: Shadow Network 605B: Shadow Network/Clothing Decoder 605B-1: View Independent Decoder 605B-2: View-dependent decoder 606B-1: Garment Geometry 606B-2: Apparel Textures/Predicted Textures 606B-3: Garment Residual Texture 606B-4: Apparel Template 607A-1: Body Texture 607A-2: Final output mesh 608B-1: Retextured 608B-2: Apparel Shadow Map 721A-1: Avatar 721A-2: Avatar 721A-3: Avatar 721B-1: Virtual Avatar 721B-2: Virtual Avatar 721B-3: Virtual Avatar 721C-1: Virtual Avatar 721C-2: Virtual Avatar 721C-3: Virtual Avatar 721D-1: Avatar 721D-2: Avatar 721D-3: Avatar 764A: First set of clothing 764B: Second set of clothing 764C: First set of clothing 764D: First set of costumes 800: Method 803A: Variational Autoencoder 803B: Variational Autoencoder 811A-1: Input Anchor Frame/Anchor Frame 811A-n: Input anchor frame 811B-1: Input Block Diagram/First Block 811B-n: Input block diagram frame 813A-1: Aligned Anchor Frame 813A-n: Aligned Anchor Frame 813B-1: Aligned Block Frame 813B-n: Aligned Block Frame 815: Initialize 820A: Differentiable Renderer 820B: Differentiable Renderer 821A-1: Initial grid 821A-n: Initial grid 821B-1: Initial grid 821B-n: Initial grid 825A-1: Aligned grid 825A-n: Aligned grid 825B-1: Aligned Grid 825B-n: Aligned grid 900: Model 921A-1: Single Layer Neural Network Model 921A-2: Two-layer Neural Network Model 921B-1: Single Layer Neural Network Model 921B-2: Two-layer Neural Network Model 921C-1: Single Layer Neural Network Model 921C-2: Two-layer Neural Network Model 942A-1: Body output 942A-2: Body Output 942B-1: Body output 942B-2: Body output 942C-1: Body output 942C-2: Body output 944A-1: Garment output 944A-2: Garment output 944B-1: Apparel Output/Part 944B-2: Apparel Output/Part 944C-1: Garment output/section 944C-2: Garment Output/Double Garment Output/Part 946B-1: Section 946B-2: Section 946C-1: Section 946C-2: Section 1000: Real-time 3D representation of a clothed individual model/model 1021A-1: Animated Avatars/Single Layer Avatars 1021A-2: Animated Avatars/Single Layer Avatars 1021A-3: Animated Avatars/Double Layered Avatars 1021B-1: Animated Avatars/Single Layer Avatars 1021B-2: Animated Avatars/Single Layer Avatars 1021B-3: Animated Avatars/Double Layered Avatars 1044A-1: Boundary Area 1044A-2: Boundary Area 1044A-3: Boundary Area 1044B-1: Boundary Area 1044B-2: Boundary Area 1044B-3: Boundary Area 1046A-1: Boundary Area 1046A-2: Boundary Area 1046A-3: Boundary Area 1046B-1: Boundary Area 1046B-2: Boundary Area 1046B-3: Boundary Area 1100: Compare 1110A-1: Dotted line 1110A-2: Dotted line 1110A-3: Dotted line 1121A-1: Real-time 3D clothed avatars/avatars 1121A-2: Clothed avatars/avatars 1121B-1: Real-time 3D clothed avatars/avatars 1121B-2: Clothed avatars/avatars 1121C-1: Real-time 3D clothed avatars/avatars 1121C-2: Clothed avatars/avatars 1121D-1: Real-time 3D clothed avatars/avatars 1121D-2: Clothed 
avatars/avatars 1121E-1: Real-time 3D clothed avatars/avatars 1121E-2: Clothed avatars/avatars 1121F-1: Real-time 3D clothed avatars/avatars 1121F-2: Clothed avatars/avatars 1146A: Area 1146B: Area 1146C: Area 1146D: Area 1146E: Area 1146F: Area 1200: Direct Garment Modelling/Modeling 1201-1: Ground-truth imagery 1201-2: Ground-truth imagery 1201-3: Ground truth 1210A: Frame 1210B: Frames 1210C: Frame 1221-1: Avatars 1221-2: Avatars 1221-3: Avatars 1221A: Avatars 1221B: Virtual Avatars 1221C: Virtual Avatar 1300: Method 1302: Steps 1304: Steps 1306: Steps 1308: Steps 1310: Steps 1400: Method 1402: Steps 1404: Steps 1406: Steps 1408: Steps 1410: Steps 1412: Steps 1500: Computer Systems 1502: Processor 1504: Memory 1506: Data Storage Device 1508: Busbar 1510: Input/Output Module 1512: Communication module 1514: Input Device 1516: Output device A: Anchor frame B: Chunk

[圖1]說明根據一些實施例的適合於在虛擬實境環境中提供即時穿著衣服之個體動畫之實例架構。[FIG. 1] illustrates an example architecture suitable for providing animation of instantly clothed individuals in a virtual reality environment, according to some embodiments.

[圖2]為說明根據本揭示內容之某些態樣的來自圖1之架構之實例伺服器及用戶端的方塊圖。[FIG. 2] is a block diagram illustrating an example server and client from the architecture of FIG. 1 in accordance with certain aspects of the present disclosure.

[圖3]說明根據一些實施例的穿著衣服之身體流水線。[FIG. 3] illustrates a body line for wearing clothes, according to some embodiments.

[圖4]說明根據一些實施例的在圖1之架構中使用的網路元件及可操作塊。[FIG. 4] illustrates network elements and operational blocks used in the architecture of FIG. 1, according to some embodiments.

[圖5A]至[圖5D]說明根據一些實施例的在即時穿著衣服之個體動畫模型中使用的編碼器及解碼器架構。[FIG. 5A]-[FIG. 5D] illustrate encoder and decoder architectures used in an animated model of a real-time clothed individual, according to some embodiments.

[圖6A]至[圖6B]說明根據一些實施例的用於即時穿著衣服之個體動畫模型的身體及服裝網路之架構。[FIG. 6A]-[FIG. 6B] illustrate the architecture of a body and clothing network for an animated model of an individual clothed in real time, according to some embodiments.

[圖7]說明根據一些實施例的用於提供即時穿著衣服之個體動畫的雙層模型之紋理編輯結果。[FIG. 7] illustrates the results of texture editing of a two-layer model for providing instant animation of a clothed individual, according to some embodiments.

[圖8]說明根據一些實施例的基於逆呈現之光度對準程序。[FIG. 8] illustrates an inverse rendering based photometric alignment procedure according to some embodiments.

[圖9]說明根據一些實施例的在雙層神經網路模型與單層神經網路模型之間的個體之即時三維穿著衣服之個體再現之比較。[FIG. 9] illustrates a comparison of real-time three-dimensional clothed individual representations of individuals between a two-layer neural network model and a single-layer neural network model, according to some embodiments.

[圖10]說明根據一些實施例的即時三維穿著衣服之個體再現模型的動畫結果。[FIG. 10] Illustrates animation results of a real-time three-dimensional clothed individual rendering model in accordance with some embodiments.

[圖11]說明根據一些實施例的不同即時三維穿著衣服之個體模型之間的機率相關性之比較。[FIG. 11] illustrates a comparison of probabilistic correlations between different real-time three-dimensional clothed individual models, according to some embodiments.

[圖12]說明根據一些實施例的系統組件之消融分析。[FIG. 12] illustrates ablation analysis of system components in accordance with some embodiments.

[圖13]為說明根據一些實施例的用於訓練直接服裝模型以自多個視圖產生即時個體動畫之方法中之步驟的流程圖。[FIG. 13] is a flowchart illustrating steps in a method for training a direct clothing model to generate real-time individual animation from multiple views, according to some embodiments.

[圖14]為說明根據一些實施例的用於將直接服裝模型嵌入於虛擬實境環境中之方法中之步驟的流程圖。[FIG. 14] is a flowchart illustrating steps in a method for embedding a direct garment model in a virtual reality environment, according to some embodiments.

[圖15]為說明實例電腦系統之方塊圖,藉由該電腦系統可實施圖1及圖2之用戶端及伺服器以及圖13至圖14之方法。[FIG. 15] is a block diagram illustrating an example computer system by which the client and server of FIGS. 1 and 2 and the methods of FIGS. 13-14 may be implemented.

1300:方法 1300: Method

1302:步驟 1302: Steps

1304:步驟 1304: Steps

1306:步驟 1306: Steps

1308:步驟 1308: Steps

1310:步驟 1310: Steps

Claims (20)

一種電腦實施方法,其包含: 收集一個體之多個影像,來自該個體之該些影像包括該個體之一或多個不同視角; 基於該個體之該些影像而形成一三維服裝網格及一三維身體網格; 將該三維服裝網格與該三維身體網格對準以形成一皮膚-服裝邊界及一衣物紋理; 基於一經預測服裝位置及衣物紋理以及來自該個體之該些影像的一經內插位置及衣物紋理而判定一損耗因數;以及 根據該損耗因數更新包括該三維服裝網格及該三維身體網格之一三維模型。 A computer-implemented method comprising: collecting images of an individual, the images from the individual including one or more different perspectives of the individual; forming a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the individual; aligning the three-dimensional garment mesh with the three-dimensional body mesh to form a skin-garment boundary and a garment texture; determining a loss factor based on a predicted clothing position and clothing texture and an interpolated position and clothing texture from the images of the individual; and A three-dimensional model comprising the three-dimensional clothing mesh and the three-dimensional body mesh is updated according to the loss factor. 如請求項1之電腦實施方法,其中收集該個體之該多個影像包含藉由一同步的多攝影機系統捕捉來自該個體之該些影像。The computer-implemented method of claim 1, wherein collecting the plurality of images of the individual includes capturing the images from the individual by a synchronized multi-camera system. 如請求項1之電腦實施方法,其中形成該三維身體網格包含: 自該個體之該些影像判定一骨架姿態;以及 將具有一表面變形之一蒙皮網格添加至該骨架姿態。 The computer-implemented method of claim 1, wherein forming the three-dimensional body mesh comprises: determine a skeletal pose from the images of the individual; and Add a skinned mesh with a surface deformation to the skeletal pose. 如請求項1之電腦實施方法,其中形成該三維身體網格包含自該個體之該些影像識別該個體之經曝露皮膚部分作為該三維身體網格之部分。The computer-implemented method of claim 1, wherein forming the three-dimensional body mesh comprises identifying exposed skin portions of the individual as part of the three-dimensional body mesh from the images of the individual. 如請求項1之電腦實施方法,其中形成該三維服裝網格包含藉由驗證該三維服裝網格中之一頂點之一投影屬於每一攝影機視圖上之一服裝區段來識別該頂點。The computer-implemented method of claim 1, wherein forming the three-dimensional garment mesh includes identifying a vertex in the three-dimensional garment mesh by verifying that a projection of the vertex belongs to a garment segment on each camera view. 如請求項1之電腦實施方法,其中將該三維服裝網格與該三維身體網格對準包含自該三維服裝網格選擇一服裝區段以及自該三維身體網格選擇一身體區段並將該服裝區段與該身體區段對準。The computer-implemented method of claim 1, wherein aligning the three-dimensional garment mesh with the three-dimensional body mesh comprises selecting a garment segment from the three-dimensional garment mesh and selecting a body segment from the three-dimensional body mesh and placing The garment section is aligned with the body section. 如請求項1之電腦實施方法,其中形成該三維服裝網格及該三維身體網格包含: 自該個體之該些影像偵測一或多個二維關鍵點;以及 自不同視點對多個影像進行三角剖分以將該些二維關鍵點轉換成形成該三維身體網格或該三維服裝網格之三維關鍵點。 The computer-implemented method of claim 1, wherein forming the three-dimensional clothing mesh and the three-dimensional body mesh comprises: detect one or more two-dimensional keypoints from the images of the individual; and The images are triangulated from different viewpoints to convert the 2D keypoints into 3D keypoints forming the 3D body mesh or the 3D clothing mesh. 如請求項1之電腦實施方法,其中將該三維服裝網格與該三維身體網格對準包含: 將該三維服裝網格與一第一模板對準以及將該三維身體網格與一第二模板對準;以及 選擇一明確約束來區分該第一模板與該第二模板。 The computer-implemented method of claim 1, wherein aligning the three-dimensional garment mesh with the three-dimensional body mesh comprises: aligning the three-dimensional garment mesh with a first template and aligning the three-dimensional body mesh with a second template; and An explicit constraint is selected to distinguish the first template from the second template. 
9. The computer-implemented method of claim 1, further comprising animating the three-dimensional model using a temporal encoder for multiple skeletal poses, and correlating each skeletal pose with the three-dimensional clothing mesh.

10. The computer-implemented method of claim 1, further comprising determining an animation loss factor based on multiple frames of the three-dimensional clothing mesh concatenated within a preselected time window, as predicted by an animation model and as derived from the images within the preselected time window, and updating the animation model based on the animation loss factor.

11. A system, comprising:
a memory storing multiple instructions; and
one or more processors configured to execute the instructions to cause the system to:
collect a plurality of images of a subject, the images of the subject including one or more views from different profiles of the subject;
form a three-dimensional clothing mesh and a three-dimensional body mesh based on the images of the subject;
align the three-dimensional clothing mesh to the three-dimensional body mesh to form a skin-clothing boundary and a garment texture;
determine a loss factor based on a predicted garment position and texture and an interpolated position and texture from the images of the subject; and
update a three-dimensional model including the three-dimensional clothing mesh and the three-dimensional body mesh according to the loss factor,
wherein collecting the plurality of images of the subject comprises capturing the images of the subject with a synchronized multi-camera system.

12. The system of claim 11, wherein, to form the three-dimensional body mesh, the one or more processors execute the instructions to:
determine a skeletal pose from the images of the subject; and
add a skinned mesh with a surface deformation to the skeletal pose.

13. The system of claim 11, wherein, to form the three-dimensional body mesh, the one or more processors execute the instructions to identify, from the images of the subject, exposed skin portions of the subject as part of the three-dimensional body mesh.

14. The system of claim 11, wherein, to form the three-dimensional clothing mesh, the one or more processors execute the instructions to identify a vertex in the three-dimensional clothing mesh by verifying that a projection of the vertex belongs to a clothing segment in each camera view.
15. The system of claim 11, wherein, to align the three-dimensional clothing mesh to the three-dimensional body mesh, the one or more processors execute the instructions to select a clothing segment from the three-dimensional clothing mesh and a body segment from the three-dimensional body mesh, and to align the clothing segment with the body segment.

16. A computer-implemented method, comprising:
collecting an image from a subject;
selecting multiple two-dimensional keypoints from the image;
identifying, from the image, a three-dimensional keypoint associated with each two-dimensional keypoint;
determining, with a three-dimensional model, a three-dimensional clothing mesh and a three-dimensional body mesh, the three-dimensional clothing mesh and the three-dimensional body mesh being anchored in one or more three-dimensional skeletal poses;
generating a three-dimensional representation of the subject including the three-dimensional clothing mesh, the three-dimensional body mesh, and a texture; and
embedding the three-dimensional representation of the subject in a virtual reality environment, in real time.

17. The computer-implemented method of claim 16, wherein identifying the three-dimensional keypoint associated with each two-dimensional keypoint comprises interpolating along viewpoints of the image to project the image in three dimensions.

18. The computer-implemented method of claim 16, wherein determining the three-dimensional clothing mesh and the three-dimensional body mesh comprises determining a loss factor of the three-dimensional skeletal poses based on the two-dimensional keypoints.

19. The computer-implemented method of claim 16, wherein embedding the three-dimensional representation of the subject in the virtual reality environment comprises selecting a garment texture in the three-dimensional body mesh according to the virtual reality environment.

20. The computer-implemented method of claim 16, wherein embedding the three-dimensional representation of the subject in the virtual reality environment comprises animating the three-dimensional representation of the subject to interact with the virtual reality environment.
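Claims 7 and 16 above recite detecting two-dimensional keypoints in the captured images and converting them into three-dimensional keypoints by triangulating views from different viewpoints. The snippet below is a minimal sketch of that lifting step using direct linear transform (DLT) triangulation; the NumPy layout, the projection-matrix inputs, and the function names are illustrative assumptions and do not reproduce the patent's own implementation.

```python
import numpy as np

def triangulate_keypoint(proj_mats, points_2d):
    """Triangulate one 3D keypoint from its 2D detections in several views.

    proj_mats: list of 3x4 camera projection matrices (one per view).
    points_2d: list of (x, y) pixel coordinates of the same keypoint.
    Returns the 3D point minimizing the algebraic reprojection error (DLT).
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Solve A X = 0 in the least-squares sense via SVD; X is homogeneous.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def lift_keypoints(proj_mats, keypoints_2d):
    """Lift 2D keypoints of shape (n_views, n_points, 2) to 3D (n_points, 3)."""
    n_points = keypoints_2d.shape[1]
    return np.array([
        triangulate_keypoint(proj_mats, keypoints_2d[:, k, :])
        for k in range(n_points)
    ])
```

The resulting three-dimensional keypoints would then serve as anchors for the three-dimensional body mesh or clothing mesh, as the claims describe.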
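Claims 3 and 12 recite adding a skinned mesh with a surface deformation to a skeletal pose determined from the images. One common way to realize such a skinned mesh is linear blend skinning with a per-vertex corrective offset; the sketch below illustrates that idea under assumed array layouts and is not taken from the patent.

```python
import numpy as np

def linear_blend_skinning(rest_verts, skin_weights, bone_transforms, surface_offsets=None):
    """Pose a template body mesh from a skeletal pose.

    rest_verts:      (V, 3) template vertices in the rest pose.
    skin_weights:    (V, J) per-vertex weights over J bones (rows sum to 1).
    bone_transforms: (J, 4, 4) rigid transforms mapping rest-pose bones to the
                     current skeletal pose.
    surface_offsets: optional (V, 3) per-vertex surface deformation applied in
                     the rest pose before skinning (e.g. a learned corrective).
    """
    verts = rest_verts if surface_offsets is None else rest_verts + surface_offsets
    verts_h = np.concatenate([verts, np.ones((verts.shape[0], 1))], axis=1)  # (V, 4)
    # Blend the bone transforms per vertex, then apply the blended transform.
    blended = np.einsum('vj,jab->vab', skin_weights, bone_transforms)        # (V, 4, 4)
    posed = np.einsum('vab,vb->va', blended, verts_h)                        # (V, 4)
    return posed[:, :3]
```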
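Claims 1, 10, 11, and 18 recite loss factors that compare quantities predicted by the model against quantities derived (for example, by interpolation) from the captured images, and recite updating the model according to those losses. The fragment below sketches one plausible form of such an update, with a geometry term and a texture term, using PyTorch; the loss weighting, the optimizer, and the variable names are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def clothing_loss(pred_verts, pred_texture, gt_verts, gt_texture,
                  geometry_weight=1.0, texture_weight=1.0):
    """Loss factor with a geometry term (predicted vs. interpolated clothing
    vertex positions) and an appearance term (predicted vs. reference garment
    texture)."""
    geometry_term = F.mse_loss(pred_verts, gt_verts)
    texture_term = F.l1_loss(pred_texture, gt_texture)
    return geometry_weight * geometry_term + texture_weight * texture_term

# Hypothetical update step for a model that predicts clothing vertices and a
# garment texture from a skeletal pose:
#   pred_verts, pred_texture = model(skeleton_pose)
#   loss = clothing_loss(pred_verts, pred_texture, gt_verts, gt_texture)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```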
TW111103481A 2021-01-27 2022-01-27 Direct clothing modeling for a drivable full-body avatar TW202230291A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163142460P 2021-01-27 2021-01-27
US63/142,460 2021-01-27
US17/576,787 US20220237879A1 (en) 2021-01-27 2022-01-14 Direct clothing modeling for a drivable full-body avatar
US17/576,787 2022-01-14

Publications (1)

Publication Number Publication Date
TW202230291A true TW202230291A (en) 2022-08-01

Family

ID=80787063

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111103481A TW202230291A (en) 2021-01-27 2022-01-27 Direct clothing modeling for a drivable full-body avatar

Country Status (3)

Country Link
EP (1) EP4285333A1 (en)
TW (1) TW202230291A (en)
WO (1) WO2022164995A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020131518A1 (en) * 2018-12-19 2020-06-25 Seddi, Inc. Learning-based animation of clothing for virtual try-on

Also Published As

Publication number Publication date
EP4285333A1 (en) 2023-12-06
WO2022164995A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
Tiwari et al. Pose-ndf: Modeling human pose manifolds with neural distance fields
Tewari et al. Fml: Face model learning from videos
Tran et al. On learning 3d face morphable model from in-the-wild images
Li et al. Monocular real-time volumetric performance capture
Dou et al. Fusion4d: Real-time performance capture of challenging scenes
US10679046B1 (en) Machine learning systems and methods of estimating body shape from images
Khakhulin et al. Realistic one-shot mesh-based head avatars
CN110637323B (en) Method, device and system for part-based tracking
US10867453B2 (en) Method and system for generating an image file of a 3D garment model on a 3D body model
Wood et al. Gazedirector: Fully articulated eye gaze redirection in video
Stoll et al. Fast articulated motion tracking using a sums of gaussians body model
US11158121B1 (en) Systems and methods for generating accurate and realistic clothing models with wrinkles
US20180197331A1 (en) Method and system for generating an image file of a 3d garment model on a 3d body model
Tretschk et al. Demea: Deep mesh autoencoders for non-rigidly deforming objects
Ranjan et al. Learning multi-human optical flow
US8824801B2 (en) Video processing
US20220237879A1 (en) Direct clothing modeling for a drivable full-body avatar
US12026892B2 (en) Figure-ground neural radiance fields for three-dimensional object category modelling
Su et al. Danbo: Disentangled articulated neural body representations via graph neural networks
US11989846B2 (en) Mixture of volumetric primitives for efficient neural rendering
Siarohin et al. Unsupervised volumetric animation
US20230126829A1 (en) Point-based modeling of human clothing
Nguyen-Ha et al. Free-viewpoint rgb-d human performance capture and rendering
Peng et al. Implicit neural representations with structured latent codes for human body modeling
Huang et al. A bayesian approach to multi-view 4d modeling