TW202226164A - Pixel-aligned volumetric avatars - Google Patents

Pixel-aligned volumetric avatars

Info

Publication number
TW202226164A
Authority
TW
Taiwan
Prior art keywords
individual
images
model
computer
dimensional
Prior art date
Application number
TW110148213A
Other languages
Chinese (zh)
Inventor
史蒂芬 安東尼 倫巴地
傑森 薩拉吉
克羅伊茲 湯瑪士 西蒙
齊藤俊介
麥克 瑟荷佛
阿米特 拉吉
詹姆士 亨利 海斯
Original Assignee
美商菲絲博克科技有限公司
Priority claimed from US 17/556,367 (published as US 20220198731 A1)
Application filed by 美商菲絲博克科技有限公司
Publication of TW202226164A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/275 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N13/279 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals the virtual viewpoint locations being selected by the viewers or determined by tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)
  • Image Generation (AREA)
  • Image Analysis (AREA)

Abstract

A method of forming a pixel-aligned volumetric avatar includes receiving multiple two-dimensional images having at least two or more fields of view of a subject. The method also includes extracting multiple image features from the two-dimensional images using a set of learnable weights, projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer, and providing, to the viewer, an image of the three-dimensional model of the subject. A system and a non-transitory, computer-readable medium storing instructions to perform the above method are also provided.

Description

Pixel-aligned volumetric avatars

The present disclosure relates to generating faithful facial expressions in virtual reality (VR) and augmented reality (AR) applications, for producing real-time volumetric avatars. More specifically, the present disclosure provides real-time volumetric avatars in multi-identity settings for VR/AR applications.

The present disclosure is related to U.S. Provisional Application No. 63/129,989, entitled "LEARNING TO PREDICT IMPLICIT VOLUMETRIC AVATARS," filed on December 23, 2020 by Lombardi et al., and claims priority from that application under 35 U.S.C. § 119(e); the entire contents of that application are hereby incorporated by reference for all purposes. The present disclosure is also related to U.S. Non-Provisional Application No. 17/556,367, entitled "PIXEL-ALIGNED VOLUMETRIC AVATARS," filed on December 20, 2021, and claims priority from that application under 35 U.S.C. § 119(e); the entire contents of that application are likewise incorporated by reference for all purposes.

In the field of VR/AR applications, the acquisition and rendering of photorealistic human heads is a challenging problem for achieving virtual telepresence. Currently, the highest quality is achieved by volumetric approaches trained on multi-view data in a person-specific manner. Compared to simpler mesh-based models, these models better represent fine structures such as hair. Volumetric models typically use a global code to represent facial expressions, so that they can be driven by a small set of animation parameters. While such architectures achieve impressive rendering quality, they cannot easily be extended to multi-identity settings, are computationally expensive, and are difficult to put into practice in "real-time" applications.

In a first embodiment, a computer-implemented method includes: receiving multiple two-dimensional images having at least two or more fields of view of a subject; extracting multiple image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer; and providing, to the viewer, an image of the three-dimensional model of the subject.

In a second embodiment, a system includes a memory storing multiple instructions, and one or more processors configured to execute the instructions to cause the system to perform operations. The operations include: receiving multiple two-dimensional images having at least two or more fields of view of a subject; extracting multiple image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer; and providing, to the viewer, an autostereoscopic image of the three-dimensional model of the subject.

In a third embodiment, a computer-implemented method for training a model to provide a view of a subject to an autostereoscopic display in a virtual reality headset includes collecting multiple ground-truth images from the faces of multiple users. The computer-implemented method also includes: rectifying the ground-truth images with stored, calibrated stereoscopic image pairs; generating, with a three-dimensional face model, multiple synthetic views of the subjects, wherein the synthetic views of the subjects include an interpolation of multiple feature maps projected along different directions corresponding to multiple views of the subjects; and training the three-dimensional face model based on a difference between the ground-truth images and the synthetic views of the subjects.

In another embodiment, a method includes: receiving multiple two-dimensional images having at least two or more fields of view of a subject; extracting multiple image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer; and providing, to the viewer, an image of the three-dimensional model of the subject.

In yet another embodiment, a system includes a first means for storing instructions and a second means for executing the instructions to cause the system to perform a method. The method includes: receiving multiple two-dimensional images having at least two or more fields of view of a subject; extracting multiple image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer; and providing, to the viewer, an image of the three-dimensional model of the subject.

In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one of ordinary skill in the art that embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail so as not to obscure the disclosure.
General Overview

Virtual telepresence applications attempt to represent human heads with high accuracy and fidelity. Human heads are challenging to model and render due to their complex geometry and appearance properties: subsurface scattering of skin, fine-scale surface detail, thin-structured hair, and human eyes and teeth that are both specular and translucent. Existing approaches rely on complex and expensive multi-view capture rigs (with up to hundreds of cameras) to reconstruct even a person-specific model of a human head. Currently, high-quality approaches use volumetric models rather than textured meshes, because volumetric models are better at learning to represent fine structures on the face, such as hair, which are critical for a photorealistic appearance. Volumetric models typically use a global code to represent facial expressions, or are only suitable for static scenes. While such architectures achieve impressive rendering quality, they are difficult to adapt to multi-identity settings. For example, a global code used to control expressions is insufficient to model identity variations across subjects. Some attempts to solve this problem include implicit models for representing scenes and objects. These models have the advantage that the scene is represented as a parametric function in a continuous space, which allows fine-grained inference of geometry and texture. However, these approaches fail to model view-dependent effects, such as those exhibited by hair with textured surfaces. Some approaches generalize across objects, but only at low resolution, and handle only purely Lambertian surfaces, which is not sufficient for human heads. One drawback of the above approaches is that they are trained to model only a single scene or object. Approaches that can generate multiple objects are typically limited in the quality and resolution of the predicted texture and geometry. For example, approaches such as scene representation networks (SRNs), which generate a set of weights from a global image encoding (e.g., a single latent code vector per image), are difficult to generalize to local changes (e.g., facial expressions) and fail to recover high-frequency details even when those changes are visible in the input images. This is because the global latent code summarizes the information in an image, and some information must be discarded to produce a compact encoding of the data.

To address the above problems in the field of immersive reality applications for computer networks, embodiments as disclosed herein implement a predicted volumetric avatar of a human head from a limited number of inputs. To achieve this, embodiments as disclosed herein enable model generalization across multiple identities via a parameterization that combines neural radiance fields with local, pixel-aligned features extracted directly from the model inputs. This approach results in a shallow and simple network that can be implemented in real-time immersive applications. In some embodiments, a model trained on a photometric re-rendering loss function can render subject-based avatars in real time without using explicit 3D supervision. Models as disclosed herein produce faithful facial expressions in multi-identity settings, and are therefore applicable in the field of real-time group immersive applications. Embodiments as disclosed herein generalize in real time to multiple unseen identities and expressions, and provide a good representation of temporal image sequences.

Some embodiments include a pixel-aligned volumetric avatar (PVA) model for estimating a volumetric 3D avatar using only a few input images of a human head. The PVA model can generalize in real time to unseen identities. To improve generalization across identities, the PVA model parameterizes the volumetric model via local, pixel-aligned features extracted from the input images. Accordingly, the PVA model can synthesize novel views of unseen identities and expressions while preserving high-frequency details in the rendered avatar. In addition, some embodiments include a pixel-aligned radiance field that predicts implicit shape and appearance from a sparse set of posed images, for any point in space and any view direction.
Example System Architecture

FIG. 1 illustrates an example architecture 100 suitable for accessing a volumetric avatar model engine, according to some embodiments. Architecture 100 includes servers 130 communicatively coupled to client devices 110 and at least one database 152 via a network 150. One of the many servers 130 is configured to host a memory including instructions which, when executed by a processor, cause the server 130 to perform at least some of the steps in methods as disclosed herein. In some embodiments, the processor is configured to control a graphical user interface (GUI) for a user of one of the client devices 110 to access the volumetric avatar model engine with an immersive reality application. Accordingly, the processor may include a dashboard tool configured to display components and graphical results to the user via the GUI. For load-balancing purposes, multiple servers 130 may host memories including instructions to one or more processors, and multiple servers 130 may host a history log and a database 152 including multiple training archives for the volumetric avatar model engine. Moreover, in some embodiments, multiple users of client devices 110 may access the same volumetric avatar model engine to run one or more immersive reality applications. In some embodiments, a single user with a single client device 110 may provide images and data to train one or more machine learning models running in parallel in one or more servers 130. Accordingly, client devices 110 and servers 130 may communicate with each other via network 150 and the resources located therein, such as data in database 152.

Servers 130 may include any device having an appropriate processor, memory, and communication capability for hosting the volumetric avatar model engine, including the multiple tools associated with it. The volumetric avatar model engine may be accessible by various clients 110 over network 150. Clients 110 may be, for example, desktop computers, mobile computers, tablet computers (e.g., including e-book readers), mobile devices (e.g., smartphones or PDAs), or any other device having an appropriate processor, memory, and communication capability for accessing the volumetric avatar model engine on one or more of servers 130. In some embodiments, a client device 110 may include a VR/AR headset configured to run an immersive reality application using a volumetric avatar model supported by one or more of servers 130. Network 150 can include, for example, any one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, network 150 can include, but is not limited to, any one or more of the following network topologies: a bus network, a star network, a ring network, a mesh network, a star-bus network, a tree or hierarchical network, and the like.

FIG. 2 is a block diagram 200 illustrating an example server 130 and client device 110 from architecture 100, according to certain aspects of the present disclosure. Client device 110 and server 130 are communicatively coupled over network 150 via respective communication modules 218-1 and 218-2 (hereinafter, collectively referred to as "communication modules 218"). Communication modules 218 are configured to interface with network 150 to send and receive information, such as data, requests, responses, and commands, to other devices via network 150. Communication modules 218 can be, for example, modems or Ethernet cards, and may include radio hardware and software for wireless communications (e.g., via electromagnetic radiation, such as radio frequency (RF), near-field communication (NFC), WiFi, and Bluetooth radio technologies). A user may interact with client device 110 via an input device 214 and an output device 216. Input device 214 may include a mouse, a keyboard, a pointer, a touchscreen, a microphone, a joystick, a virtual joystick, and the like. In some embodiments, input device 214 may include cameras, microphones, and sensors, such as touch sensors, acoustic sensors, inertial motion units (IMUs), and other sensors configured to provide input data to a VR/AR headset. For example, in some embodiments, input device 214 may include an eye-tracking device to detect the position of the user's pupil in the VR/AR headset. Output device 216 may be a screen display, a touchscreen, a speaker, and the like. Client device 110 may include a memory 220-1 and a processor 212-1. Memory 220-1 may include an application 222 and a GUI 225, configured to run in client device 110 and coupled to input device 214 and output device 216. Application 222 may be downloaded by the user from server 130 and may be hosted by server 130. In some embodiments, client device 110 is a VR/AR headset and application 222 is an immersive reality application.

Server 130 includes a memory 220-2, a processor 212-2, and communication module 218-2. Hereinafter, processors 212-1 and 212-2, and memories 220-1 and 220-2, will be collectively referred to, respectively, as "processors 212" and "memories 220." Processors 212 are configured to execute instructions stored in memories 220. In some embodiments, memory 220-2 includes a volumetric avatar model engine 232. Volumetric avatar model engine 232 may share or provide features and resources to GUI 225, including multiple tools associated with training and using a three-dimensional avatar rendering model for immersive reality applications (e.g., application 222). The user may access volumetric avatar model engine 232 through application 222, installed in memory 220-1 of client device 110. Accordingly, application 222, including GUI 225, may be installed by server 130 and may execute scripts and other routines provided by server 130 through any one of multiple tools. Execution of application 222 may be controlled by processor 212-1.

In that regard, volumetric avatar model engine 232 may be configured to create, store, update, and maintain a PVA model 240, as disclosed herein. PVA model 240 may include an encoder-decoder tool 242, a ray marching tool 244, and a radiance field tool 246. Encoder-decoder tool 242 collects input images having multiple simultaneous views of a subject and extracts pixel-aligned features to condition radiance field tool 246 via a ray marching procedure in ray marching tool 244. PVA model 240 can generate novel views of unseen subjects from one or more sample images processed by encoder-decoder tool 242. In some embodiments, encoder-decoder tool 242 is a shallow convolutional network (e.g., including a few one- or two-node layers). In some embodiments, radiance field tool 246 converts a three-dimensional location and pixel-aligned features into color and opacity fields that can be projected in any desired view direction.

In some embodiments, volumetric avatar model engine 232 may access one or more machine learning models stored in a training database 252. Training database 252 includes training archives and other data files that volumetric avatar model engine 232 may use for training a machine learning model, according to input from the user through application 222. Moreover, in some embodiments, at least one or more training archives or machine learning models may be stored in either of memories 220, and the user may access them through application 222.

Volumetric avatar model engine 232 may include algorithms trained for the specific purposes of the engines and tools included therein. The algorithms may include machine learning or artificial intelligence algorithms making use of any linear or non-linear algorithm, such as a neural network algorithm or a multivariate regression algorithm. In some embodiments, the machine learning model may include a neural network (NN), a convolutional neural network (CNN), a generative adversarial network (GAN), a deep reinforcement learning (DRL) algorithm, a deep recurrent neural network (DRNN), a classic machine learning algorithm such as a random forest, a k-nearest neighbor (KNN) algorithm, a k-means clustering algorithm, or any combination thereof. More generally, the machine learning model may include any machine learning model involving a training step and an optimization step. In some embodiments, training database 252 may include a training archive used to modify coefficients according to a desired outcome of the machine learning model. Accordingly, in some embodiments, volumetric avatar model engine 232 is configured to access training database 252 to retrieve documents and archives as inputs for the machine learning model. In some embodiments, volumetric avatar model engine 232, the tools contained therein, and at least part of training database 252 may be hosted in a different server that is accessible by server 130 or client device 110.

FIG. 3 illustrates a block diagram of a model architecture 300 for 3D rendering of a portion of a face of a VR/AR headset user, according to some embodiments. Model architecture 300 is a pixel-aligned volumetric avatar (PVA) model. PVA model 300 is learned from a multi-view image collection that produces multiple 2D input images 301-1, 301-2, and 301-n (hereinafter, collectively referred to as "input images 301"). Each of input images 301 is associated with a camera view vector, v_i (e.g., v_1, v_2, and v_n), indicating the direction from which the particular image views the user's face. In some embodiments, input images 301 are collected simultaneously or quasi-simultaneously, so that the different view vectors v_i point to the same volumetric representation of the subject. Each of the vectors v_i corresponds to a known viewpoint 311, associated with camera intrinsics K_i and a rotation R_i (e.g., {K_i, [R|t]_i}). Camera intrinsics K_i may include brightness, a color mapping, sensor efficiency, and other camera-dependent parameters. Rotation R_i indicates the orientation of (and distance to) the subject's head relative to the camera. Different camera sensors respond slightly differently to the same incident radiation, despite being the same camera model. If nothing is done to address this, the intensity differences end up baked into the scene representation N, which causes images to brighten or darken unnaturally from certain viewpoints. To address this, a per-camera bias and gain value is learned. This gives the system an "easier" way to explain this variation in the data.
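
For illustration, a minimal sketch of such a per-camera gain and bias correction is shown below. This is not the patent's implementation; the function name, the per-channel shapes, and the clipping to [0, 1] are assumptions made purely for this example.

```python
import numpy as np

def apply_camera_calibration(rgb, gain, bias):
    """Apply a learned per-camera gain/bias so that sensor-specific intensity
    differences are not absorbed into the scene representation N.

    rgb  : (H, W, 3) rendered image with values in [0, 1]
    gain : (3,) learned multiplicative gain for this camera
    bias : (3,) learned additive bias for this camera
    """
    return np.clip(rgb * gain + bias, 0.0, 1.0)

# Hypothetical usage: one (gain, bias) pair is learned per training camera.
rendered = np.random.rand(4, 4, 3)
calibrated = apply_camera_calibration(rendered,
                                      gain=np.array([1.02, 0.98, 1.00]),
                                      bias=np.array([0.01, -0.01, 0.00]))
```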

The value of "n" is purely illustrative, as one of ordinary skill will recognize that any number, n, of input images 301 may be used. PVA model 300 produces a volumetric rendering 321 of the headset user. Volumetric rendering 321 is a 3D model (e.g., an "avatar") that can be used to generate a 2D image of the subject from a target viewpoint. This 2D image changes as the target viewpoint changes (e.g., as a viewer moves around the headset user).

PVA model 300 includes a convolutional encoder-decoder 310A, a ray marching stage 310B, and a radiance field stage 310C (hereinafter, collectively referred to as "PVA stages 310"). PVA model 300 is trained, using gradient descent, with input images 301 selected from a multi-identity training corpus. Accordingly, PVA model 300 includes a loss function defined between the predicted images and the corresponding ground truth for multiple subjects. This enables PVA model 300 to render an accurate volumetric rendering 321 independently of the subject.

Convolutional encoder-decoder network 310A takes input images 301 and produces pixel-aligned feature maps 303-1, 303-2, and 303-n (hereinafter, collectively referred to as "feature maps 303," f^(i)). Ray marching stage 310B follows each of the pixels along a ray in a target view, j, defined by {K_j, [R|t]_j}, accumulating the color, c, and optical density ("opacity") produced at each point by radiance field stage 310C. Radiance field stage 310C (N) converts the 3D location and pixel-aligned features into color and opacity, to render a radiance field 315 (c, σ).

Input images 301 are 3D objects, having a height (h) and a width (w) corresponding to the 2D image collected by the camera along direction v_i, and a depth of three layers, one for each color pixel: R, G, and B. Feature maps 303 are 3D objects with dimensions h×w×d. Encoder-decoder network 310A encodes input images 301 using learnable weights 320-1, 320-2 ... 320-n (hereinafter, collectively referred to as "learnable weights 320"). Ray marching stage 310B performs world-to-camera projection 323, bilinear interpolation 325, positional encoding 327, and feature aggregation 329.
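
A minimal sketch of a shallow convolutional encoder-decoder that maps an h×w×3 conditioning image to an h×w×d pixel-aligned feature map, in the spirit of encoder-decoder network 310A, is shown below. The choice of PyTorch, the layer widths and depths, and d = 16 are assumptions made for this example, not values specified by the disclosure.

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """Stand-in for encoder-decoder 310A: image (B, 3, h, w) -> features (B, d, h, w)."""

    def __init__(self, d: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, d, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Encode to a lower resolution, then decode back to full resolution features.
        return self.decoder(self.encoder(image))

features = ShallowFeatureExtractor(d=16)(torch.rand(1, 3, 256, 256))  # -> (1, 16, 256, 256)
```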

In some embodiments, for conditioning views v_i ∈ R^(h×w×3), radiance field 315 may be defined as the following function of the feature maps 303:

(c, σ) = N( ϕ(X), f_X )     (1)

where ϕ(X): R^3 → R^(6×l) is the positional encoding of a point 330 (X ∈ R^3) with 2×l different basis functions. Point 330 (X) is a point along a ray that runs from the 2D images of the subject toward a selected viewpoint 331, r_0. A feature map 303 (f^(i) ∈ R^(h×w×d)) is associated with camera position vector v_i, where d is the number of feature channels, h and w are the image height and width, and f_X ∈ R^(d′) is the aggregated image feature associated with point X. For each feature map f^(i), ray marching stage 310B obtains a per-view feature f_X^(i) ∈ R^d by projecting the 3D point X along the ray, using the camera intrinsics (K) and extrinsics (R, t) of the specific viewpoint:

f_X^(i) = F( f^(i), Π(X; K_i, [R|t]_i) )     (3)

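As an aside, the positional encoding ϕ(X): R^3 → R^(6×l) described above can be sketched as follows; the sin/cos basis and the frequency schedule (powers of two times π) are assumptions borrowed from common NeRF-style encodings, and are not necessarily the exact basis functions used here.

```python
import numpy as np

def positional_encoding(X, l=10):
    """phi(X): R^3 -> R^(6*l). Each of the 3 coordinates is passed through
    2*l basis functions (sin and cos at l frequencies)."""
    X = np.asarray(X, dtype=np.float64)          # (..., 3)
    freqs = (2.0 ** np.arange(l)) * np.pi        # (l,) assumed frequency bands
    angles = X[..., None] * freqs                # (..., 3, l)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., 3, 2l)
    return enc.reshape(*X.shape[:-1], 6 * l)

phi = positional_encoding(np.array([0.1, -0.2, 0.4]), l=10)  # shape (60,)
```
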
where Π is the perspective projection function onto the camera pixel coordinates, and F(f, x) is the bilinear interpolation 325 of f at pixel location x. Ray marching stage 310B combines the pixel-aligned features f_X^(i) from the multiple images for radiance field stage 310C.
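
For concreteness, a small sketch of projecting a 3D point into a conditioning camera and bilinearly sampling its feature map, i.e., f_X^(i) = F(f^(i), Π(X; K_i, [R|t]_i)), is given below. The pinhole convention x_cam = R·X + t and the border clamping are assumptions of this example.

```python
import numpy as np

def project_point(X, K, R, t):
    """Pi(X): world point (3,) -> pixel coordinates (x, y) for a camera
    with intrinsics K (3x3), rotation R (3x3), and translation t (3,)."""
    uvw = K @ (R @ X + t)          # world -> camera -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def bilinear_sample(feature_map, x, y):
    """F(f, x): bilinearly interpolate an (H, W, d) feature map at pixel (x, y)."""
    H, W, _ = feature_map.shape
    x0 = int(np.clip(np.floor(x), 0, W - 1))
    y0 = int(np.clip(np.floor(y), 0, H - 1))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx = float(np.clip(x - x0, 0.0, 1.0))
    wy = float(np.clip(y - y0, 0.0, 1.0))
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

def pixel_aligned_feature(X, feature_map, K, R, t):
    """f_X^(i) = F(f^(i), Pi(X; K_i, [R|t]_i))."""
    x, y = project_point(X, K, R, t)
    return bilinear_sample(feature_map, x, y)
```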

For each given training image v_j, with camera intrinsics K_j, rotation R_j, and translation t_j, the predicted color of a pixel p ∈ R^2 for a given viewpoint, with camera focal plane and center 331 (r_0 ∈ R^3), is obtained by marching rays into the scene using the camera-to-world projection matrix P^(−1) = [R_j|t_j]^(−1) K_j^(−1), with the ray direction given by:

d = ( P^(−1) p̄ − r_0 ) / ‖ P^(−1) p̄ − r_0 ‖     (5)

where p̄ denotes pixel p in homogeneous coordinates.

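A minimal sketch of forming the camera ray for a pixel, consistent with the camera-to-world mapping above, is given below; the convention x_cam = R·X_world + t (so that the camera center is r_0 = −Rᵀ·t) is an assumption of this example.

```python
import numpy as np

def ray_through_pixel(p, K, R, t):
    """Return the camera center r_0 and the unit ray direction d through pixel p = (x, y)."""
    K, R, t = (np.asarray(a, dtype=np.float64) for a in (K, R, t))
    r0 = -R.T @ t                                   # camera center in world coordinates
    p_h = np.array([p[0], p[1], 1.0])               # pixel in homogeneous coordinates
    X_world = R.T @ (np.linalg.inv(K) @ p_h - t)    # back-project the pixel at unit depth
    d = X_world - r0
    return r0, d / np.linalg.norm(d)
```
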
Ray marching stage 310B accumulates radiance and opacity values along a ray 335 defined by r(t) = r_0 + t·d, with t ∈ [t_near, t_far], as:

I(p) = ∫_{t_near}^{t_far} T(t)·σ(r(t))·c(r(t), d) dt     (6)

where

T(t) = exp( − ∫_{t_near}^{t} σ(r(s)) ds )     (7)

In some embodiments, ray marching stage 310B uniformly samples a set of n_s points t ∼ [t_near, t_far]. Setting X = r(t), the quadrature rule can be used to approximate integrals (6) and (7). A function I_α(p) may be defined as:

I_α(p) = Σ_{i=1}^{n_s} ( Π_{j<i} (1 − α_j) )·α_i     (8)

where α_i = 1 − exp(−δ_i·σ_i), and δ_i is the distance between the (i+1)-th and the i-th sample point along ray 335.
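
The discrete accumulation in equations (6) through (8) can be sketched as follows; the sample colors, densities, and spacings are assumed to be given (e.g., from the uniformly sampled points along the ray), and the helper name is hypothetical.

```python
import numpy as np

def composite_along_ray(colors, sigmas, deltas):
    """Quadrature approximation of the ray integral.

    colors : (n_s, 3) radiance c_i at the samples along the ray
    sigmas : (n_s,)   densities sigma_i
    deltas : (n_s,)   spacings delta_i between consecutive samples
    Returns the accumulated color I(p) and accumulated opacity I_alpha(p).
    """
    alphas = 1.0 - np.exp(-deltas * sigmas)                           # alpha_i
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]    # prod_{j<i} (1 - alpha_j)
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights.sum()
```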

In a multi-view setting with known camera viewpoints v_i and a fixed number of conditioning views, ray marching stage 310B aggregates features by simple concatenation. Specifically, for n conditioning images {v_i}_{i=1}^{n} with corresponding rotation and translation matrices {R_i}_{i=1}^{n} and {t_i}_{i=1}^{n}, and using, for each point X, the features {f_X^(i)}_{i=1}^{n} as in equation (3), ray marching stage 310B produces the final feature as:

f_X = f_X^(1) ⊕ f_X^(2) ⊕ … ⊕ f_X^(n)

where ⊕ denotes concatenation along the depth dimension. This preserves the feature information from the viewpoints {v_i}_{i=1}^{n}, which helps PVA model 300 determine the best combination and make use of the conditioning information.

In some embodiments, PVA model 300 is agnostic to the viewpoints and to the number of conditioning views. Simple concatenation as described above is insufficient in this case, because the number of conditioning views may not be known a priori, which produces feature dimensions (d) of different sizes at inference time. To summarize the features of a multi-view setting, some embodiments include a permutation-invariant function G: R^(n×d) → R^d such that, for any permutation ψ:

G( f_X^(1), …, f_X^(n) ) = G( f_X^(ψ(1)), …, f_X^(ψ(n)) )

A simple permutation-invariant function for feature aggregation is the mean of the sampled feature maps 303. This aggregation procedure may be desirable when depth information is available during training. However, in the presence of depth ambiguity (e.g., for points projected onto feature map 303 prior to sampling), the above aggregation may lead to artifacts. To avoid this, some embodiments take the camera information into account to enable effective conditioning in radiance field stage 310C. Accordingly, some embodiments include a conditioning function network N_cf: R^(d+7) → R^(d′) that takes a feature vector f_X^(i) and the camera information (c_i) and produces a camera-summarized feature vector f′_X^(i). These modified vectors are then averaged over multiple, or all, conditioning views as:

f_X = (1/n) Σ_{i=1}^{n} f′_X^(i),   where f′_X^(i) = N_cf( f_X^(i), c_i )

An advantage of this approach is that the camera-summarized features can account for likely occlusions before the feature averaging is performed. The camera information is encoded as a 4D rotation quaternion and a 3D camera position.
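
A small sketch of this camera-aware, permutation-invariant aggregation is shown below; the conditioning network N_cf is stubbed here as a single linear layer followed by a ReLU, and the parameter names W and b are assumptions made purely for illustration.

```python
import numpy as np

def aggregate_features(per_view_features, camera_info, W, b):
    """Average camera-summarized features over the conditioning views.

    per_view_features : (n, d)   f_X^(i) sampled from each conditioning view
    camera_info       : (n, 7)   4D rotation quaternion + 3D position per camera
    W (d+7, d'), b (d',)         parameters of the stand-in N_cf
    """
    joint = np.concatenate([per_view_features, camera_info], axis=1)  # (n, d + 7)
    modified = np.maximum(joint @ W + b, 0.0)                         # f'_X^(i)
    return modified.mean(axis=0)                                      # permutation-invariant f_X
```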

Some embodiments may also include a background estimation network, N_bg, to avoid learning parts of the background into the scene representation. Background estimation network N_bg may be defined as N_bg: R^(n_c) → R^(h×w×3), to learn a fixed background per camera. In some embodiments, radiance field stage 310C may use N_bg to predict the final image pixels as:

I_final(p) = I(p) + ( 1 − I_α(p) )·I_bg     (11)

where, for camera c_i, I_bg = Ĩ_bg + N_bg(c_i), with Ĩ_bg an initial estimate of the background extracted using inpainting, and I_α as defined in equation (8). These inpainted backgrounds are often noisy, leading to "halo" effects around the head of the person. To avoid this, the N_bg model learns a residual to the inpainted background. This has the advantage of not requiring a high-capacity network to account for the background.

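Putting equation (11) and the residual background together, a minimal sketch of the final compositing step could look as follows; the array shapes are assumptions for this example.

```python
import numpy as np

def composite_with_background(I_fg, I_alpha, bg_inpainted, bg_residual):
    """I_final = I + (1 - I_alpha) * I_bg, with I_bg = inpainted estimate + learned residual.

    I_fg         : (H, W, 3) accumulated foreground color I(p)
    I_alpha      : (H, W)    accumulated opacity I_alpha(p)
    bg_inpainted : (H, W, 3) initial background estimate from inpainting
    bg_residual  : (H, W, 3) per-camera residual predicted by N_bg
    """
    I_bg = bg_inpainted + bg_residual
    return I_fg + (1.0 - I_alpha)[..., None] * I_bg
```
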
For ground-truth target images v_j, PVA model 300 uses a simple photometric reconstruction loss to train both radiance field stage 310C and the feature extraction network:

L = Σ_p ‖ I_final(p) − v_j(p) ‖²

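As a sketch, this photometric reconstruction loss can be written as a squared error between the composited pixels and the target view; reducing over all pixels with a mean (rather than a sum) is an assumption of this example.

```python
import numpy as np

def photometric_loss(I_final, target):
    """L2 photometric reconstruction loss between the rendered image I_final
    and the ground-truth target view v_j; gradients of this loss train both
    the radiance field and the feature extraction network."""
    return float(np.mean((np.asarray(I_final) - np.asarray(target)) ** 2))
```
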
FIGS. 4A-4C illustrate volumetric avatars 421A-1, 421A-2, 421A-3, 421A-4, and 421A-5 (hereinafter, collectively referred to as "volumetric avatars 421A"), 421B-1, 421B-2, 421B-3, 421B-4, and 421B-5 (hereinafter, collectively referred to as "volumetric avatars 421B"), and 421C-1, 421C-2, 421C-3, 421C-4, and 421C-5 (hereinafter, collectively referred to as "volumetric avatars 421C"). Volumetric avatars 421A, 421B, and 421C (hereinafter, collectively referred to as "volumetric avatars 421") are high-fidelity renditions of different subjects, obtained using only two views of each subject as input.

Volumetric avatars 421 illustrate that a PVA model as disclosed herein can generate multiple views of the avatars of different subjects from a large number of novel viewpoints, given only two views as input.

FIG. 5 illustrates avatars 521A-1, 521A-2, 521A-3, and 521A-4 (hereinafter, collectively referred to as "Reality Capture avatars 521A"), avatars 521B-1, 521B-2, 521B-3, and 521B-4 (hereinafter, collectively referred to as "Neural Volumes avatars 521B"), avatars 521C-1, 521C-2, 521C-3, and 521C-4 (hereinafter, collectively referred to as "globally conditioned cNeRF avatars 521C"), and avatars 521D-1, 521D-2, 521D-3, and 521D-4 (hereinafter, collectively referred to as "volumetric avatars 521D"), according to some embodiments. Avatars 521A, 521B, 521C, and 521D will be collectively referred to hereinafter as "avatars 521." Avatars 521 are obtained from two input images of a novel identity and are compared against ground-truth images 501-1, 501-2, 501-3, and 501-4 (hereinafter, collectively referred to as "ground-truth images 501"), used as inputs for computing the reconstruction.

Reality Capture avatars 521A are obtained using structure-from-motion (SfM) and multi-view stereo (MVS) algorithms, which reconstruct a 3D model from a set of captured images. Neural Volumes avatars 521B are obtained using a voxel-based inference method that globally encodes the dynamic images of a scene and decodes a voxel grid and a warp field representing the scene. cNeRF avatars 521C are produced by a variant of the NeRF algorithm with global identity conditioning (cNeRF). In some embodiments, cNeRF avatars 521C use a VGG network to extract a single 64-dimensional feature vector for each training identity and additionally condition the NeRF model on this input. Volumetric avatars 521D are obtained with a PVA model, as disclosed herein.

Volumetric avatars 521D are a more complete reconstruction than Reality Capture avatars 521A, which typically require many more ground-truth images 501 to obtain a good reconstruction. Owing to the pixel-aligned features in the PVA model as disclosed herein, volumetric avatars 521D also result in more detailed reconstructions than cNeRF avatars 521C, because the pixel-aligned features provide more complete information for the model at test time.

Table 1, below, compares the performance of volumetric avatars 521D against NV avatars 521B and cNeRF avatars 521C using different metrics. The structural similarity index (SSIM) preferably has a maximum value of one (1), while lower values are preferred for the learned perceptual image patch similarity (LPIPS) metric and the mean squared error (MSE) metric.

Table 1
          SSIM      MSE        LPIPS
cNeRF     0.7663    1611.01    4.3775
NV        0.8027    1208.36    3.1112
PVA       0.8889    383.71     1.7392

The PVA model as disclosed herein addresses certain shortcomings of globally identity-encoded approaches trained in a scene-specific manner, such as Neural Volumes and cNeRF, which do not generalize well to unseen identities. For example, cNeRF avatars 521C smooth out facial features and lose some local detail of unseen identities (such as the facial hair in 521C-3 and 521C-4, and the hair length in 521C-2), because this model relies heavily on the learned global prior. Since there is no prior model built into the SfM+MVS framework, Reality Capture avatars 521A fail to capture the head structure, leading to incomplete reconstructions. A large number of images would be required to faithfully reconstruct a novel identity with the RC 521A model. Neural Volumes avatars 521B produce better texture because of the generated warp field, which accounts for some degree of local information. However, Neural Volumes avatars 521B use an encoder configured with a global encoding and project test-time identities onto the nearest training-time identities, leading to inaccurate avatar predictions. Volumetric avatars 521D reconstruct the volumetric head, together with the structure of the hair, from just two example viewpoints.

FIG. 6 illustrates alpha views 631A-1, 631A-2, and 631A-3 (hereinafter, collectively referred to as "alpha views 631A"), normal views 633A-1, 633A-2, and 633A-3 (hereinafter, collectively referred to as "normal views 633A"), and avatar views 621A-1, 621A-2, and 621A-3 (hereinafter, collectively referred to as "avatars 621A"), produced using eNeRF, along with the associated ground-truth images 601-1, 601-2, and 601-3 for three different subjects, according to some embodiments. Alpha views 631B-1, 631B-2, and 631B-3 (hereinafter, collectively referred to as "alpha views 631B"), normal views 633B-1, 633B-2, and 633B-3 (hereinafter, collectively referred to as "normal views 633B"), and avatar views 621B-1, 621B-2, and 621B-3 (hereinafter, collectively referred to as "avatars 621B") of pixel-aligned avatars are obtained with a PVA model, as disclosed herein.

Compared to avatars 621A, avatars 621B are well suited to capturing expression information; with avatars 621A it is harder to generalize facial expressions to novel identities. To obtain avatars 621A, a NeRF model is conditioned on a one-hot expression code and one-hot identity information for the test-time identities (eNeRF). However, even though all identities have been seen during training, avatars 621B generalize better to the dynamic expressions of multiple identities than avatars 621A. Because the PVA model is conditioned on local features, avatars 621B capture dynamic effects for a specific identity (both geometry and texture) better than avatars 621A.

FIG. 7 illustrates predicted avatars 721-1, 721-2, 721-3, 721-4, and 721-5 (hereinafter, collectively referred to as "avatars 721") as a function of the number of views (columns), according to some embodiments. Normal views 733-1, 733-2, 733-3, 733-4, and 733-5 (hereinafter, collectively referred to as "normal views 733") are associated with each of avatars 721.

Avatars 721 and normal views 733 illustrate views of the subject different from those captured in images 701-1, 701-2, 701-3, 701-4, and 701-5 (hereinafter, collectively referred to as "ground-truth images 701"), according to some embodiments. Because the PVA model learns a shape prior from the training identities, normal views 733 are consistent with the identity in ground-truth images 701. However, when extrapolating to extreme views (733-1 and 721-1), artifacts appear in parts of the face that were not seen in the "conditioning" ground-truth image 701. This is due to the inherent depth ambiguity caused by projecting the sample points onto ground-truth image 701-1. Adding a second view (e.g., ground-truth image 701-2) already reduces the artifacts in normal view 733-2 and avatar 721-2 significantly, because the PVA model now has more information about features from different views and, therefore, depth information. In general, a PVA model as disclosed herein can achieve a large degree of view extrapolation with only two conditioning views.

FIG. 8 illustrates background ablation results for avatars 821-1 and 821-2 (hereinafter, collectively referred to as "avatars 821"), derived from normal views 833-1 and 833-2 (hereinafter, collectively referred to as "normal views 833") based on input images 801-1 and 801-2 (hereinafter, collectively referred to as "input images 801"), according to some embodiments.

FIG. 9 illustrates the sensitivity of the PVA model, based on conditioning views 901, to the choice of feature extractor: 921A ("hourglass network"), 921B ("UNet"), and 921C (shallow convolutional network), according to some embodiments. 921A and 921B are reliable feature extractors. In some embodiments, a shallow encoder-decoder architecture 921C may be desirable because it retains more local information, without having to encode all of the pixel-level information into a bottleneck layer.

FIG. 10 illustrates a camera-aware feature aggregation strategy, according to some embodiments. For a first subject, input images 1001A-1, 1001B-1, and 1001C-1 (hereinafter, collectively referred to as "input images 1001-1"), corresponding to different views and collected with different cameras, are averaged either without camera-specific information (avatar 1021A-1) or with camera-specific information (avatar 1021B-1). Likewise, for a second subject, input images 1001A-2, 1001B-2, and 1001C-2 (hereinafter, collectively referred to as "input images 1001-2"), corresponding to different views and cameras, are averaged without camera-specific information (avatar 1021A-2) or with camera-specific information (avatar 1021B-2). And for a third subject, input images 1001A-3, 1001B-3, and 1001C-3 (hereinafter, collectively referred to as "input images 1001-3"), corresponding to different views and cameras, are averaged without camera-specific information (cf. the camera intrinsics (K) and extrinsics (R, t) in equation (3)) (avatar 1021A-3) or with camera-specific information (avatar 1021B-3).

In particular, in the absence of camera information, avatars 1021A-1, 1021A-2, and 1021A-3 exhibit streaks in the generated images due to inconsistent averaging of information from the different viewpoints (especially in avatars 1021A-1 and 1021A-2).

FIG. 11 illustrates a flowchart of a method 1100 for rendering a three-dimensional (3D) view of a portion of a user's face from multiple two-dimensional (2D) images of the portion of the user's face. The steps in method 1100 may be performed at least partially by a processor executing instructions stored in a memory, wherein the processor and the memory are part of a client device or VR/AR headset as disclosed herein (e.g., memories 220, processors 212, client devices 110). In yet other embodiments, at least one or more of the steps in a method consistent with method 1100 may be performed by a processor executing instructions stored in a memory, wherein at least one of the processor and the memory is remotely located in a cloud server, and the headset device is communicatively coupled to the cloud server via a communication module coupled to a network (cf. servers 130, communication modules 218). In some embodiments, method 1100 may be performed using a volumetric avatar model engine configured to train a PVA model including an encoder-decoder tool, a ray marching tool, and a radiance field tool, the tools including neural network architectures in machine learning or artificial intelligence algorithms, as disclosed herein (e.g., volumetric avatar model engine 232, PVA model 240, encoder-decoder tool 242, ray marching tool 244, and radiance field tool 246). In some embodiments, methods consistent with the present disclosure may include at least one or more steps from method 1100 performed in a different order, simultaneously, quasi-simultaneously, or overlapping in time.

Step 1102 includes receiving multiple two-dimensional images having at least two or more fields of view of a subject.

Step 1104 includes extracting multiple image features from the two-dimensional images using a set of learnable weights. In some embodiments, step 1104 includes extracting the intrinsic properties of a camera used to collect the two-dimensional images.

Step 1106 includes projecting the image features along a direction between a three-dimensional model of the subject and a selected observation point for a viewer. In some embodiments, step 1106 includes interpolating a feature map associated with a first direction with a feature map associated with a second direction. In some embodiments, step 1106 includes aggregating the image features of multiple pixels along the direction between the three-dimensional model of the subject and the selected observation point. In some embodiments, step 1106 includes concatenating, in a permutation-invariant combination, multiple feature maps produced by each of multiple cameras, each of the multiple cameras having an intrinsic characteristic.

在一些實施例中,個體為具有指向使用者的網路攝影機之用戶端裝置的使用者,且步驟1106包括將所選擇觀測點識別為來自用戶端裝置的指向使用者的網路攝影機之位置。在一些實施例中,個體為其中運行有沉浸式實境應用程式的用戶端裝置的使用者,且步驟1106進一步包括將所選擇觀測點識別為在該沉浸式實境應用程式內觀察者所在的位置。In some embodiments, the individual is a user of a client device with a webcam pointed at the user, and step 1106 includes identifying the selected observation point as the location of the webcam on the client device pointing at the user. In some embodiments, the individual is a user of a client device running an immersive reality application, and step 1106 further includes identifying the selected observation point as the location of the observer within the immersive reality application.

步驟1108包括向觀察者提供個體之三維模型之影像。在一些實施例中,步驟1108包括基於個體之三維模型之影像與個體之實況影像之間的差異而評估損失函數;及基於該損失函數而更新可學習權重之集合中之至少一者。在一些實施例中,觀察者正使用網路耦接之用戶端裝置,且步驟1108包括將具有個體之三維模型之多個影像的視訊串流傳輸至網路耦接之用戶端裝置。Step 1108 includes providing the viewer with an image of the three-dimensional model of the individual. In some embodiments, step 1108 includes evaluating a loss function based on the difference between the image of the three-dimensional model of the individual and the live image of the individual; and updating at least one of the set of learnable weights based on the loss function. In some embodiments, the viewer is using a network-coupled client device, and step 1108 includes streaming a video stream with multiple images of the three-dimensional model of the individual to the network-coupled client device.
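
The loss-and-update language of step 1108 can be read as an ordinary reconstruction objective followed by a gradient step on the learnable weights. The L2 photometric loss and the PyTorch optimizer below are assumptions; the disclosure may use a different objective.

```python
import torch

def training_update(rendered, live_image, optimizer):
    """Compare the rendered avatar image with the live image and update the learnable weights."""
    loss = torch.mean((rendered - live_image) ** 2)  # assumed L2 photometric loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates at least one learnable weight
    return loss.item()
```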

圖12說明根據一些實施例的用於訓練模型以自使用者面部之一部分的多個二維(2D)影像呈現使用者面部之一部分的三維(3D)視圖之方法1200中的流程圖。方法1200中之步驟可至少部分藉由執行儲存於記憶體中之指令的處理器執行,其中處理器及記憶體為本文所揭示之用戶端裝置或VR/AR頭戴耳機之部分(例如,記憶體220、處理器212、用戶端裝置110)。在又其他實施例中,與方法1200一致之方法中的步驟中之至少一或多者可藉由執行儲存於記憶體中之指令的處理器執行,其中處理器及記憶體中之至少一者遠端地位於雲端伺服器中,且頭戴耳機裝置經由耦接至網路之通信模組通信耦接至雲端伺服器(參看伺服器130、通信模組218)。在一些實施例中,方法1200可使用體積化身模型引擎來執行,該體積化身模型引擎經組態以訓練包括編碼器-解碼器工具、射線行進工具及輻射場工具之PVA模型,這些工具包括機器學習或人工智慧演算法中之神經網路架構,如本文中所揭示(例如體積化身模型引擎232、PVA模型240、編碼器-解碼器工具242、射線行進工具244及輻射場工具246)。在一些實施例中,與本發明一致之方法可包括來自方法1200之至少一或多個步驟,該至少一或多個步驟按不同次序、同時、準同時或時間上重疊地執行。FIG. 12 illustrates a flowchart of a method 1200 for training a model to render a three-dimensional (3D) view of a portion of a user's face from multiple two-dimensional (2D) images of that portion of the user's face, according to some embodiments. The steps in method 1200 may be performed, at least in part, by a processor executing instructions stored in a memory, where the processor and the memory are part of a client device or VR/AR headset as disclosed herein (e.g., memory 220, processor 212, client device 110). In yet other embodiments, at least one or more of the steps in a method consistent with method 1200 may be performed by a processor executing instructions stored in a memory, wherein at least one of the processor and the memory is located remotely in a cloud server, and the headset device is communicatively coupled to the cloud server via a communication module coupled to a network (see server 130, communication module 218). In some embodiments, method 1200 may be performed using a volumetric avatar model engine configured to train a PVA model including an encoder-decoder tool, a ray marching tool, and a radiation field tool, these tools including neural network architectures in machine learning or artificial intelligence algorithms, as disclosed herein (e.g., volume avatar model engine 232, PVA model 240, encoder-decoder tool 242, ray marching tool 244, and radiation field tool 246). In some embodiments, methods consistent with the present disclosure may include at least one or more steps from method 1200 performed in a different order, simultaneously, quasi-simultaneously, or overlapping in time.

步驟1202包括自多個使用者之面部收集多個實況影像。Step 1202 includes collecting multiple live images from multiple users' faces.

步驟1204包括藉由儲存之經校準立體影像對來修正實況影像。Step 1204 includes correcting (rectifying) the live images using stored pairs of calibrated stereoscopic images.
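
Correcting the live images against stored, calibrated stereo pairs can be done with standard calibration tooling. The snippet below uses OpenCV as one possible, assumed implementation; the inputs K1, D1, K2, D2, R, T stand for the stored stereo calibration (intrinsics, distortion, and relative pose) and are not taken from the disclosure.

```python
import cv2

def rectify_pair(img_left, img_right, K1, D1, K2, D2, R, T):
    """Rectify a calibrated stereo pair so that corresponding rows align across the two images."""
    size = (img_left.shape[1], img_left.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    rect_left = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
    rect_right = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)
    return rect_left, rect_right
```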

步驟1206包括運用三維面部模型產生個體之多個合成視圖,其中個體之合成視圖包括沿著對應於個體之多個視圖的不同方向投影的多個特徵圖之內插。在一些實施例中,步驟1206包括:沿著所選擇觀測方向投影來自實況影像中之每一者的影像特徵;及以排列不變的組合來串接由實況影像中之每一者產生的多個特徵圖,該些實況影像中之每一者具有一本質特性。在一些實施例中,步驟1206進一步包括藉由平均化來自多個攝影機之多個特徵向量以在所要點處遍及不同方向形成攝影機彙總之特徵向量來內插特徵圖。Step 1206 includes generating multiple synthetic views of the individual using the three-dimensional facial model, wherein the synthetic views of the individual include an interpolation of multiple feature maps projected along different directions corresponding to the multiple views of the individual. In some embodiments, step 1206 includes: projecting image features from each of the live images along the selected viewing direction; and concatenating, in a permutation-invariant combination, the multiple feature maps generated from each of the live images, each of the live images having an essential characteristic. In some embodiments, step 1206 further includes interpolating the feature maps by averaging multiple feature vectors from multiple cameras to form a camera-aggregated feature vector across the different directions at a desired point.
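
Generating a synthetic view from the camera-aggregated features can follow the ray marching stage 310B and volume rendering 321 of FIG. 3: march samples along each ray of the target view, query a small radiance-field network for color and density conditioned on the pixel-aligned features, and composite the samples. The sketch below is a simplified, assumed rendering loop; `radiance_mlp` and `feature_fn` are placeholders for the trained network and for the aggregation routine sketched earlier.

```python
import torch

def render_ray(origin, direction, radiance_mlp, feature_fn, num_samples=64, near=0.5, far=1.5):
    """March along one target-view ray and composite color/density into a pixel (volume rendering)."""
    t_vals = torch.linspace(near, far, num_samples)
    points = origin[None, :] + t_vals[:, None] * direction[None, :]   # (S, 3) samples on the ray
    features = feature_fn(points)                                     # (S, C) camera-aggregated features
    rgb, sigma = radiance_mlp(points, direction.expand_as(points), features)  # (S, 3), (S,), sigma >= 0
    delta = t_vals[1] - t_vals[0]
    alpha = 1.0 - torch.exp(-sigma * delta)                           # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                                 # transmittance up to each sample
    weights = alpha * trans
    color = (weights[:, None] * rgb).sum(dim=0)                       # composited pixel color
    return color, weights.sum()                                       # color and accumulated alpha
```

The accumulated alpha returned here is also what a per-camera background model (see step 1208 below) can be composited against.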

步驟1208包括基於實況影像與個體之合成視圖之間的差異來訓練三維面部模型。在一些實施例中,步驟1208包括基於指示實況影像與個體之合成視圖之間的差異之損失函數之值來更新用於特徵圖中之多個特徵中之每一者的可學習權重之集合中的至少一者。在一些實施例中,步驟1208包括基於自多個實況影像投影之像素背景值訓練用於實況影像中之多個像素中之每一者的背景值。在一些實施例中,步驟1208包括使用用於收集多個實況影像之多個攝影機中之每一者的特定特徵來產生背景模型。Step 1208 includes training the three-dimensional facial model based on the differences between the live images and the synthetic views of the individual. In some embodiments, step 1208 includes updating at least one of the set of learnable weights for each of the multiple features in the feature maps based on a value of a loss function indicative of the difference between the live images and the synthetic views of the individual. In some embodiments, step 1208 includes training a background value for each of the multiple pixels in the live images based on pixel background values projected from the multiple live images. In some embodiments, step 1208 includes generating a background model using specific characteristics of each of the multiple cameras used to collect the multiple live images.
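
One hedged reading of the background-model language in step 1208: learn one background image per capture camera, composite the rendered avatar over it using the accumulated alpha, and let the photometric loss against the live image update both the avatar model and the backgrounds. The class and parameter names below are assumptions; the optimizer is taken to hold both the model weights and the background parameters.

```python
import torch
import torch.nn as nn

class PerCameraBackground(nn.Module):
    """One learnable background image per capture camera (an assumed form of the background model)."""

    def __init__(self, num_cameras: int, height: int, width: int):
        super().__init__()
        self.backgrounds = nn.Parameter(torch.zeros(num_cameras, 3, height, width))

    def forward(self, cam_idx: int) -> torch.Tensor:
        return self.backgrounds[cam_idx]

def train_step(model, backgrounds, live_image, cam_idx, rays, optimizer):
    """Composite the rendered avatar over the per-camera background and fit it to the live image."""
    rgb, alpha = model(rays)                              # (3, H, W) color and (1, H, W) accumulated alpha
    composite = alpha * rgb + (1.0 - alpha) * backgrounds(cam_idx)
    loss = torch.mean((composite - live_image) ** 2)      # photometric difference to the live image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # updates model weights and background pixels
    return loss.item()
```

硬體綜述 Hardware Overview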

圖13為說明可藉以實施頭戴耳機及其他用戶端裝置110以及方法1100及1200的例示性電腦系統1300的方塊圖。在某些態樣中,電腦系統1300可使用在專屬伺服器中或整合至另一實體中或橫越多個實體而分佈的硬體或軟體與硬體之組合來實施。電腦系統1300可包括桌上型電腦、膝上型電腦、平板電腦、平板手機、智慧型手機、功能電話(feature phone)、伺服器電腦或其他。伺服器電腦可在遠端位於資料中心中或在本端儲存。FIG. 13 is a block diagram illustrating an exemplary computer system 1300 with which the headsets and other client devices 110, and methods 1100 and 1200, may be implemented. In some aspects, computer system 1300 may be implemented using hardware, or a combination of software and hardware, either in a dedicated server, integrated into another entity, or distributed across multiple entities. Computer system 1300 may include a desktop computer, a laptop computer, a tablet computer, a phablet, a smartphone, a feature phone, a server computer, or another device. A server computer may be located remotely in a data center or hosted locally.

電腦系統1300包括用於傳達資訊之匯流排1308或其他通信機構,及與匯流排1308耦接以用於處理資訊之處理器1302(例如處理器212)。藉助於實例,電腦系統1300可藉由一或多個處理器1302實施。處理器1302可為通用微處理器、微控制器、數位信號處理器(DSP)、特殊應用積體電路(ASIC)、場可程式化閘陣列(FPGA)、可程式化邏輯裝置(PLD)、控制器、狀態機、閘控邏輯、離散硬體組件或可執行資訊之計算或其他操控的任何其他合適實體。Computer system 1300 includes a bus 1308 or other communication mechanism for communicating information, and a processor 1302 (eg, processor 212) coupled with bus 1308 for processing information. By way of example, computer system 1300 may be implemented with one or more processors 1302 . The processor 1302 can be a general purpose microprocessor, microcontroller, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA), programmable logic device (PLD), A controller, state machine, gating logic, discrete hardware component, or any other suitable entity that can perform computation or other manipulation of information.

除了硬體以外,電腦系統1300亦可包括創建用於所討論之電腦程式之執行環境的程式碼,例如構成以下各者的程式碼:處理器韌體、協定堆疊、資料庫管理系統、作業系統或其在以下各者中儲存中之一或多者的組合:所包括之記憶體1304(例如記憶體220)(諸如隨機存取記憶體(RAM)、快閃記憶體、唯讀記憶體(ROM)、可程式化唯讀記憶體(PROM)、可抹除可程式化唯讀記憶體(EPROM))、暫存器、硬碟、可移磁碟、CD-ROM、DVD或與匯流排1308耦接以用於儲存待藉由處理器1302執行之資訊及指令的任何其他合適儲存裝置。處理器1302及記憶體1304可由專用邏輯電路補充或併入於專用邏輯電路中。In addition to hardware, computer system 1300 may also include code that creates an execution environment for the computer program in question, such as code that constitutes: processor firmware, protocol stack, database management system, operating system or a combination thereof stored in one or more of: included memory 1304 (eg, memory 220 ) (such as random access memory (RAM), flash memory, read-only memory ( ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM)), Scratchpad, Hard Disk, Removable Disk, CD-ROM, DVD, or bus 1308 is coupled to any other suitable storage device for storing information and instructions to be executed by processor 1302. Processor 1302 and memory 1304 may be supplemented by or incorporated in special purpose logic circuitry.

該些指令可儲存於記憶體1304中且在一或多個電腦程式產品中實施,例如在電腦可讀取媒體上編碼以供電腦系統1300執行或控制該電腦系統之操作的電腦程式指令之一或多個模組,且根據所屬技術領域中具有通常知識者熟知之任何方法,該些指令包括但不限於如下電腦語言:資料導向語言(例如SQL、dBase)、系統語言(例如C、Objective-C、C++、彙編)、架構語言(如Java、.NET)及應用程式語言(例如PHP、Ruby、Perl、Python)。指令亦可用電腦語言實施,諸如陣列語言、特性導向語言、彙編語言、製作語言、命令行介面語言、編譯語言、並行語言、波形括號語言、資料流語言、資料結構式語言、宣告式語言、深奧語言、擴展語言、***語言、函數語言、互動模式語言、解譯語言、反覆語言、串列為基的語言、小語言、以邏輯為基的語言、機器語言、巨集語言、元程式設計語言、多重範型語言(multiparadigm language)、數值分析、非英語語言、物件導向分類式語言、物件導向基於原型的語言、場外規則語言、程序語言、反射語言、基於規則的語言、指令碼處理語言、基於堆疊的語言、同步語言、語法處置語言、視覺語言、wirth語言及基於xml的語言。記憶體1304亦可用於在待由處理器1302執行之指令之執行期間儲存暫時性變數或其他中間資訊。The instructions may be stored in memory 1304 and implemented in one or more computer program products, such as one of the computer program instructions encoded on a computer-readable medium for execution by computer system 1300 or to control the operation of the computer system or more modules, and according to any method well known to those of ordinary skill in the art, the instructions include, but are not limited to, the following computer languages: data-oriented languages (eg, SQL, dBase), system languages (eg, C, Objective- C, C++, assembly), architectural languages (e.g. Java, .NET) and application languages (e.g. PHP, Ruby, Perl, Python). Instructions can also be implemented in computer languages such as array languages, feature-oriented languages, assembly languages, authoring languages, command-line interface languages, compiled languages, parallel languages, curly bracket languages, data stream languages, data structured languages, declarative languages, esoteric languages Languages, Extended Languages, Fourth Generation Languages, Functional Languages, Interactive Pattern Languages, Interpretation Languages, Iterative Languages, String-Based Languages, Small Languages, Logic-Based Languages, Machine Languages, Macro Languages, Metaprograms Design languages, multiparadigm languages, numerical analysis, non-English languages, object-oriented categorical languages, object-oriented prototype-based languages, off-site rule languages, procedural languages, reflective languages, rule-based languages, script processing Languages, stack-based languages, synchronous languages, grammar-processing languages, visual languages, wirth languages, and xml-based languages. Memory 1304 may also be used to store transient variables or other intermediate information during execution of instructions to be executed by processor 1302.

如本文所論述之電腦程式未必對應於檔案系統中之檔案。程式可儲存於保持其他程式或資料(例如,儲存於標記語言文件中之一或多個指令碼)的檔案的一部分中、儲存於專用於所討論之程式的單一檔案中,或儲存於多個經協調檔案(例如,儲存一或多個模組、子程式或程式碼之部分的檔案)中。電腦程式可經部署以在一台電腦上或在位於一個位點或橫越多個位點分佈且由通信網路互連的多台電腦上執行。本說明書中描述的程序及邏輯流程可由一或多個可程式化處理器執行,該一或多個可程式化處理器執行一或多個電腦程式以藉由對輸入資料進行操作且產生輸出來執行功能。Computer programs as discussed herein do not necessarily correspond to files in a file system. Programs may be stored in part of a file that holds other programs or data (eg, one or more scripts stored in a markup language file), in a single file dedicated to the program in question, or in multiple In coordinated files (eg, files that store portions of one or more modules, subprograms, or code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. The procedures and logic flows described in this specification can be executed by one or more programmable processors executing one or more computer programs to perform operations on input data and generate output Execute function.

電腦系統1300進一步包括與匯流排1308耦接以用於儲存資訊及指令的資料儲存裝置1306,諸如磁碟或光碟。電腦系統1300可經由輸入/輸出模組1310耦接至各種裝置。輸入/輸出模組1310可為任何輸入/輸出模組。例示性輸入/輸出模組1310包括諸如USB埠之資料埠。輸入/輸出模組1310經組態以連接至通信模組1312。例示性通信模組1312包括網路連接介面卡,諸如乙太網路卡及數據機。在某些態樣中,輸入/輸出模組1310經組態以連接至複數個裝置,諸如輸入裝置1314及/或輸出裝置1316。例示性輸入裝置1314包括鍵盤及指標裝置,例如滑鼠或軌跡球,消費者可藉由該指標裝置提供輸入至電腦系統1300。其他種類之輸入裝置1314亦可用以提供與消費者的互動,諸如觸覺輸入裝置、視覺輸入裝置、音訊輸入裝置或腦機介面裝置。舉例而言,提供給消費者之回饋可為任何形式之感測回饋,例如視覺回饋、聽覺回饋或觸覺回饋;且可自消費者接收任何形式之輸入,包括聲輸入、語音輸入、觸覺輸入或腦波輸入。例示性輸出裝置1316包括用於向消費者顯示資訊之顯示裝置,諸如液晶顯示器(liquid crystal display,LCD)監視器。Computer system 1300 further includes a data storage device 1306, such as a magnetic or optical disk, coupled to bus 1308 for storing information and instructions. Computer system 1300 may be coupled to various devices via input/output module 1310 . The input/output module 1310 can be any input/output module. Exemplary input/output modules 1310 include data ports such as USB ports. Input/output module 1310 is configured to connect to communication module 1312. Exemplary communication modules 1312 include network connection interface cards, such as Ethernet cards and modems. In some aspects, input/output module 1310 is configured to connect to a plurality of devices, such as input device 1314 and/or output device 1316. Exemplary input devices 1314 include keyboards and pointing devices, such as a mouse or trackball, by which a consumer may provide input to computer system 1300 . Other types of input devices 1314 may also be used to provide interaction with consumers, such as tactile input devices, visual input devices, audio input devices, or brain-computer interface devices. For example, the feedback provided to the consumer can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and can receive any form of input from the consumer, including acoustic input, voice input, tactile input, or Brainwave input. Exemplary output devices 1316 include display devices for displaying information to consumers, such as liquid crystal display (LCD) monitors.

根據本發明之一個態樣,回應於處理器1302執行記憶體1304中所含有之一或多個指令的一或多個序列,可至少部分地使用電腦系統1300實施頭戴耳機及用戶端裝置110。可將這些指令自另一機器可讀取媒體(諸如資料儲存裝置1306)讀取至記憶體1304中。主記憶體1304中含有之指令序列之執行致使處理器1302執行本文中所描述之程序步驟。呈多處理配置之一或多個處理器亦可用以執行記憶體1304中含有之指令序列。在替代態樣中,硬佈線電路可代替軟體指令使用或與軟體指令組合使用,以實施本發明之各個態樣。因此,本發明之態樣不限於硬體電路及軟體之任何特定組合。According to one aspect of the invention, in response to processor 1302 executing one or more sequences of one or more instructions contained in memory 1304, headset and client device 110 may be implemented, at least in part, using computer system 1300 . These instructions can be read into memory 1304 from another machine-readable medium, such as data storage 1306 . Execution of sequences of instructions contained in main memory 1304 causes processor 1302 to perform the program steps described herein. One or more processors in a multiprocessing configuration may also be used to execute sequences of instructions contained in memory 1304 . In alternative aspects, hard-wired circuitry may be used in place of or in combination with software instructions to implement various aspects of the present invention. Accordingly, aspects of the present invention are not limited to any specific combination of hardware circuitry and software.

本說明書中所描述之標的的各種態樣可於計算系統中實施,該計算系統包括後端組件,例如資料伺服器,或包括中間軟體組件,例如應用程式伺服器,或包括前端組件,例如具有消費者可與本說明書中所描述之標的之實施互動所經由的圖形消費者介面或網路瀏覽器的用戶端電腦,或一或多個後端組件、中間軟體組件或前端組件的任何組合。系統之組件可藉由數位資料通信之任何形式或媒體(例如通信網路)互連。通信網路可包括例如LAN、WAN、網際網路及其類似者中的任一或多者。此外,通信網路可包括但不限於例如以下網路拓樸中的任一或多者,包括:匯流排網路、星形網路、環形網路、網狀網路、星形匯流排網路、樹或階層式網路或其類似者。通信模組可例如為數據機或乙太網路卡。Various aspects of the subject matter described in this specification can be implemented in computing systems that include back-end components, such as data servers, or intermediate software components, such as application servers, or front-end components, such as A client computer, or any combination of one or more backend components, middleware components, or frontend components, through which a consumer may interact with implementations of the subject matter described in this specification is through a graphical consumer interface or a web browser. The components of the system may be interconnected by any form or medium of digital data communication, such as a communication network. A communication network may include, for example, any or more of a LAN, WAN, the Internet, and the like. Additionally, a communication network may include, but is not limited to, for example, any or more of the following network topologies, including: bus network, star network, ring network, mesh network, star bus network Road, tree or hierarchical network or the like. The communication module can be, for example, a modem or an Ethernet card.

電腦系統1300可包括用戶端及伺服器。用戶端及伺服器大體上彼此遠離且典型地經由通信網路互動。用戶端與伺服器之關係藉助於在各別電腦上運行且彼此具有主從式關係的電腦程式產生。電腦系統1300為例如但不限於桌上型電腦、膝上型電腦或平板電腦。電腦系統1300亦可嵌入於另一裝置中,例如但不限於行動電話、PDA、行動音訊播放器、全球定位系統(GPS)接收器、視訊遊戲控制台及/或電視機上盒。The computer system 1300 may include a client and a server. Clients and servers are generally remote from each other and typically interact via a communication network. The relationship between the client and the server is created by means of computer programs running on the respective computers and having a master-slave relationship with each other. Computer system 1300 is, for example, but not limited to, a desktop computer, laptop computer, or tablet computer. Computer system 1300 may also be embedded in another device, such as, but not limited to, a mobile phone, PDA, mobile audio player, global positioning system (GPS) receiver, video game console, and/or television set-top box.

如本文中所使用之術語「機器可讀取儲存媒體」或「電腦可讀取媒體」係指參與將指令提供至處理器1302以供執行之任一或多個媒體。此媒體可採取許多形式,包括但不限於非揮發性媒體、揮發性媒體及傳輸媒體。非揮發性媒體包括例如光碟或磁碟,諸如資料儲存裝置1306。揮發性媒體包括動態記憶體,諸如記憶體1304。傳輸媒體包括同軸纜線、銅線及光纖,包括包含匯流排1308之電線。機器可讀取媒體之常見形式包括例如軟碟、軟性磁碟、硬碟、磁帶、任何其他磁性媒體、CD-ROM、DVD、任何其他光學媒體、打孔卡、紙帶、具有孔圖案之任何其他實體媒體、RAM、PROM、EPROM、FLASH EPROM、任何其他記憶體晶片或卡匣,或可供電腦讀取之任何其他媒體。機器可讀取儲存媒體可為機器可讀取儲存裝置、機器可讀取儲存基板、記憶體裝置、影響機器可讀取傳播信號之物質的組成物,或其中之一或多者的組合。The term "machine-readable storage medium" or "computer-readable medium" as used herein refers to any medium or mediums that participate in providing instructions to processor 1302 for execution. This medium can take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as data storage device 1306 . Volatile media includes dynamic memory, such as memory 1304 . Transmission media include coaxial cables, copper wire, and fiber optics, including wires including bus bars 1308 . Common forms of machine-readable media include, for example, floppy disks, floppy disks, hard disks, magnetic tapes, any other magnetic media, CD-ROMs, DVDs, any other optical media, punched cards, paper tape, any Other physical media, RAM, PROM, EPROM, FLASH EPROM, any other memory chips or cartridges, or any other media that can be read by a computer. A machine-readable storage medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter that affects a machine-readable propagated signal, or a combination of one or more thereof.

為了說明硬體與軟體之互換性,諸如各種說明性區塊、模組、組件、方法、操作、指令及演算法之項目已大體按其功能性加以了描述。將此功能性實施為硬體、軟體抑或硬體與軟體之組合視強加於整個系統上之特定應用及設計約束而定。所屬技術領域中具有通常知識者可針對每一特定應用以不同方式實施所描述功能性。To illustrate this interchangeability of hardware and software, items such as various illustrative blocks, modules, components, methods, operations, instructions, and algorithms have been described generally in terms of their functionality. Implementing this functionality as hardware, software, or a combination of hardware and software depends upon the particular application and design constraints imposed on the overall system. One of ordinary skill in the art may implement the described functionality in different ways for each particular application.

如本文中所使用,在一系列項目之前的藉由術語「及」或「或」分離該些項目中之任一者的片語「…中之至少一者」修改清單整體,而非清單中之每一成員(例如,每一項目)。片語「…中之至少一者」不需要選擇至少一個項目;實情為,該片語允許包括該些項目中之任一者中的至少一者及/或該些項目之任何組合中的至少一者及/或該些項目中之每一者中的至少一者的涵義。藉助於實例,片語「A、B及C中之至少一者」或「A、B或C中之至少一者」各自指僅A、僅B或僅C;A、B及C之任何組合;及/或A、B及C中之每一者中的至少一者。As used herein, the phrase "at least one of" preceding a list of items by the terms "and" or "or" to separate any of those items modifies the list as a whole, rather than the list each member (eg, each item). The phrase "at least one of" does not require selection of at least one item; instead, the phrase allows the inclusion of at least one of any of the items and/or at least one of any combination of the items Meaning of at least one of one and/or each of these items. By way of example, the phrases "at least one of A, B, and C" or "at least one of A, B, or C" each mean only A, only B, or only C; any combination of A, B, and C ; and/or at least one of each of A, B, and C.

詞語「例示性」在本文中用以意謂「充當一實例、例項或說明」。本文中被描述為「例示性」之任何實施例未必被解釋為比其他實施例較佳或有利。諸如一態樣、該態樣、另一態樣、一些態樣、一或多個態樣、一實施、該實施、另一實施、一些實施、一或多個實施、一實施例、該實施例、另一實施例、一些實施例、一或多個實施例、一組態、該組態、另一組態、一些組態、一或多個組態、本發明技術、本發明(the disclosure/the present disclosure)及其其他變化及類似者之片語係為方便起見,且並不暗示與此(等)片語相關之揭示內容對於本發明技術係必需的,亦不暗示此揭示內容適用於本發明技術之所有組態。與此(等)片語相關之揭示內容可適用於所有組態或一或多個組態。與此(等)片語相關之揭示內容可提供一或多個實例。諸如一態樣或一些態樣之片語可指一或多個態樣且反之亦然,且此情況類似地適用於其他前述片語。The word "exemplary" is used herein to mean "serving as an instance, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the implementation example, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the techniques of the present invention, the present invention disclosure/the present disclosure) and other variations and similar phrases are for convenience and do not imply that the disclosure related to this phrase(s) is necessary to the technology of the invention, nor does it imply that such disclosure The content applies to all configurations of the present technology. The disclosure related to this phrase(s) may apply to all configurations or one or more configurations. The disclosure related to this phrase(s) may provide one or more examples. A phrase such as an aspect or aspects may refer to one or more aspects and vice versa, and this applies similarly to the other aforementioned phrases.

除非具體陳述,否則以單數形式對元件的提及並不意欲意謂「一個且僅一個」,而是「一或多個」。術語「一些」係指一或多個。帶下劃線及/或斜體標題及子標題僅僅用於便利性,不限制本發明技術,且不結合本發明技術之描述的解釋而參考。關係術語,諸如第一及第二及其類似者,可用以區分一個實體或動作與另一實體或動作,而未必需要或意指此類實體或動作之間的任何實際此類關係或次序。所屬技術領域中具有通常知識者已知或稍後將知曉的貫穿本發明而描述的各種組態之元件的所有結構及功能等效物以引用方式明確地併入本文中,且意欲由本發明技術涵蓋。此外,本文所揭示之任何內容皆不意欲專用於公眾,無論在以上描述中是否明確地敍述此揭示內容。不應依據35 U.S.C. §112第六段的規定解釋任何請求項要素,除非使用片語「用於…之構件」來明確地敍述該要素或者在方法請求項之狀況下使用片語「用於…之步驟」來敍述該要素。Reference to an element in the singular is not intended to mean "one and only one" unless specifically stated, but rather "one or more." The term "some" refers to one or more. The underlined and/or italicized headings and subheadings are for convenience only, do not limit the techniques of the present invention, and are not to be referenced in connection with the interpretation of the description of the techniques of the present invention. Relational terms, such as first and second and the like, may be used to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. All structural and functional equivalents of elements in the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be utilized by the present technology covered. Furthermore, nothing disclosed herein is intended to be dedicated to the public, whether or not such disclosure is expressly recited in the above description. No claim element shall be construed in accordance with the sixth paragraph of 35 U.S.C. §112 unless the element is expressly recited by the use of the phrase "means for" or in the case of a method claim, the phrase "for... steps" to describe this element.

雖然本說明書含有許多特性,但這些特性不應理解為限制可能描述之內容的範圍,而是應理解為對標的之特定實施的描述。在單獨實施例之上下文中描述於此說明書中之某些特徵亦可在單一實施例中以組合形式實施。相反,在單一實施例之上下文中描述的各種特徵亦可分別或以任何適合子組合於多個實施例中實施。此外,儘管上文可將特徵描述為以某些組合起作用且甚至最初按此來描述,但來自所描述組合之一或多個特徵在一些狀況下可自該組合刪除,且所描述之組合可針對子組合或子組合之變化。While this specification contains many features, these should not be construed as limiting the scope of what may be described, but rather as descriptions of particular implementations of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as functioning in certain combinations and even initially described as such, one or more features from a described combination may in some cases be deleted from the combination and the described combination Variations of sub-combinations or sub-combinations may be targeted.

本說明書之標的已關於特定態樣加以描述,但其他態樣可經實施且在以下申請專利範圍之範圍內。舉例而言,儘管在圖式中以特定次序來描繪操作,但不應將此理解為需要以所展示之特定次序或以順序次序執行這些操作,或執行所有所說明操作以達成合乎需要之結果。可以不同次序執行申請專利範圍中所敍述之動作且該些動作仍達成合乎需要的結果。作為一個實例,附圖中描繪之程序未必需要所展示之特定次序,或依序次序,以達成合乎需要的結果。在某些情形中,多任務及並行處理可為有利的。此外,不應將上文所描述之態樣中之各種系統組件的分離理解為在所有態樣中皆要求此分離,且應理解,所描述之程式組件及系統可大體上一起整合於單一軟體產品中或封裝至多個軟體產品中。The subject matter of this specification has been described with respect to certain aspects, but other aspects can be implemented and are within the scope of the following claims. For example, although operations are depicted in the figures in a particular order, this should not be construed as requiring performance of the operations in the particular order shown or in a sequential order, or performance of all illustrated operations, to achieve desirable results . The actions recited in the claimed scope can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain situations, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system components in the aspects described above should not be construed as requiring such separation in all aspects, and it should be understood that the described program components and systems may be generally integrated together in a single software product or packaged into multiple software products.

在此將標題、背景技術、圖式簡單說明、摘要及圖式併入本發明中且提供為本發明之說明性實例而非限定性描述。應遵從以下理解:其將不用以限制申請專利範圍之範圍或涵義。另外在實施方式中可見,出於精簡本發明之目的,本說明書提供說明性實例且在各種實施中將各種特徵分組在一起。不應將本發明之方法解釋為反映以下意圖:相較於每一請求項中明確敍述之特徵,所描述之標的需要更多的特徵。確切而言,如申請專利範圍所反映,本發明標的在於單一所揭示組態或操作之少於全部的特徵。以下申請專利範圍特此併入實施方式中,其中每一請求項就其自身而言作為分開描述之標的。The heading, background, brief description of the drawings, abstract and drawings are hereby incorporated into this disclosure and are provided as illustrative examples of the invention rather than as limiting descriptions. The following understanding should be followed: it is not intended to limit the scope or meaning of the claimed scope. Also seen in the embodiments, for the purpose of streamlining the invention, this specification provides illustrative examples and groups various features together in various implementations. The methods of the present invention should not be interpreted as reflecting an intention that the described subject matter requires more features than those expressly recited in each claim. Rather, as reflected in the claimed scope, the present invention is directed to less than all features of a single disclosed configuration or operation. The following claims are hereby incorporated into the Embodiments, with each claim in its own right the subject of a separate description.

申請專利範圍並不意欲限於本文中所描述之態樣,而應符合與申請專利範圍所表述的一致之完整範圍且涵蓋所有法定均等物。儘管如此,申請專利範圍均不意欲涵蓋未能滿足可適用專利法之要求的標的,且亦不應以此方式解釋該些標的。The claimed scope is not intended to be limited to the aspects described herein, but is to be accorded the full scope consistent with the presentation of the claimed scope and to encompass all statutory equivalents. Nonetheless, none of the claims are intended to cover subject matter that fails to meet the requirements of applicable patent law, nor should such subject matter be construed in this manner.

100:架構 110:用戶端裝置 130:伺服器 150:網路 152:資料庫 200:方塊圖 212-1:處理器 212-2:處理器 214:輸入裝置 216:輸出裝置 218-1:通信模組 218-2:通信模組 220-1:記憶體 220-2:記憶體 222:應用程式 225:圖形使用者介面(GUI) 232:體積化身模型引擎 240:像素對準之體積化身(PVA)模型 242:編碼器-解碼器工具 244:射線行進工具 246:輻射場工具 252:訓練資料庫 300:模型架構 301-1:2D輸入影像 301-2:2D輸入影像 301-n:2D輸入影像 303-1:特徵圖 303-2:特徵圖 303-n:特徵圖 310A:卷積編碼器-解碼器網路 310B:射線行進級 310C:輻射場級 311:視點 315:輻射場 320-1:可學習權重 320-2:可學習權重 320-n:可學習權重 321:體積呈現 323:世界至攝影機投影 325:雙線性內插 327:位置編碼 329:特徵聚集 330:點 331:視點/中心 335:射線 421A-1:體積化身 421A-2:體積化身 421A-3:體積化身 421A-4:體積化身 421A-5:體積化身 421B-1:體積化身 421B-2:體積化身 421B-3:體積化身 421B-4:體積化身 421B-5:體積化身 421C-1:體積化身 421C-2:體積化身 421C-3:體積化身 421C-4:體積化身 421C-5:體積化身 501-1:實況影像 501-2:實況影像 501-3:實況影像 501-4:實況影像 521A-1:實境捕獲化身 521A-2:實境捕獲化身 521A-3:實境捕獲化身 521A-4:實境捕獲化身 521B-1:神經體積化身 521B-2:神經體積化身 521B-3:神經體積化身 521B-4:神經體積化身 521C-1:cNeRF化身 521C-2:cNeRF化身 521C-3:cNeRF化身 521C-4:cNeRF化身 521D-1:體積化身 521D-2:體積化身 521D-3:體積化身 521D-4:體積化身 601-1:實況影像 601-2:實況影像 601-3:實況影像 621A-1:化身視圖 621A-2:化身視圖 621A-3:化身視圖 621B-1:化身視圖 621B-2:化身視圖 621B-3:化身視圖 631A-1:阿爾發視圖 631A-2:阿爾發視圖 631A-3:阿爾發視圖 631B-1:阿爾發視圖 631B-2:阿爾發視圖 631B-3:阿爾發視圖 633A-1:正常視圖 633A-2:正常視圖 633A-3:正常視圖 633B-1:正常視圖 633B-2:正常視圖 633B-3:正常視圖 701-1:實況影像 701-2:實況影像 701-3:影像 701-4:影像 701-5:影像 721-1:化身 721-2:化身 721-3:化身 721-4:化身 721-5:化身 733-1:正常視圖 733-2:正常視圖 733-3:正常視圖 733-4:正常視圖 733-5:正常視圖 801-1:輸入影像 801-2:輸入影像 821-1:化身 821-2:化身 833-1:正常視圖 833-2:正常視圖 901:調節視圖 921A:沙漏網路 921B:UNet 921C:淺卷積網路/淺編碼器-解碼器架構 1001A-1:輸入影像 1001A-2:輸入影像 1001A-3:輸入影像 1001B-1:輸入影像 1001B-2:輸入影像 1001B-3:輸入影像 1001C-1:輸入影像 1001C-2:輸入影像 1001C-3:輸入影像 1021A-1:化身 1021A-2:化身 1021A-3:化身 1021B-1:化身 1021B-2:化身 1021B-3:化身 1100:方法 1102:步驟 1104:步驟 1106:步驟 1108:步驟 1200:方法 1202:步驟 1204:步驟 1206:步驟 1208:步驟 1300:電腦系統 1302:處理器 1304:記憶體 1306:資料儲存裝置 1308:匯流排 1310:輸入/輸出模組 1312:通信模組 1314:輸入裝置 1316:輸出裝置 100: Architecture 110: Client Device 130: Server 150: Internet 152:Database 200: Block Diagram 212-1: Processor 212-2: Processor 214: Input device 216: Output device 218-1: Communication module 218-2: Communication module 220-1: Memory 220-2: Memory 222: Apps 225: Graphical User Interface (GUI) 232: Volume Avatar Model Engine 240: Pixel Aligned Volume Avatar (PVA) Model 242: Encoder-Decoder Tools 244: Ray Marching Tool 246: Radiation Field Tool 252: Training database 300: Model Architecture 301-1: 2D Input Image 301-2: 2D Input Image 301-n: 2D input image 303-1: Feature Map 303-2: Feature Map 303-n: Feature Map 310A: Convolutional Encoder-Decoder Networks 310B: Ray Marching Stage 310C: Radiated Field Class 311: Viewpoint 315: Radiation Field 320-1: Learnable Weights 320-2: Learnable Weights 320-n: learnable weights 321: Volume rendering 323: World to Camera Projection 325: bilinear interpolation 327: position code 329: Feature Aggregation 330: point 331: Viewpoint/Center 335: Ray 421A-1: Volume Incarnation 421A-2: Volume Incarnation 421A-3: Volume Incarnation 421A-4: Volume Incarnation 421A-5: Volume Incarnation 421B-1: Volume Incarnation 421B-2: Volume Incarnation 421B-3: Volume Incarnation 421B-4: Volume Incarnation 421B-5: Volume Avatar 421C-1: Volume Incarnation 421C-2: Volume Incarnation 421C-3: Volume Incarnation 421C-4: Volume Incarnation 421C-5: Volume incarnation 501-1: Live Video 501-2: Live Video 501-3: Live Video 501-4: Live Video 521A-1: Reality Capture Avatar 521A-2: Reality Capture Avatar 521A-3: Reality Capture Avatar 521A-4: Reality Capture Avatar 521B-1: Nerve Volume Incarnation 521B-2: Nerve Volume Incarnation 521B-3: Nerve Volume Incarnation 521B-4: Nerve Volume Incarnation 521C-1: cNeRF incarnation 521C-2: cNeRF 
incarnation 521C-3: cNeRF Incarnation 521C-4: cNeRF Incarnation 521D-1: Volume Avatar 521D-2: Volume Avatar 521D-3: Volume Avatar 521D-4: Volume Avatar 601-1: Live Video 601-2: Live Video 601-3: Live Video 621A-1: Avatar View 621A-2: Avatar View 621A-3: Avatar View 621B-1: Avatar View 621B-2: Avatar View 621B-3: Avatar View 631A-1: Alpha View 631A-2: Alpha View 631A-3: Alpha View 631B-1: Alpha View 631B-2: Alpha View 631B-3: Alpha View 633A-1: Normal View 633A-2: Normal View 633A-3: Normal View 633B-1: Normal view 633B-2: Normal View 633B-3: Normal View 701-1: Live Video 701-2: Live Video 701-3: Video 701-4: Video 701-5: Video 721-1: Avatar 721-2: Avatar 721-3: Avatar 721-4: Avatar 721-5: Avatar 733-1: Normal View 733-2: Normal View 733-3: Normal View 733-4: Normal View 733-5: Normal View 801-1: Input image 801-2: Input image 821-1: Avatar 821-2: Avatar 833-1: Normal View 833-2: Normal View 901: Adjust View 921A: Hourglass Network 921B:UNet 921C: Shallow Convolutional Networks/Shallow Encoder-Decoder Architecture 1001A-1: Input image 1001A-2: Input image 1001A-3: Input Image 1001B-1: Input image 1001B-2: Input image 1001B-3: Input Image 1001C-1: Input image 1001C-2: Input image 1001C-3: Input image 1021A-1: Avatar 1021A-2: Avatar 1021A-3: Avatar 1021B-1: Avatar 1021B-2: Avatar 1021B-3: Avatar 1100: Method 1102: Steps 1104: Steps 1106: Steps 1108: Steps 1200: Method 1202: Steps 1204: Steps 1206: Steps 1208: Steps 1300: Computer Systems 1302: Processor 1304: Memory 1306: Data Storage Device 1308: Busbar 1310: Input/Output Module 1312: Communication module 1314: Input Device 1316: Output device

[圖1]說明根據一些實施例的適合於在虛擬實境環境中提供即時、穿著衣服的個體動畫之實例架構。[FIG. 1] illustrates an example architecture suitable for providing instant, clothed animation of individuals in a virtual reality environment, according to some embodiments.

[圖2]為說明根據本發明之某些態樣的來自圖1之架構之實例伺服器及用戶端的方塊圖。[FIG. 2] is a block diagram illustrating an example server and client from the architecture of FIG. 1 in accordance with some aspects of the present invention.

[圖3]說明根據一些實施例的用於VR/AR頭戴耳機使用者之面部之一部分之3D呈現的模型架構的方塊圖。[FIG. 3] A block diagram illustrating a model architecture for 3D rendering of a portion of a VR/AR headset user's face, according to some embodiments.

[圖4A]至[圖4C]說明根據一些實施例的在僅給出兩個視圖作為輸入之情況下所計算的體積化身。[FIG. 4A]-[FIG. 4C] illustrate volume avatars calculated given only two views as input, according to some embodiments.

[圖5]說明根據一些實施例的與實況身分相比之不同技術:實境捕獲、神經體積、全域調節、神經輻射場(NeRF)及像素對準技術。[FIG. 5] illustrates different techniques compared to live identities: reality capture, neural volume, global modulation, neural radiation field (NeRF), and pixel alignment techniques, according to some embodiments.

[圖6]說明根據一些實施例的與實況相比使用eNerf及像素對準之化身在典型視點中產生的阿爾發/正常/化身。[FIG. 6] illustrates the alpha/normal/avatar produced in a typical viewpoint using eNerf and pixel alignment of the avatar compared to live, according to some embodiments.

[圖7]說明根據一些實施例的關於視圖之數目的經預測紋理。[FIG. 7] illustrates predicted texture with respect to the number of views, according to some embodiments.

[圖8]說明根據一些實施例的背景消融結果。[FIG. 8] illustrates background ablation results according to some embodiments.

[圖9]說明根據一些實施例的像素對準之特徵對所使用特徵提取器(包括淺卷積網路)之選擇的敏感度。[FIG. 9] illustrates the sensitivity of pixel-aligned features to the choice of feature extractor used, including shallow convolutional networks, according to some embodiments.

[圖10]說明根據一些實施例的攝影機感知之特徵彙總策略。[FIG. 10] illustrates a camera-aware feature aggregation strategy according to some embodiments.

[圖11]說明根據一些實施例的用於自使用者面部之一部分的多個二維(2D)影像呈現使用者面部之一部分的三維(3D)視圖之方法中的流程圖。[FIG. 11] illustrates a flowchart in a method for presenting a three-dimensional (3D) view of a portion of a user's face from a plurality of two-dimensional (2D) images of a portion of the user's face, according to some embodiments.

[圖12]說明根據一些實施例的用於訓練模型以自使用者面部之一部分的多個二維(2D)影像呈現使用者面部之一部分的三維(3D)視圖之方法中的流程圖。[FIG. 12] illustrates a flowchart in a method for training a model to present a three-dimensional (3D) view of a portion of a user's face from a plurality of two-dimensional (2D) images of a portion of the user's face, according to some embodiments.

[圖13]說明根據一些實施例的經組態以執行使用AR或VR裝置之方法中之至少一些的電腦系統。[FIG. 13] illustrates a computer system configured to perform at least some of the methods of using an AR or VR device, according to some embodiments.

在圖式中,除非另有明確陳述,否則類似元件根據其描述同樣地予以標記。In the drawings, similar elements are likewise labeled according to their descriptions, unless expressly stated otherwise.

300:模型架構 300: Model Architecture

301-1:2D輸入影像 301-1: 2D Input Image

301-2:2D輸入影像 301-2: 2D Input Image

301-n:2D輸入影像 301-n: 2D input image

303-1:特徵圖 303-1: Feature Map

303-2:特徵圖 303-2: Feature Map

303-n:特徵圖 303-n: Feature Map

310A:卷積編碼器-解碼器網路 310A: Convolutional Encoder-Decoder Networks

310B:射線行進級 310B: Ray Marching Stage

310C:輻射場級 310C: Radiated Field Class

311:視點 311: Viewpoint

315:輻射場 315: Radiation Field

320-1:可學習權重 320-1: Learnable Weights

320-2:可學習權重 320-2: Learnable Weights

320-n:可學習權重 320-n: learnable weights

321:體積呈現 321: Volume rendering

323:世界至攝影機投影 323: World to Camera Projection

325:雙線性內插 325: bilinear interpolation

327:位置編碼 327: position code

329:特徵聚集 329: Feature Aggregation

330:點 330: point

331:視點/中心 331: Viewpoint/Center

335:射線 335: Ray

Claims (20)

一種電腦實施方法,其包含: 接收具有一個體之至少兩個或多於兩個視場的多個二維影像; 使用可學習權重之一集合自該些二維影像提取多個影像特徵; 沿著該個體之一三維模型與一觀察者之一所選擇觀測點之間的一方向投影該些影像特徵;及 向該觀察者提供該個體之該三維模型之一影像。 A computer-implemented method comprising: receiving a plurality of two-dimensional images having at least two or more than two fields of view of a volume; extracting a plurality of image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the individual and an observation point selected by an observer; and An image of the three-dimensional model of the individual is provided to the observer. 如請求項1之電腦實施方法,其中提取該些影像特徵包含提取用以收集該些二維影像之一攝影機的本質屬性。The computer-implemented method of claim 1, wherein extracting the image features includes extracting essential attributes of a camera used to collect the two-dimensional images. 如請求項1之電腦實施方法,其中沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵包含將一第一方向所相關聯的一特徵圖與一第二方向所相關聯的一特徵圖進行內插。The computer-implemented method of claim 1, wherein projecting the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer comprises a feature associated with a first direction The map is interpolated with a feature map associated with a second direction. 如請求項1之電腦實施方法,其中沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵包含沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向聚集多個像素之該些影像特徵。The computer-implemented method of claim 1, wherein projecting the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer comprises along the three-dimensional model of the individual and the observation The direction between the selected observation points gathers the image features of a plurality of pixels. 如請求項1之電腦實施方法,其中沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵包含以一排列不變的組合來串接由多個攝影機中之每一者產生的多個特徵圖,該多個攝影機中之每一者具有一本質特性。The computer-implemented method of claim 1, wherein projecting the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer comprises concatenating the image features in a permuted combination by A plurality of feature maps generated by each of a plurality of cameras, each of the plurality of cameras having an essential characteristic. 如請求項1之電腦實施方法,其進一步包含:基於該個體之該三維模型之該影像與該個體之一實況影像之間的一差異而評估一損失函數;及基於該損失函數而更新該可學習權重之該集合中的至少一者。The computer-implemented method of claim 1, further comprising: evaluating a loss function based on a difference between the image of the three-dimensional model of the individual and a live image of the individual; and updating the possible loss function based on the loss function At least one of the set of learning weights. 如請求項1之電腦實施方法,其中該個體為具有指向一使用者的一網路攝影機之一用戶端裝置的該使用者,該方法進一步包含將該所選擇觀測點識別為從該用戶端裝置指向該使用者的該網路攝影機之一位置。The computer-implemented method of claim 1, wherein the individual is the user having a client device with a webcam pointed to a user, the method further comprising identifying the selected observation point as a client device from the client device Point to one of the user's webcam locations. 
如請求項1之電腦實施方法,其中該觀察者係使用一網路耦接之用戶端裝置,且提供該個體之該三維模型之該影像包含將具有該個體之該三維模型之多個影像的一視訊串流傳輸至該網路耦接之用戶端裝置。The computer-implemented method of claim 1, wherein the viewer is using a network-coupled client device, and the image of the three-dimensional model of the individual providing the image includes images that would have images of the three-dimensional model of the individual A video stream is transmitted to the client device coupled to the network. 如請求項1之電腦實施方法,其中該個體係其中運行有一沉浸式實境應用程式的一用戶端裝置的一使用者,該方法進一步包含將該所選擇觀測點識別為在該沉浸式實境應用程式內該觀察者所在的位置。The computer-implemented method of claim 1, wherein the system is a user of a client device with an immersive reality application running therein, the method further comprising identifying the selected observation point as being in the immersive reality The location of this watcher within the application. 一種系統,其包含: 一記憶體,其儲存多個指令;及 一或多個處理器,其經組態以執行該些指令以使得該系統執行操作,該些操作包含: 接收具有一個體之至少兩個或多於兩個視場的多個二維影像; 使用可學習權重之一集合自該些二維影像提取多個影像特徵; 沿著該個體之一三維模型與一觀察者之一所選擇觀測點之間的一方向投影該些影像特徵;及 向該觀察者提供該個體之該三維模型之一自動立體影像。 A system comprising: a memory that stores a plurality of instructions; and one or more processors configured to execute the instructions to cause the system to perform operations including: receiving a plurality of two-dimensional images having at least two or more than two fields of view of a volume; extracting a plurality of image features from the two-dimensional images using a set of learnable weights; projecting the image features along a direction between a three-dimensional model of the individual and an observation point selected by an observer; and An autostereoscopic image of the three-dimensional model of the individual is provided to the observer. 如請求項10之系統,其中為了提取該些影像特徵,該一或多個處理器執行指令以提取用以收集該些二維影像之一攝影機的本質屬性。The system of claim 10, wherein to extract the image features, the one or more processors execute instructions to extract essential attributes of a camera used to collect the two-dimensional images. 如請求項10之系統,其中為了沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵,該一或多個處理器執行指令以將一第一方向所相關聯的一特徵圖與一第二方向所相關聯的一特徵圖進行內插。The system of claim 10, wherein to project the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer, the one or more processors execute instructions to convert a A feature map associated with the first direction is interpolated with a feature map associated with a second direction. 如請求項10之系統,其中為了沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵,該一或多個處理器執行指令以沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向聚集多個像素之該些影像特徵。The system of claim 10, wherein to project the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer, the one or more processors execute instructions to The direction between the three-dimensional model of the individual and the selected observation point of the observer aggregates the image features of a plurality of pixels. 如請求項10之系統,其中為了沿著該個體之該三維模型與該觀察者之該所選擇觀測點之間的該方向投影該些影像特徵,該一或多個處理器執行指令而以一排列不變的組合來串接由多個攝影機中之每一者產生的多個特徵圖,該多個攝影機中之每一者具有一本質特性。The system of claim 10, wherein to project the image features along the direction between the three-dimensional model of the individual and the selected observation point of the observer, the one or more processors execute instructions to A permutation-invariant combination concatenates feature maps produced by each of a plurality of cameras, each of the plurality of cameras having an essential characteristic. 
一種用於訓練一模型以將個體之視圖提供至一虛擬實境頭戴耳機中之一自動立體顯示器的電腦實施方法,其包含: 自多個使用者之面部收集多個實況影像; 運用儲存之經校準立體影像對來修正該些實況影像; 運用一三維面部模型來產生該些個體之多個合成視圖,其中該些個體之該些合成視圖包括沿著對應於該些個體之多個視圖的不同方向投影的多個特徵圖之一內插;及 基於該些實況影像與該些個體之該些合成視圖之間的一差異來訓練該三維面部模型。 A computer-implemented method for training a model to provide views of an individual to an autostereoscopic display in a virtual reality headset, comprising: collect multiple live images from multiple users' faces; correcting the live images using stored pairs of calibrated stereoscopic images; using a three-dimensional facial model to generate synthetic views of the individuals, wherein the synthetic views of the individuals include an interpolation of one of feature maps projected along different directions corresponding to the views of the individuals ;and The three-dimensional facial model is trained based on a difference between the live images and the synthetic views of the individuals. 如請求項15之電腦實施方法,其中產生多個合成視圖包含沿著一所選擇觀測方向投影來自該些實況影像中之每一者的影像特徵,且以一排列不變的組合來串接由該些實況影像中之每一者產生的多個特徵圖,該些實況影像中之每一者具有一本質特性。The computer-implemented method of claim 15, wherein generating a plurality of composite views includes projecting image features from each of the live images along a selected viewing direction, concatenated by a permutation-invariant combination A plurality of feature maps are generated from each of the live images, each of the live images having an essential characteristic. 如請求項15之電腦實施方法,其中訓練該三維面部模型包含基於指示該些實況影像與該些個體之該些合成視圖之間的該差異之一損失函數之一值來更新用於該些特徵圖中之多個特徵中之每一者的可學習權重之一集合中的至少一者。The computer-implemented method of claim 15, wherein training the three-dimensional facial model comprises updating a value for the features based on a value of a loss function indicative of the difference between the live images and the synthesized views of the individuals at least one of a set of learnable weights for each of the plurality of features in the graph. 如請求項15之電腦實施方法,其中訓練該三維面部模型包含基於從該些多個實況影像投影之一像素背景值訓練用於該些實況影像中之多個像素中之每一者的一背景值。The computer-implemented method of claim 15, wherein training the three-dimensional facial model comprises training a background for each of the plurality of pixels in the plurality of live images based on a pixel background value projected from the plurality of live images value. 如請求項15之電腦實施方法,其進一步包含藉由平均化來自多個攝影機之多個特徵向量以在一所要點處遍及不同方向形成一攝影機彙總之特徵向量來內插該些特徵圖。The computer-implemented method of claim 15, further comprising interpolating the feature maps by averaging a plurality of feature vectors from a plurality of cameras to form a camera-aggregated feature vector throughout different directions at a point. 如請求項15之電腦實施方法,其中訓練該三維面部模型包含使用用於收集該多個實況影像之多個攝影機中之每一者的特定特徵來產生一背景模型。The computer-implemented method of claim 15, wherein training the three-dimensional facial model includes generating a background model using specific features of each of the plurality of cameras used to collect the plurality of live images.
TW110148213A 2020-12-23 2021-12-22 Pixel-aligned volumetric avatars TW202226164A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063129989P 2020-12-23 2020-12-23
US63/129,989 2020-12-23
US17/556,367 2021-12-20
US17/556,367 US20220198731A1 (en) 2020-12-23 2021-12-20 Pixel-aligned volumetric avatars

Publications (1)

Publication Number Publication Date
TW202226164A true TW202226164A (en) 2022-07-01

Family

ID=79730677

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110148213A TW202226164A (en) 2020-12-23 2021-12-22 Pixel-aligned volumetric avatars

Country Status (4)

Country Link
EP (1) EP4264557A1 (en)
JP (1) JP2024501958A (en)
TW (1) TW202226164A (en)
WO (1) WO2022140445A1 (en)

Also Published As

Publication number Publication date
JP2024501958A (en) 2024-01-17
EP4264557A1 (en) 2023-10-25
WO2022140445A1 (en) 2022-06-30
