CN118279472A - Combining light normalization in 3D user representations - Google Patents


Info

Publication number
CN118279472A
CN118279472A
Authority
CN
China
Prior art keywords
user
representation
data
user representation
implementations
Prior art date
Legal status
Pending
Application number
CN202311824574.6A
Other languages
Chinese (zh)
Inventor
B·莫拉尔斯赫尔南多
M·S·哈驰因森
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Priority claimed from US 18/389,319 (published as US 2024/0221292 A1)
Application filed by Apple Inc filed Critical Apple Inc
Publication of CN118279472A

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

This document relates to light normalization in a combined 3D user representation. Various implementations disclosed include devices, systems, and methods for adjusting a combined user representation via light normalization techniques. For example, the process may include obtaining a first user representation of at least a first portion of a user, the first user representation generated via a first technique based on the user being under a first lighting condition in a first physical environment. The process may further include obtaining a second user representation of at least a second portion of the user, the second user representation generated by: generating an initial user representation; extinguishing (i.e., removing its lighting) the initial user representation based on a lighting representation of a second physical environment having a second lighting condition; and re-lighting the extinguished initial user representation based on the first lighting condition. The process may further include generating a combined user representation based on the first user representation and the second user representation.

Description

Combining light normalization in 3D user representations
Technical Field
The present disclosure relates generally to electronic devices, and in particular, to systems, methods, and devices for representing light normalization of a user in computer-generated content.
Background
The prior art may not accurately or faithfully present a current (e.g., real-time) representation of the appearance of a user of an electronic device. For example, the device may provide an avatar representation of the user based on images of the user's face obtained minutes, hours, days, or even years ago. Such a representation may not accurately represent the current (e.g., real-time) appearance of the user, e.g., may not show the lighting conditions of the user's current environment and/or may apply the lighting conditions of the viewing environment to a real-world representation. Accordingly, it may be desirable to provide a device that effectively provides a more accurate, faithful, and/or current representation of a user.
Disclosure of Invention
Various implementations disclosed herein include devices, systems, and methods that generate a combined user representation using a first user representation (e.g., live frame-specific 3D data) and a second user representation (e.g., PIFu texture data from a registration). Techniques may be used to generate a first user representation representing a user under a first lighting condition (e.g., a technique trained on training data having the first lighting condition, so that the generated live frame-specific 3D texture data is lit accordingly). The second user representation is generated by generating an initial representation using an image captured in a second physical environment having a second, different lighting condition, extinguishing it based on a lighting representation of the second physical environment (e.g., an image-based lighting (IBL) cube map estimated via machine learning techniques), and then relighting it to match the first lighting condition. The combined user representation may also be adjusted to match lighting in the current physical environment (e.g., via color matching based on color grading). The combined user representation may be displayed live, for example, during a communication session.
Various implementations disclosed herein include devices, systems, and methods that generate a set of values representing a three-dimensional (3D) shape and appearance of a user's face at a point in time for generating a user representation (e.g., an avatar). In some implementations, the set of values may be defined relative to a surface having a non-planar shape (e.g., a curved shape). The set of values may include depth values defining the depth of a portion of the face relative to a plurality of points on the surface (e.g., points in a grid on a partially cylindrical surface). For example, the depth value of a point may define a depth D1 at which a portion of the face is behind the location of the point on the surface, e.g., at the depth D1 along an orthogonal ray that begins at the point. The techniques described herein use depth values that are different from those in existing RGBDA images (e.g., red-green-blue-depth-alpha images) because existing RGBDA images define content depths relative to a single camera location, while the techniques described herein define depths relative to multiple points on a surface having a non-planar shape (e.g., a curved shape such as a cylinder).
Several advantages may be realized using a relatively simple set of values having depth values defined relative to a plurality of points on the surface. The set of values may require less computation and bandwidth than using a 3D mesh or 3D point cloud while achieving a more accurate representation of the user than RGBDA images. Furthermore, the set of values may be formatted/packaged in a manner similar to existing formats (e.g., RGBDA images), which may enable more efficient integration with systems based on such formats.
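For purposes of illustration only, the following sketch (in Python) shows one way such a set of values might be packed into an RGBDA-like buffer; the grid resolution and channel layout are assumptions chosen for illustration and are not specified by this disclosure.

```python
import numpy as np

# Hypothetical grid resolution for the points on the partially cylindrical surface.
GRID_H, GRID_W = 256, 256

def pack_value_set(rgb, depth, alpha):
    """Pack per-point appearance and depth values into an RGBDA-like buffer.

    rgb:   (GRID_H, GRID_W, 3) color per surface point
    depth: (GRID_H, GRID_W)    depth of the face along each point's orthogonal ray
    alpha: (GRID_H, GRID_W)    opacity/validity per surface point
    """
    buffer = np.zeros((GRID_H, GRID_W, 5), dtype=np.float32)
    buffer[..., 0:3] = rgb
    buffer[..., 3] = depth   # depth is relative to each surface point, not a single camera
    buffer[..., 4] = alpha
    return buffer
```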
Various implementations disclosed herein include devices, systems, and methods that generate a 3D representation of a user for each of a plurality of moments in time by combining the same predetermined 3D data of a first portion of the user with frame-specific 3D data of a second portion of the user captured at the plurality of moments in time. The predetermined 3D data may be a mesh of the user's upper body and head generated from registration data, such as one-time pixel-aligned implicit function (PIFu) data. The predetermined 3D data (such as PIFu data) may include a highly efficient implicit representation that locally aligns pixels of the 2D image with the global context of its corresponding 3D object. The frame-specific data may represent the user's face at each of a plurality of points in time, e.g., a frame-specific 3D representation of a live sequence of data, such as the set of values representing the 3D shape and appearance of the user's face at a point in time as described herein. The 3D data (e.g., PIFu data and frame-specific 3D data) from the two different sources may be combined for each moment in time by spatially aligning the data using a 3D reference point (e.g., a point defined relative to a skeletal representation) associated with the two data sets. A 3D representation of the user at multiple moments in time may be generated on a viewing device that combines the data and uses the combined data to present the view, for example, during a live communication (e.g., copresence) session.
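As a non-limiting illustration of the spatial alignment described above, the following sketch translates frame-specific points into the coordinate frame of the predetermined 3D data using the shared 3D reference point; the helper name and arguments are hypothetical, and a full implementation might also account for head rotation at each moment in time.

```python
import numpy as np

def align_frame_data_to_predetermined(frame_points, frame_ref, predetermined_ref):
    """Spatially align frame-specific 3D points with the predetermined 3D data using
    the 3D reference point shared by both data sets (e.g., a point defined at an
    offset from the atlas joint)."""
    offset = np.asarray(predetermined_ref) - np.asarray(frame_ref)
    # Express the frame-specific points in the predetermined data's coordinate frame.
    return np.asarray(frame_points) + offset
```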
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of: at a processor of a device, obtaining a first user representation of at least a first portion of the user, wherein the first user representation is generated via a first technique based on first image data obtained in a first physical environment via a first set of sensors, and the first user representation represents the user under a first lighting condition. The actions also include obtaining a second user representation of at least a second portion of the user. The second user representation is generated by: generating an initial user representation of the user based on second image data obtained via a second set of sensors in a second physical environment having a second lighting condition; extinguishing the initial user representation based on an illumination representation of the second physical environment; and generating the second user representation by re-lighting the extinguished initial user representation based on the first lighting condition. The actions also include generating a combined user representation based on the first user representation and the second user representation.
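A minimal, non-authoritative sketch of the claimed sequence of actions is shown below; every helper function passed in is a hypothetical placeholder rather than a component defined by this disclosure.

```python
def generate_combined_representation(first_image_data, second_image_data, first_lighting,
                                     first_technique, second_technique,
                                     estimate_lighting, delight, relight, combine):
    """Sketch of the claimed flow; all callables here are hypothetical placeholders."""
    # First user representation, generated under the first lighting condition.
    first_rep = first_technique(first_image_data)

    # Initial second representation, captured under a second lighting condition.
    initial_rep = second_technique(second_image_data)

    # Extinguish using an estimated lighting representation of the second
    # physical environment (e.g., an IBL cube map estimated via machine learning).
    second_env_lighting = estimate_lighting(second_image_data)
    extinguished = delight(initial_rep, second_env_lighting)

    # Re-light the extinguished representation to match the first lighting condition.
    second_rep = relight(extinguished, first_lighting)

    # Combine the two representations into a single combined user representation.
    return combine(first_rep, second_rep)
```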
These and other embodiments can each optionally include one or more of the following features.
In some aspects, the first user representation includes texture data generated via a machine learning model that is trained using training data obtained via one or more sensors in one or more environments having first lighting conditions. In some aspects, a first lighting condition is provided in the one or more environments using a plurality of lights positioned in a fixed positional relationship to provide a uniform light distribution over the face of the training subject. In some aspects, the first lighting condition is based on uniformly distributed light. In some aspects, the first lighting condition is a lighting condition of a first physical environment.
In some aspects, the illumination representation of the second physical environment includes an omnidirectional image representation of the second lighting condition of the second physical environment. In some aspects, the omnidirectional image representation of the second lighting condition of the second physical environment is generated by a machine learning model based on the second image data obtained via the second set of sensors.
In some aspects, re-illuminating the second user representation includes: the lighting properties of the second lighting conditions are matched to the lighting properties of the first user representation.
In some aspects, the first lighting condition of the first physical environment is different from the second lighting condition of the second physical environment.
In some aspects, the actions further include: providing a view of the adjusted combined user representation in a three-dimensional (3D) environment, wherein the adjusted combined user representation is generated by adjusting the combined user representation based on at least one of one or more color attributes or one or more light attributes of the 3D environment.
In some aspects, the first physical environment is different from the second physical environment.
In some aspects, the second portion represents the face, hair, neck, upper body, and clothing of the user, and the first portion represents only the face and hair of the user.
In some aspects, the combined user representation is a three-dimensional (3D) user representation.
According to some implementations, a non-transitory computer readable storage medium has stored therein instructions that are computer executable to perform or cause to be performed any of the methods described herein. According to some implementations, an apparatus includes one or more processors, non-transitory memory, and one or more programs; the one or more programs are stored in the non-transitory memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing or causing performance of any of the methods described herein.
Drawings
So that the present disclosure can be understood by those of ordinary skill in the art, a more detailed description may be had by reference to aspects of some illustrative implementations, some of which are shown in the accompanying drawings.
FIG. 1 illustrates an apparatus for obtaining sensor data from a user in accordance with some implementations.
FIG. 2 illustrates an example of a three-dimensional (3D) representation of at least a portion of a user in accordance with some implementations.
Fig. 3A and 3B illustrate examples of two-dimensional (2D) manifold surfaces provided as visualizations of a height field representation of a face, according to some implementations.
FIG. 4 illustrates an example of updating portions of a user's facial representation in accordance with some implementations.
Fig. 5A and 5B illustrate examples of 3D reference points defined relative to a skeletal representation of a user according to some implementations.
FIG. 6 illustrates an example in which a predetermined 3D representation and a parameterized grid are combined to generate a representation of a portion of a user based on 3D reference points, according to some implementations.
FIG. 7 illustrates an example of generating and displaying portions of a facial representation of a user in accordance with some implementations.
FIG. 8 illustrates an example of generating a combined 3D representation of a user for two different user representations based on one or more lighting conditions, according to some implementations.
FIG. 9 illustrates a system flow diagram that may generate a combined representation of a user based on predetermined representation data and frame-specific representation data, according to some implementations.
Fig. 10 illustrates a view of an exemplary electronic device operating in different physical environments during a communication session of a first user at a first device and a second user at a second device, and a combined 3D representation of the second user of the first device, according to some implementations.
FIG. 11 is a flow chart representation of a method for generating a combined 3D representation of a user from two user representations for multiple moments within a period of time based on an extinguishing/relighting technique, in accordance with some implementations.
Fig. 12 is a block diagram illustrating device components of an exemplary device according to some implementations.
Fig. 13 is a block diagram of an exemplary Head Mounted Device (HMD) according to some implementations.
The various features shown in the drawings may not be drawn to scale according to common practice. Accordingly, the dimensions of the various features may be arbitrarily expanded or reduced for clarity. In addition, some figures may not depict all of the components of a given system, method, or apparatus. Finally, like reference numerals may be used to refer to like features throughout the specification and drawings.
Detailed Description
Numerous details are described to provide a thorough understanding of the exemplary implementations shown in the drawings. However, the drawings illustrate only some example aspects of the disclosure and therefore should not be considered limiting. It will be apparent to one of ordinary skill in the art that other effective aspects or variations do not include all of the specific details set forth herein. Moreover, well-known systems, methods, components, devices, and circuits have not been described in detail so as not to obscure the more pertinent aspects of the exemplary implementations described herein.
Fig. 1 illustrates an exemplary environment 100 of a real-world environment 105 (e.g., a room) that includes a device 10 having a display 15. In some implementations, the device 10 displays the content 20 to the user 25. For example, the content 20 may be a button, a user interface icon, a text box, a graphic, an avatar of the user or another user, or the like. In some implementations, the content 20 may occupy the entire display area of the display 15.
The device 10 obtains image data, motion data, and/or physiological data (e.g., pupil data, facial feature data, etc.) from the user 25 via a plurality of sensors (e.g., sensors 35a, 35b, and 35c). For example, the device 10 obtains eye gaze characteristic data 40b via the sensor 35b, upper facial characteristic data 40a via the sensor 35a, and lower facial characteristic data 40c via the sensor 35c.
While this example and other examples discussed herein show a single device 10 in a real-world environment 105, the techniques disclosed herein are applicable to multiple devices and other real-world environments. For example, the functions of device 10 may be performed by a plurality of devices with sensors 35a, 35b, and 35c located on each respective device, or distributed among them in any combination.
In some implementations, the plurality of sensors (e.g., sensors 35a, 35b, and 35c) may include any number of sensors that collect data related to the appearance of user 25. For example, when wearing a head-mounted device (HMD), one sensor (e.g., a camera within the HMD) may collect pupil data for eye tracking, and one sensor on a separate device (e.g., a camera with a wide-angle view) may be able to capture all facial feature data of the user. Alternatively, if the device 10 is an HMD, a separate device may not be necessary. For example, if the device 10 is an HMD, in one implementation, the sensor 35b may be located within the HMD to capture pupil data (e.g., eye gaze characteristic data 40b), and additional sensors (e.g., sensors 35a and 35c) may be located on the HMD but on an outer surface of the HMD facing the user's head/face to capture facial feature data (e.g., capturing upper facial feature data 40a via sensor 35a and capturing lower facial feature data 40c via sensor 35c).
In some implementations, as shown in fig. 1, the device 10 is a handheld electronic device (e.g., a smart phone or tablet computer). In some implementations, the device 10 is a laptop computer or a desktop computer. In some implementations, the device 10 has a touch pad, and in some implementations, the device 10 has a touch sensitive display (also referred to as a "touch screen" or "touch screen display"). In some implementations, the device 10 is a wearable device, such as an HMD.
In some implementations, the device 10 includes an eye tracking system for detecting eye position and eye movement via the eye gaze characteristic data 40b. For example, the eye tracking system may include one or more Infrared (IR) Light Emitting Diodes (LEDs), an eye tracking camera (e.g., a Near IR (NIR) camera), and an illumination source (e.g., an NIR light source) that emits light (e.g., NIR light) toward the eyes of the user 25. Further, the illumination source of the device 10 may emit NIR light to illuminate the eyes of the user 25, and the NIR camera may capture images of the eyes of the user 25. In some implementations, images captured by the eye tracking system may be analyzed to detect the position and movement of the eyes of user 25, or to detect other information about the eyes such as color, shape, state (e.g., wide open, squinting, etc.), pupil dilation, or pupil diameter. Further, gaze points estimated from the eye-tracking images may enable gaze-based interactions with content shown on a near-eye display of the device 10.
In some implementations, the device 10 has a Graphical User Interface (GUI), one or more processors, memory, and one or more modules, programs, or sets of instructions stored in the memory for performing a plurality of functions. In some implementations, the user 25 interacts with the GUI through finger contacts and gestures on the touch-sensitive surface. In some implementations, these functions include image editing, drawing, rendering, word processing, web page creation, disk editing, spreadsheet making, game playing, phone calls, video conferencing, email sending and receiving, instant messaging, fitness support, digital photography, digital video recording, web browsing, digital music playing, and/or digital video playing. Executable instructions for performing these functions may be included in a computer-readable storage medium or other computer program product configured for execution by one or more processors.
In some implementations, the device 10 employs various physiological sensors, detection, or measurement systems. The detected physiological data may include, but is not limited to: electroencephalogram (EEG), electrocardiogram (ECG), electromyogram (EMG), functional near-infrared spectroscopy (fNIRS) signals, blood pressure, skin conductance, or pupillary response. Furthermore, the device 10 may detect multiple forms of physiological data simultaneously in order to benefit from the synchronized acquisition of physiological data. Furthermore, in some implementations, the physiological data represents involuntary data, i.e., responses that are not consciously controlled. For example, the pupillary response may be indicative of involuntary movement.
In some implementations, one or both eyes 45 of the user 25 (including one or both pupils 50 of the user 25) present physiological data (e.g., eye gaze characteristic data 40b) in the form of a pupillary response. The pupillary response of user 25 results in a change in the size or diameter of pupil 50, via the optic and oculomotor cranial nerves. For example, the pupillary response may include a constriction response, i.e., a narrowing of the pupil, or a dilation response, i.e., a widening of the pupil. In some implementations, the device 10 can detect a pattern of physiological data representing a time-varying pupil diameter.
The user data (e.g., upper facial feature characteristic data 40a, lower facial feature characteristic data 40c, and eye gaze feature data 40b) may change over time, and the device 10 may use the user data to generate and/or provide a representation of the user.
In some implementations, the user data (e.g., upper facial feature characteristic data 40a and lower facial feature characteristic data 40c) includes texture data for facial features, such as eyebrow movement, jaw movement, nose movement, cheek movement, and the like. For example, when a person (e.g., user 25) smiles, the upper and lower facial features (e.g., upper and lower facial feature characteristic data 40a, 40c) may include a large amount of muscle movement that may be replicated by a representation of the user (e.g., an avatar) based on captured data from the sensors 35.
According to some implementations, an electronic device (e.g., device 10) may generate an extended reality (XR) environment during a communication session and present the XR environment to one or more users. In contrast to a physical environment in which people may sense and/or interact without the assistance of an electronic device, an extended reality (XR) environment refers to a completely or partially simulated environment in which people sense and/or interact via an electronic device. For example, the XR environment may include Augmented Reality (AR) content, Mixed Reality (MR) content, Virtual Reality (VR) content, and the like. With an XR system, a subset of a person's physical movements, or representations thereof, is tracked, and in response, one or more characteristics of one or more virtual objects simulated in the XR environment are adjusted in a manner consistent with at least one physical law. As one example, the XR system may detect head movement and, in response, adjust the graphical content and sound field presented to the person in a manner similar to the way such views and sounds would change in the physical environment. As another example, the XR system may detect movement of an electronic device (e.g., mobile phone, tablet, laptop, etc.) presenting the XR environment, and in response, adjust the graphical content and sound field presented to the person in a manner similar to how such views and sounds would change in the physical environment. In some cases (e.g., for accessibility reasons), the XR system may adjust characteristics of graphical content in the XR environment in response to representations of physical movements (e.g., voice commands).
There are many different types of electronic systems that enable a person to sense and/or interact with various XR environments. Examples include head-mounted systems, projection-based systems, head-up displays (HUDs), vehicle windshields integrated with display capabilities, windows integrated with display capabilities, displays formed as lenses designed for placement on a person's eyes (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smartphones, tablets, and desktop/laptop computers. The head-mounted system may have an integrated opaque display and one or more speakers. Alternatively, the head-mounted system may be configured to accept an external opaque display (e.g., a smart phone). The head-mounted system may incorporate one or more imaging sensors for capturing images or video of the physical environment, and/or one or more microphones for capturing audio of the physical environment. The head-mounted system may have a transparent or translucent display instead of an opaque display. The transparent or translucent display may have a medium through which light representing an image is directed to the eyes of a person. The display may utilize digital light projection, OLED, LED, uLED, liquid crystal on silicon, laser scanning light sources, or any combination of these techniques. The medium may be an optical waveguide, a holographic medium, an optical combiner, an optical reflector, or any combination thereof. In some implementations, the transparent or translucent display may be configured to selectively become opaque. Projection-based systems may employ retinal projection techniques that project a graphical image onto a person's retina. The projection system may also be configured to project the virtual object into the physical environment, for example as a hologram or on a physical surface.
Fig. 2 illustrates an example of a 3D representation 200 of at least a portion of a user in accordance with some implementations. For example, 3D representation 200 may represent a portion of user 25 after being scanned by one or more sensors of device 10 (e.g., during an enrollment process). In an exemplary implementation, the 3D representation 200 may be generated using a pixel-aligned implicit function (PIFu) technique that locally aligns pixels of 2D registration images with a global context to form the 3D representation 200 (also referred to as a PIFu mesh). The 3D representation 200 includes a plurality of vertices and polygons that may be determined during a registration process based on image data (such as RGB data and depth data). For example, as shown in the expanded region 202, the vertex 204 is circled to illustrate a point shared by two or more polygons that are part of the 3D PIFu mesh.
In some implementations, the 3D representation 200 is determined during the registration process, which takes place in a particular physical environment (e.g., the real-world environment 105 of fig. 1). The physical environment of the registration may be associated with a registration lighting condition. For example, the registration lighting condition may include particular brightness values and other lighting attributes (e.g., incandescent light, sunlight, etc.) that may affect the appearance of the 3D representation 200.
Figs. 3A and 3B illustrate examples of two-dimensional (2D) manifold surfaces provided as visualizations of a height field representation of a face, according to some implementations. The "height field representation" may also be referred to herein as a parameterized grid. In particular, FIG. 3A illustrates an exemplary environment 300A of a height field representation of a face that combines three different types of data to provide the height field representation shown in the face representation grid 308. These different types of data include RGB data 302, alpha data 304, and depth data 306. For each frame of acquired image data, the techniques described herein determine RGB data 302, alpha data 304, and depth data 306, and provide the resulting "RGBDA" data as shown by the face representation grid 308. For example, the face representation grid 308 provides a mapping to locations on the 2D manifold based on ray origins and ray directions. The face representation grid 308, or ray grid, provides depth data to generate and/or update a 3D reconstruction of the face (e.g., when a user moves his or her face, such as when speaking in a communication session). Figs. 3B and 4 further describe applications of the face representation grid 308.
FIG. 3B illustrates an exemplary environment 300B of a two-dimensional manifold surface provided as a visualization of a user's facial representation, according to some implementations. In particular, environment 300B illustrates a parameterized image 320 of a facial representation of a user (e.g., user 25 of fig. 1). Parameterized image 320 shows a more detailed illustration of the face representation grid 308 of fig. 3A. For example, a frame-specific representation instruction set may obtain live image data of a user's face (e.g., image 310) and parameterize different points on the face based on the surface of a shape (such as cylinder 315). In other words, the frame-specific representation instruction set may generate a set of values that represent the 3D shape and appearance of the user's face at a point in time for generating a user representation (e.g., an avatar). In some implementations, using a surface with a non-planar shape (e.g., cylinder 315) provides less distortion than using a flat/planar surface or using a single point. The set of values includes depth values (e.g., the vector arrows pointing toward the represented face, similar to a height field, height map, or parameterized grid) defining the depth of a portion of the face relative to a plurality of points on the surface (e.g., points in a grid on a partially cylindrical surface, such as point array 325). The parameterized values may include fixed parameters such as ray position, end points, direction, etc., and the parameterized values may include varying parameters such as depth, color, texture, opacity, etc., updated with live image data. For example, as shown in the expanded region 330 of the user's nose, the depth value of a point (e.g., point 332 at the tip of the user's nose) may define that a portion of the face is at a depth D1 behind the point's location on the surface, e.g., at depth D1 along a ray that begins at the point and is orthogonal to the surface at that point.
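For illustration only, the following sketch computes a ray origin and orthogonal (inward) direction for a grid point on a partially cylindrical surface and the corresponding 3D face point at a given depth; the cylinder dimensions are assumed values, not taken from this disclosure.

```python
import numpy as np

def cylinder_ray(u, v, radius=0.12, height=0.25, arc=np.pi):
    """Ray origin and inward direction for a grid point (u, v) in [0, 1]^2 on a
    partially cylindrical surface (dimensions here are illustrative only)."""
    theta = (u - 0.5) * arc                       # angle across the partial cylinder
    origin = np.array([radius * np.sin(theta),    # point on the cylinder surface
                       (v - 0.5) * height,
                       radius * np.cos(theta)])
    direction = np.array([-np.sin(theta), 0.0, -np.cos(theta)])  # orthogonal, pointing inward
    return origin, direction

def face_point(u, v, depth):
    """3D face point at depth D behind the surface point, along the orthogonal ray."""
    origin, direction = cylinder_ray(u, v)
    return origin + depth * direction
```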
The techniques described herein use depth values that are different from those in existing RGBDA images (e.g., red-green-blue-depth-alpha images) because existing RGBDA images define content depth relative to a single camera location/point, while the techniques described herein define the depth of portions of a face relative to multiple points on a surface having a non-planar shape (e.g., a curved shape such as a cylinder). A curved surface (such as cylinder 315) implemented for parameterized image 320 is used to reduce distortion of the user representation (e.g., the avatar) in areas of the user representation that are not visible from a flat projection surface. In some implementations, a projection surface having a non-planar shape can be bent and shaped in any manner to mitigate distortion in desired regions based on the parameterization applied. The use of different bend/curve shapes allows the user representation to be clearly presented from more viewpoints.
Fig. 3B shows the points of the surface (e.g., a 2D manifold surface) as being spaced apart at regular intervals along vertical and horizontal lines on the surface (e.g., evenly spaced vector arrows pointing toward the represented face of the user). In some implementations, the points may be unevenly distributed across the 2D manifold surface, e.g., irregularly spaced along the vertical and horizontal grid lines of the surface, and may instead be concentrated on specific areas of the user's face. For example, regions of the facial structure where more detail/motion may be present may have more points, and regions where less detail/motion may be present, such as the forehead (less detail) and nose (less motion), may have fewer points. In some implementations, when generating a representation of a user (e.g., generating an avatar) during a communication session, the techniques described herein may selectively focus more on the areas of the eyes and mouth, which will likely move more during a conversation, thus generating a more accurate representation of the person during the communication session. For example, the techniques described herein may present updates to the user representation around the mouth and eyes at a faster frame rate than other portions of the face (e.g., forehead, ears, etc.) that do not move as much during a conversation.
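The following sketch illustrates, under assumed regions and refresh rates, how updates around the eyes and mouth could be scheduled at a faster rate than other facial regions; the region names and rates are hypothetical and serve only as an illustrative scheduling policy.

```python
# Hypothetical per-region refresh rates (updates per second); illustrative only.
REGION_UPDATE_HZ = {"eyes": 60, "mouth": 60, "cheeks": 30, "forehead": 15, "ears": 15}

def regions_to_update(frame_index, base_hz=60):
    """Select which face regions to refresh on this frame so that high-motion regions
    (eyes, mouth) are updated more often than low-motion regions."""
    selected = []
    for region, hz in REGION_UPDATE_HZ.items():
        stride = max(1, round(base_hz / hz))
        if frame_index % stride == 0:
            selected.append(region)
    return selected
```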
FIG. 4 illustrates an exemplary environment 400 for updating portions of a user's facial representation in accordance with some implementations. In particular, fig. 4 illustrates an application that utilizes a face representation grid 410 (e.g., face representation grid 308) and updated depth data 420, and maps the updated face representation grid 410 onto the user's face as shown in mapped image 430. The updated mapped image 430 may then be utilized to update the user's representation 440 in real time (e.g., as additional frames of RGBDA data are obtained). In an exemplary implementation, the mapping data is based on 3D reference points defined relative to the skeletal representation, such as an atlas joint defined for the user, as further described herein with reference to figs. 5A, 5B, and 6.
Figs. 5A and 5B illustrate examples of 3D reference points defined relative to a skeletal representation of a user according to some implementations. Figs. 5A and 5B illustrate a user (e.g., user 25 in fig. 1) in different head positions and orientations to illustrate different bone positions. In particular, figs. 5A and 5B each illustrate a 3D reference point 510 determined based on an offset 515 from a determined atlas joint 520. The 3D reference point 510 may be utilized to track the kinematic motion of the user by tracking skeletal motion relative to the atlas joint (e.g., providing tracking with the x-axis aligned with the ear canal and the z-axis relative to the Frankfurt plane). In some implementations, the 3D reference point 510 is associated with the center of the user's eyes, the center being defined at a location offset from the atlas joint. For example, during the registration process, an offset providing the pupil origin in the parameterized grid (e.g., the height field representation) may be determined. In some implementations, the 3D reference point can be a point centered between the eyes of the user based on the skeleton's atlas joint and the user's specific head shape characteristics (e.g., the offset position of the 3D reference point 510 relative to the position of the atlas joint 520, determined based on the offset 515 in the figure). An example of combining a predetermined 3D representation with a parameterized grid to generate a representation of a portion of a user using the 3D reference point 510 is further described herein with reference to fig. 6.
FIG. 6 illustrates an exemplary environment 600 in which a predetermined 3D representation and a parameterized grid are combined to generate a representation of a portion of a user based on 3D reference points, according to some implementations. In an exemplary implementation, at step 610, a predetermined 3D representation 612 (e.g., 3D representation 200) is obtained (e.g., from a registration process), the predetermined 3D representation including a location of a 3D reference point 602 (e.g., 3D reference point 510 associated with the center of the user's eyes, the center defined at a location offset from the atlas joint to track bone movement). Then at step 620, a frame of the parameterized mesh 622 is obtained and a depth matching process associated with the predetermined 3D representation 612 is initiated. For example, facial points of the parameterized mesh 622 (e.g., curved projection plane) are projected outward to find corresponding points on the predetermined 3D representation 612 (e.g., PIFu mesh). The parameterized grid 622 also includes the location of a 3D reference point 624 (e.g., 3D reference point 510 associated with the center of the user's eyes defined at a location offset from the atlas joint to track bone movement) that is used to initialize the mapping between the predetermined 3D representation 612 and the parameterized grid 622. The frame of the parameterized grid 622 is then combined with the predetermined 3D representation 612 based on the 3D reference points 602, 624 at step 630. At step 640, an updated representation 642 of the user is determined based on the mapped combination of the predetermined 3D representation 612 and the frame of the parameterized grid 622. In some implementations in which a parameterized grid 622 (e.g., a height field) is used to define a frame-specific 3D representation, the combination of data may be facilitated by mapping vertices of the predetermined 3D representation to locations on the parameterized grid 622 based on the 3D reference points (e.g., 3D reference points 602, 624). The mapping using 3D reference points enables frame-specific face data specified on the parameterized mesh 622 to be used directly to adjust the positions of vertices of the predetermined 3D representation 612. In some implementations, the position of a vertex can be adjusted by blending the predetermined vertex position with its frame-specific data vertex position (e.g., using a specified alpha value). In other words, the predetermined 3D representation vertices may be mapped onto the parameterized mesh 622, the parameterized mesh 622 is adjusted using real-time data corresponding to the user's head/face, and the adjusted parameterized mesh 622 represents a combined 3D representation of the user that combines the predetermined 3D representation with one of the frame-specific 3D representations.
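As an illustrative sketch of the alpha blending described above (with hypothetical array shapes), the position of each mapped vertex may be mixed between its predetermined position and its frame-specific position:

```python
import numpy as np

def blend_vertices(predetermined_positions, frame_positions, alpha):
    """Blend predetermined vertex positions with frame-specific positions.

    predetermined_positions: (N, 3) vertices of the predetermined (e.g., PIFu) mesh,
                             already mapped onto the parameterized grid via the 3D reference points
    frame_positions:         (N, 3) corresponding frame-specific vertex positions
    alpha:                   (N,)   per-vertex blend weights (1.0 = fully frame-specific)
    """
    alpha = np.asarray(alpha).reshape(-1, 1)
    return (1.0 - alpha) * np.asarray(predetermined_positions) + alpha * np.asarray(frame_positions)
```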
In some implementations, combining the predetermined 3D representation 612 with the corresponding frame-specific 3D representation of the parameterized grid 622 includes adjusting sub-portions (e.g., face portions) of the predetermined 3D representation 612. In some implementations, adjusting the sub-portion of the predetermined 3D representation 612 includes adjusting the positions of vertices of the predetermined 3D representation 612 (e.g., PIFu mesh, such as 3D representation 200 of fig. 2) and applying texture (e.g., parameterized mesh 622) based on each of the frame-specific 3D representations. For example, the adjustment may deform and color a predetermined sub-portion (e.g., a face) to correspond to the real-time shape and color of the portion (e.g., face) of the user at each moment in time.
FIG. 7 illustrates an example of generating and displaying portions of a facial representation of a user in accordance with some implementations. In particular, fig. 7 illustrates an exemplary environment 700 for a process of combining enrollment data 710 (e.g., enrollment image data 712 and the generated predetermined 3D representation 714) and live data 720 (e.g., live image data 722 and the generated frame-specific 3D representation 724) to generate user representation data 730 (e.g., avatar 735). The enrollment image data 712 shows images of a user (e.g., user 25 of fig. 1) during the enrollment process. For example, an enrollment avatar may be generated as the system obtains image data (e.g., RGB images) of the user's face while the user provides different facial expressions. For example, the user may be told to "lift your eyebrows", "smile", "frown", etc. to provide the system with a range of facial features for the enrollment process. An enrollment avatar preview may be shown to the user as the user provides the enrollment images, to give a visualization of the status of the enrollment process. In this example, the enrollment image data 712 includes four different user expressions; however, more or fewer expressions may be utilized to gather sufficient data for the enrollment process. The predetermined 3D representation 714 (e.g., 3D representation 200) includes a plurality of vertices and polygons that may be determined during the enrollment process based on image data, such as RGB data and depth data.
Live image data 722 represents an example of user images acquired while the device is in use, such as during an XR experience (e.g., live image data captured while using device 10 of fig. 1, such as an HMD). For example, live image data 722 represents images acquired when the user wears device 10 of fig. 1 as an HMD. For example, if the device 10 is an HMD, in one implementation, the sensor 35b may be located within the HMD to capture pupil data (e.g., eye gaze characteristic data 40b), and additional sensors (e.g., sensors 35a and 35c) may be located on the HMD but on an outer surface of the HMD facing the user's head/face to capture facial feature data (e.g., capturing upper facial feature data 40a via sensor 35a and capturing lower facial feature data 40c via sensor 35c). The frame-specific 3D representation 724 may be generated based on the obtained live image data 722.
User representation data 730 is an exemplary illustration of the user during the avatar display process. For example, the avatar 735A (side-facing) and avatar 735B (front-facing) are generated based on the acquired enrollment data 710 and updated as the system obtains and analyzes real-time image data of the live data 720 and updates the values of the parameterized surface (e.g., updates the values of the vector points of the array for the frame-specific 3D representation 724 for each frame of acquired live image data).
FIG. 8 illustrates an exemplary environment 800 for implementing a process for generating a combined 3D representation of a user for two different user representations based on one or more lighting conditions, according to some implementations. In particular, fig. 8 illustrates an exemplary environment 800 for a process of combining live user representation data 810 (e.g., generated frame-specific 3D representation 812, such as live data 720 of fig. 7) and re-lit registered user representation data 820 (e.g., generated re-lit predetermined 3D representation 842) to generate user representation data 850 (e.g., avatar 852).
The process of the exemplary environment 800 may be comparable to the exemplary environment 700 of fig. 7 in generating 3D representation data of live user representation data 810 (e.g., live data 720) and combining it with registered user representation data 820 (e.g., enrollment data 710) to generate combined user representation data 850 (e.g., user representation data 730). However, the exemplary environment 800 illustrates a process of extinguishing (e.g., removing the lighting conditions of the registration environment) and relighting the registered representation data (e.g., updating the re-lit registered user representation data 820 with the lighting conditions of the current or "live" environment). In other words, the relighting process allows the combination of the live user representation data 810 and the registered user representation data 820 to more accurately generate user representation data 850 with the same lighting condition data. For example, live user representation data 810 is collected in an environment (also referred to herein as the "live user environment") that includes some lighting condition information, such as live lighting data 814 (e.g., brightness values and other lighting attributes). In addition, during the registration process, a predetermined 3D representation 832 (e.g., predetermined 3D representation 714 of fig. 7) is acquired in an environment (also referred to herein as the "registration environment") that includes its own lighting condition information, such as registration lighting data 834 (e.g., brightness values and other lighting attributes), which may differ from the live lighting data 814. In accordance with the techniques described herein, the extinction module 830 may extinguish the predetermined 3D representation data 832 by removing the registration lighting data 834 from the data set associated with the predetermined 3D representation data 832 to generate an extinguished 3D representation 835. The relighting module 840 may then relight the extinguished 3D representation 835 with the live lighting data 814 to generate a re-lit predetermined 3D representation 842. Thus, the re-lit predetermined 3D representation 842 may be used for the re-lit registered user representation data 820 to be used in combination with the live user representation data 810 to generate the combined user representation data 850.
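For illustration, the following sketch shows one possible extinguishing/relighting computation under a simple diffuse model: the enrolled texture is divided by the shading implied by the registration lighting (e.g., diffuse irradiance sampled from an estimated IBL environment map) and multiplied by the shading implied by the target lighting. This is a sketch under assumed inputs and a swapped-in diffuse approximation, not necessarily the disclosed implementation.

```python
import numpy as np

def delight_and_relight(texture, normals, registration_irradiance, target_irradiance, eps=1e-4):
    """Sketch of extinguishing then re-lighting a texture under a diffuse model.

    texture:                 (N, 3) per-texel colors with registration lighting baked in
    normals:                 (N, 3) per-texel surface normals
    registration_irradiance: callable normal -> (3,) diffuse irradiance under the
                             registration lighting (e.g., sampled from an estimated IBL cube map)
    target_irradiance:       callable normal -> (3,) diffuse irradiance under the target
                             (first) lighting condition
    """
    texture = np.asarray(texture, dtype=np.float64)
    out = np.empty_like(texture)
    for i, n in enumerate(normals):
        shading_in = np.maximum(registration_irradiance(n), eps)
        shading_out = target_irradiance(n)
        albedo = texture[i] / shading_in          # extinguish: remove the baked-in lighting
        out[i] = albedo * shading_out             # re-light under the target lighting
    return np.clip(out, 0.0, 1.0)
```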
As shown in fig. 8, the combined user representation data 850 is an exemplary illustration of the user during the avatar display process. For example, the 3D representation 852 is generated based on the acquired re-lit registered user representation data 820 and updated as the system obtains and analyzes real-time image data of the live user representation data 810. For example, different values of the parameterized surface are updated (e.g., the values of the vector points of the array for the frame-specific 3D representation 812 are updated for each frame of acquired live image data) and matched to the lighting conditions of the live lighting data 814. For example, the 3D representation 852 is shown with lighting condition data 854 that matches the lighting conditions of the live lighting data 814.
In some implementations, the lighting condition adjustments may be applied to other portions of the predetermined 3D representation data 832 (such as the user's hands), where the data does not overlap with the frame-specific representation data 812, which may correspond to only the user's face and/or head (e.g., a cylindrical 2D shape designed to update facial features during a communication session). In some implementations, filtering techniques may be utilized to identify different portions of the frame-specific representation data 812 that correspond to non-skin features (e.g., hair, clothing, etc.). In some implementations, filtering the frame-specific representation data 812 includes identifying portions of a sample 3D representation that correspond to the user's hair or clothing. In some implementations, filtering the sample 3D representation includes excluding non-skin features from the filtered sample 3D representation. Such non-skin features may be identified via an algorithm or machine learning model (e.g., using a semantic segmentation algorithm or model). In some implementations, filtering the sample 3D representation includes excluding portions of the sample 3D representation based on brightness (e.g., using only the top 25% by brightness to address shadowing). In some implementations, adjusting the first 3D representation includes generating a transform based on the first 3D representation and the filtered sample 3D representation. In some implementations, adjusting the first 3D representation includes applying the transform to change each subsection (e.g., texel) of the first 3D representation (e.g., correcting the illumination or color of all parts of the hand).
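A minimal sketch of the filtering and transform steps described above, assuming a simple per-channel gain as the transform and a precomputed skin mask (e.g., from semantic segmentation); the actual transform used by an implementation may differ.

```python
import numpy as np

def transform_from_filtered_sample(first_rep_texels, sample_texels, skin_mask, top_fraction=0.25):
    """Estimate a per-channel gain that adjusts the first 3D representation's colors
    toward the filtered sample's skin color.

    first_rep_texels: (M, 3) texel colors of the first 3D representation (e.g., the hand)
    sample_texels:    (N, 3) texel colors of the sample 3D representation
    skin_mask:        (N,) boolean mask excluding non-skin features such as hair/clothing
    top_fraction:     keep only the brightest fraction of skin texels to reduce the
                      influence of shadowed samples
    """
    skin = sample_texels[skin_mask]
    brightness = skin.mean(axis=1)
    cutoff = np.quantile(brightness, 1.0 - top_fraction)
    bright_skin = skin[brightness >= cutoff]
    return bright_skin.mean(axis=0) / np.maximum(first_rep_texels.mean(axis=0), 1e-4)

def apply_transform(texels, gain):
    """Apply the transform to every subsection (texel), e.g., of a hand representation."""
    return np.clip(texels * gain, 0.0, 1.0)
```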
FIG. 9 is a system flow diagram of an exemplary environment 900 in which a system may generate a combined representation of a user based on predetermined representation data and frame-specific representation data, according to some implementations. In some implementations, the system flow of the example environment 900 may be performed between two or more devices (e.g., the device 10 of fig. 1) such as a mobile device, a desktop computer, a laptop computer, or a server device. The images of the example environment 900 may be displayed on a device (e.g., the device 10 of fig. 1), such as a Head Mounted Device (HMD), having a screen for displaying images and/or a screen for viewing stereoscopic images. In some implementations, the system flow of the exemplary environment 900 is performed on processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the system flow of the example environment 900 is performed on a processor executing code stored in a non-transitory computer readable medium (e.g., memory).
In some implementations, the system flow of the exemplary environment 900 includes a registration process, a predetermined representation process, a frame-specific representation process, and an avatar display process based on the combined representation. Alternatively, the example environment 900 may include only the predetermined representation process, the frame-specific representation process, and the avatar display process, and obtain registration data from another source (e.g., previously stored registration data). In other words, the registration process may have already occurred such that the registration data of the user has already been provided. In an exemplary implementation, the system flow of the exemplary environment 900 for the avatar display process is performed at a receiving device for displaying an avatar and obtains data from a transmitting device, wherein the transmitting device performs the registration process, the predetermined representation process, and the frame-specific representation process.
The system flow of the registration process of the exemplary environment 900 collects image data (e.g., RGB data) from sensors of a physical environment (e.g., the physical environment 105 of fig. 1) and generates registration data. The registration data may include texture, muscle activation, etc. of most, if not all, of the user's face. In some implementations, the enrollment data may be captured when different instructions for acquiring different poses of the user's face are provided to the user. For example, the user may be told to "lift your eyebrows", "smile", "frowning", etc. to provide the system with a series of facial features for the registration process.
The system flow of the avatar display process of the exemplary environment 900 captures image data (e.g., RGB, depth, IR, etc.) from sensors of a physical environment (e.g., physical environment 105 of fig. 1), determines parameterized data of facial features, obtains and evaluates registration data, and generates and displays portions of a user's facial representation (e.g., 3D avatar) based on the parameterized values. For example, the techniques described herein to generate and display portions of a user's facial representation may be implemented on real-time sensor data streamed to an end user (e.g., a 3D avatar overlaid onto an image of a physical environment within a CGR environment). In one exemplary implementation, the avatar display process occurs during real-time display (e.g., the avatar is updated in real-time as the user makes facial gestures and changes to his or her facial features). Alternatively, the avatar display process may occur when analyzing streaming image data (e.g., generating a 3D avatar of a person from video).
In one exemplary implementation, environment 900 includes an image composition pipeline that collects or obtains data of a physical environment (e.g., image data from image sources such as sensors 912A-912N). The example environment 900 is an example of acquiring image sensor data (e.g., light intensity data - RGB) for a registration process to generate registration data 924 (e.g., image data of different head poses and/or different facial expressions), performing a predetermined representation process with the registration data 924, and acquiring image sensor data 915 (e.g., light intensity data, depth data, and position information) for a frame-specific representation process over a plurality of image frames. In some implementations, the registration data 924 includes lighting data related to the lighting conditions of the physical environment during registration (e.g., registration lighting data 834). For example, diagram 906 (e.g., exemplary environment 100 of fig. 1) represents image data gathered of a user (e.g., user 25) as the user scans his or her face and facial features in a physical environment (e.g., physical environment 105 of fig. 1) during the registration process. Image 916 represents image data gathered as the user scans his or her face and facial features in real time (e.g., during a communication session). The image sensors 912A, 912B-912N (hereinafter referred to as sensors 912) may include one or more depth cameras that collect depth data, one or more light intensity cameras (e.g., RGB cameras) that collect light intensity image data (e.g., a sequence of RGB image frames), one or more position sensors for collecting positioning information, and/or other sensors for collecting data of the environment (e.g., live illumination data 917).
For positioning information, some implementations include a visual inertial odometry (VIO) system to estimate distance traveled by determining equivalent odometry information using sequential camera images (e.g., light intensity data). Alternatively, some implementations of the present disclosure may include a SLAM system (e.g., position sensors). The SLAM system may include a multi-dimensional (e.g., 3D) laser scanning and range measurement system that is GPS independent and provides real-time simultaneous localization and mapping. The SLAM system can generate and manage very accurate point cloud data generated from reflections of laser scans from objects in the environment. Over time, the movement of any point in the point cloud is accurately tracked so that the SLAM system can use the points in the point cloud as reference points for position, maintaining a precise understanding of its position and orientation as it travels through the environment. The SLAM system may also be a visual SLAM system that relies on light intensity image data to estimate the position and orientation of the camera and/or device.
In an exemplary implementation, the environment 900 includes a registration instruction set 920 configured with instructions executable by a processor to generate registration data 924 from sensor data. For example, the registration instruction set 920 collects image data such as that of diagram 906 (e.g., light intensity image data, such as RGB images from a light intensity camera) from the sensors and generates registration data 924 of the user (e.g., facial feature data such as texture, muscle activation, etc.). For example, the registration instruction set 920 generates registration data 924 (e.g., the enrollment data 710 of fig. 7). In some implementations, the registration data 924 includes registration lighting data (e.g., registration lighting data 834 of fig. 8).
In an exemplary implementation, environment 900 further includes a predetermined representation instruction set 930 configured with instructions executable by the processor to generate, from the registration data 924 (e.g., registration images 926), a 3D representation 937 (e.g., a PIFu mesh) representing the 3D shape and appearance of the user's upper torso, head, and face at various points in time during the registration process. In some implementations, the predetermined representation instruction set 930 includes an extinction module 932 (e.g., extinction module 830 of fig. 8) configured to remove the registration lighting data from the registration data 924 to generate an extinguished 3D representation 938 (e.g., extinguished 3D representation 835 of fig. 8) of the representation data 934. For example, the predetermined representation instruction set 930 collects registration data 924 such as light intensity image data (e.g., a live camera feed, such as RGB data from a light intensity camera), depth image data (e.g., depth data from a depth camera such as an infrared or time-of-flight sensor), registration lighting data, and other physical environment information sources (e.g., camera positioning information such as position and orientation data from a position sensor, e.g., pose data) of a user in a physical environment (e.g., user 25 in physical environment 105 of fig. 1), and generates representation data 934 (e.g., muscle activations, geometry, latent space of facial expressions, etc.), which may include extinguished representation data (e.g., extinguished 3D representation 835 of fig. 8). In addition, the predetermined representation instruction set 930 determines reference data 936 that associates the representation data 934 with a 3D reference point (e.g., 3D reference point 510 of fig. 5) defined relative to the skeletal representation of the user. For example, the 3D representation 937 or the extinguished 3D representation 938 may be generated using PIFu techniques that locally align pixels of the 2D registration images 926 with a global context to form the 3D representation 937 (also referred to as a PIFu mesh). The 3D representations 937, 938 (e.g., the representation 200 of fig. 2) include a plurality of vertices and polygons that may be determined during the registration process based on image data, such as RGB data and depth data.
In an exemplary implementation, the environment 900 includes a frame-specific representation instruction set 940 configured with instructions executable by a processor to generate representation data 942 from live image data (e.g., sensor data 915) and illumination data (e.g., live illumination data 917) from the current environment, which may include a set of values (e.g., appearance values, depth values, etc.) that represent the 3D shape and appearance of the user's face at a point in time. In some implementations, the sensor data 915 includes information regarding lighting conditions of the environment in which the sensor data 915 was collected (e.g., live lighting data 814). For example, the frame-specific representation instruction set 940 gathers sensor data 915 from the sensor 912, such as light intensity image data (e.g., a live camera feed, such as RGB data from a light intensity camera), depth image data (e.g., depth image data from a depth camera, such as an infrared or time-of-flight sensor), live illumination data, and other physical environment information sources (e.g., camera positioning information, such as position and orientation data from a position sensor, e.g., pose data) of a user in a physical environment (e.g., user 25 in physical environment 105 of fig. 1), and generates parameterized data (e.g., muscle activation, geometry, a latent space of facial expressions, etc.) for the facial parameterization of the representation data 942. For example, the parametric data may be represented by a parametric image 946 (e.g., the parametric image 320 discussed herein with respect to fig. 3B) by varying parameters such as appearance values (e.g., texture data, color data, opacity, etc.) and depth values of different points of the face based on the sensor data 915. Facial parameterization techniques for the frame-specific representation instruction set 940 may include acquiring partial views from the sensor data 915 and determining a small parameter set (e.g., facial muscles) from a geometric model to update the user representation. For example, the geometric model may include data sets for the eyebrows, eyes, cheeks under the eyes, mouth region, mandibular region, and the like. Parameterized tracking of the frame-specific representation instruction set 940 may provide the geometry of the user's facial features. In addition, the frame-specific representation instruction set 940 determines reference data 944 that associates the representation data 942 with a 3D reference point (e.g., 3D reference point 510 of fig. 5) defined relative to the skeletal representation of the user. For example, the frame-specific representation instruction set 940 may generate the representation data 942 and generate the parametric image 946 (e.g., the parametric image 320 of fig. 3).
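For illustration, the following sketch shows one possible (assumed) layout for frame-specific representation data of this kind: per-point appearance values and depth values on a parameterized grid, together with a 3D reference point and timestamp. The dataclass fields, the parameterize_face helper, and the nearest-neighbor resampling are illustrative assumptions, not the disclosed parameterization (which may be produced by a machine learning model).

```python
# A minimal sketch, under assumed data layouts, of frame-specific representation data:
# RGBA appearance values and depth values on a parameterized grid tied to a 3D reference
# point. Field names and the parameterize_face helper are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameSpecificRepresentation:
    appearance: np.ndarray       # (H, W, 4) RGBA appearance values on the parameterized grid
    depth: np.ndarray            # (H, W) depth of the face along each grid point's ray
    reference_point: np.ndarray  # (3,) 3D reference point defined relative to the skeleton
    timestamp: float

def parameterize_face(rgb_image: np.ndarray, depth_image: np.ndarray,
                      reference_point: np.ndarray, timestamp: float,
                      grid_shape=(128, 96)) -> FrameSpecificRepresentation:
    """Toy stand-in for the learned parameterization: resample live RGB-D data onto the grid."""
    h, w = grid_shape
    # Nearest-neighbor resample of the captured images onto the parameterized grid.
    ys = np.linspace(0, rgb_image.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, rgb_image.shape[1] - 1, w).astype(int)
    rgb = rgb_image[ys][:, xs] / 255.0
    alpha = np.ones((h, w, 1))
    appearance = np.concatenate([rgb, alpha], axis=-1)
    depth = depth_image[ys][:, xs].astype(float)
    return FrameSpecificRepresentation(appearance, depth, reference_point, timestamp)
```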
In the exemplary implementation, environment 900 also includes a combined representation instruction set 950. In an exemplary implementation, the combined representation instruction set 950 is located at a receiving device for displaying the combined representation, and the sensor 912 and the other instruction sets (e.g., the registration instruction set 920, the predetermined representation instruction set 930, and the frame-specific representation instruction set 940) are located at another device (e.g., the device of the user whose avatar will be generated at the receiving device from the combined representation data). Alternatively, in some implementations, the combined representation instruction set 950, the sensor 912, and the other instruction sets (e.g., the registration instruction set 920, the predetermined representation instruction set 930, and the frame-specific representation instruction set 940) are located at another device (e.g., a transmitting device), such that the receiving device receives the data for the combined representation from the other device for display.
The combined representation instruction set 950 is configured with instructions that are executable by a processor to generate a representation (e.g., a 3D avatar) of the user from the representation data 934 (e.g., extinguished representation data such as the extinguished 3D representation 835 of fig. 8) and the reference data 936, based on the 3D reference points provided by the reference data 936, 944 (e.g., via the alignment module 954). The alignment module 954 provides instructions to identify the 3D reference points in both the reference data 936 and the reference data 944 in order to align the data sets. In some embodiments, the combined representation instruction set 950 includes a relighting module 952 (e.g., relighting module 840) configured to add the live illumination data 917 to the extinguished 3D representation 938 and generate a re-lit 3D representation 958 (e.g., 3D representation 852 of fig. 8) as part of the combined representation data 956. In addition, the combined representation instruction set 950 is configured with instructions that are executable by the processor to display portions of the representation based on corresponding alignment data as the representation data 942 is updated (e.g., as the live image data and the live illumination data 917 are acquired from another device and processed by the frame-specific representation instruction set 940). For example, the combined representation instruction set 950 collects the extinguished representation data 934 from the predetermined representation instruction set 930, collects the live illumination data 917 and the representation data 942 (e.g., updated appearance and depth values from the live image data) from the frame-specific representation instruction set 940, and generates the combined representation data 956 (e.g., a real-time representation of the user, such as a 3D avatar).
In some implementations, the combined representation instruction set 950 may be repeated for each frame captured during each moment/frame of a live communication session or other experience. For example, for each iteration, while the user is using the device (e.g., wearing an HMD), the example environment 900 may involve continuously obtaining the representation data 942 (e.g., appearance values and depth values) and the live illumination data 917 and, for each frame, updating the displayed portions of the 3D representation 958 based on the updated values. For example, for each new frame of parametric data and lighting data, the system may update the display of the 3D representation 958 (e.g., a live avatar) based on the new data.
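The following high-level sketch illustrates one way such a per-frame loop could be organized: align the extinguished predetermined representation with each frame-specific representation via the shared 3D reference point, re-light it with the live illumination data, blend in the frame-specific face data, and render the result. The helper functions (relight, blend_face_region, render) and mesh attributes are assumptions for illustration, not the disclosed implementation.

```python
# A high-level sketch (assumptions, not the disclosed implementation) of the per-frame
# loop run by a combined representation instruction set.
def update_combined_representation(extinguished_mesh, enroll_reference_point,
                                   frame_stream, relight, blend_face_region, render):
    for frame in frame_stream:  # one frame-specific representation + live illumination per moment
        # Align: translate the predetermined mesh so its 3D reference point matches the frame's.
        offset = frame.reference_point - enroll_reference_point
        aligned_vertices = extinguished_mesh.vertices + offset

        # Re-light the extinguished (illumination-free) enrollment colors with live lighting.
        relit_colors = relight(extinguished_mesh.albedo, frame.live_illumination)

        # Blend the frame-specific face sub-portion into the predetermined representation.
        vertices, colors = blend_face_region(aligned_vertices, relit_colors, frame)

        # Display the updated combined 3D representation (e.g., a live avatar) for this frame.
        render(vertices, extinguished_mesh.faces, colors)
```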
Fig. 10 illustrates exemplary electronic devices operating in different physical environments during a communication session between a first user at a first device and a second user at a second device, and a view at the first device of a 3D representation of the second user, according to some implementations. In particular, FIG. 10 illustrates an exemplary operating environment 1000 of electronic devices 1010, 1065 operating in different physical environments 1002, 1050, respectively, during a communication session (e.g., when the electronic devices 1010, 1065 share information with each other or with an intermediary device such as a communication session server). In this example of fig. 10, the physical environment 1002 is a room that includes a wall-mounted ornament 1012, a plant 1014, and a table 1016. The electronic device 1010 includes one or more cameras, microphones, depth sensors, or other sensors that may be used to capture and evaluate information about the physical environment 1002 and the objects therein, as well as information about the user 1025 of the electronic device 1010. Information about the physical environment 1002 and/or the user 1025 may be used to provide visual content (e.g., user representations) and audio content (e.g., text transcriptions) during a communication session. For example, a communication session may provide one or more participants (e.g., users 1025, 1060) with views of: a 3D environment generated based on camera images and/or depth camera images of the physical environment 1002, a representation of user 1025 based on camera images and/or depth camera images of user 1025, and/or a textual transcription of audio spoken by the user (e.g., a transcription bubble). As shown in FIG. 10, user 1025 is speaking to user 1060, as indicated by spoken words 1015.
In this example, the physical environment 1050 is a room that includes a wall-mounted ornament 1052, a sofa 1054, and a coffee table 1056. The electronic device 1065 includes one or more cameras, microphones, depth sensors, or other sensors that may be used to capture and evaluate information about the physical environment 1050 and the objects therein, as well as information about the user 1060 of the electronic device 1065. Information about the physical environment 1050 and/or the user 1060 may be used to provide visual and audio content during a communication session. For example, the communication session may provide a view of a 3D environment generated based on camera images and/or depth camera images of the physical environment 1050 (from the electronic device 1065) and a representation of the user 1060 based on camera images and/or depth camera images of the user 1060 (from the electronic device 1065). For example, the 3D environment may be transmitted by the device 1010 through the communication session instruction set 1080 (e.g., via network connection 1085), which communicates with the device 1065 through the communication session instruction set 1090. As shown in fig. 10, audio spoken by user 1025 (e.g., spoken words 1015) is transcribed at the device 1065 (or via a remote server) (e.g., via the communication session instruction set 1090), and the view 1066 provides the user 1060 with a textual transcription of the audio spoken by the speaker (user 1025) via a transcription bubble 1076 (e.g., "Nice avatar!").
Fig. 10 illustrates an example of a view 1005 of a virtual environment (e.g., 3D environment 1030) at the device 1010, in which a representation 1032 of the wall-mounted ornament 1052 and a user representation 1040 (e.g., an avatar of user 1060) are provided, provided that each user has consented to having his or her user representation viewed during the particular communication session. In particular, the user representation 1040 of the user 1060 is generated based on the combined user representation techniques described herein (e.g., for real-time generation of more realistic avatars). In some implementations, the user representation 1040 is generated using illumination data from the environment 1050, as opposed to using illumination data from the registered 3D representation (e.g., the registered representation data is extinguished, and the extinguished registered representation data is re-lit with the current illumination data of the environment). Alternatively, in some implementations, the user representation 1040 is generated using illumination data from the environment 1002. For example, the user representation 1040 (e.g., avatar) may be overlaid onto a live view (e.g., view 1005) of the environment 1002 of the user 1025, and the lighting conditions of the environment 1002 may be used in generating the user representation 1040. In other words, the ambient lighting of the environment 1002 may match the view of the user representation 1040.
In addition, the electronic device 1065 within the physical environment 1050 provides a view 1066 that enables the user 1060 to view, within the 3D environment 1070, a representation 1075 (e.g., an avatar) of at least a portion of the user 1025 (e.g., from the middle of the torso upward) and a transcription of the words spoken by the user 1025 via the transcription bubble 1076 (e.g., "Nice avatar!"). In other words, at the device 1010, a more realistic-looking avatar (e.g., the user representation 1040 of the user 1060) is generated by generating combined 3D representations of the user 1060 for a plurality of moments in time over a period of time based on data obtained from the device 1065 (e.g., a predetermined 3D representation of the user 1060 and corresponding frame-specific 3D representations of the user 1060). Alternatively, in some embodiments, the user representation 1040 of the user 1060 is generated at the device 1065 (e.g., the transmitting device of the speaker) and transmitted to the device 1010 (e.g., the viewing device that views the speaker's avatar). Specifically, each of the combined 3D representations 1040 of the user 1060 is generated by combining the predetermined 3D representation of the user 1060 with the corresponding frame-specific 3D representation of the user 1060 based on an alignment (e.g., aligning 3D reference points) in accordance with the techniques described herein.
In the example of fig. 10, electronic devices 1010 and 1065 are shown as handheld devices. The electronic devices 1010 and 1065 may be mobile phones, tablets, laptops, and so forth. In some implementations, the electronic devices 1010 and 1065 may be worn by a user. For example, the electronic devices 1010 and 1065 may be watches, head-mounted devices (HMDs), head-worn devices (e.g., eyeglasses), headphones, ear-mounted devices, and the like. In some implementations, the functionality of devices 1010 and 1065 is implemented via two or more devices, such as a mobile device and a base station or a head-mounted device and an ear-mounted device. Various functions may be distributed among multiple devices, including, but not limited to, power functions, CPU functions, GPU functions, storage functions, memory functions, visual content display functions, audio content production functions, and the like. The multiple devices that may be used to implement the functionality of electronic devices 1010 and 1065 may communicate with one another via wired or wireless communication. In some implementations, each device communicates with a separate controller or server that manages and coordinates the user's experience (e.g., a communication session server). Such a controller or server may be located in physical environment 1002 and/or physical environment 1050, or may be remote from those physical environments.
In addition, in the example of fig. 10, the 3D environments 1030 and 1070 are XR environments based on a common coordinate system that may be shared with other users (e.g., a virtual room for avatars in a multi-person communication session). In other words, the common coordinate system of the 3D environments 1030 and 1070 is different from the coordinate systems of the physical environments 1002 and 1050, respectively. For example, a common reference point may be used to align the coordinate systems. In some implementations, the common reference point may be a virtual object within the 3D environment that each user can visualize within their respective views. For example, the user representations (e.g., the users' avatars) may be positioned around a common centerpiece table within the 3D environment. Alternatively, the common reference point is not visible within each view. For example, a common coordinate system of the 3D environment may use a common reference point to locate each respective user representation (e.g., around a table or desk). Thus, if the common reference point is visible, each device's view will be able to visualize the "center" of the 3D environment for perspective when viewing other user representations. Visualizing the common reference point may become more relevant in a multi-user communication session, where each user's view gains perspective on the locations of the other users during the communication session.
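As an illustrative assumption (not the disclosed implementation), the following sketch shows one way user representations could be placed in a shared coordinate system around a common reference point such as a centerpiece table; the seating geometry and rigid-transform conventions are assumptions.

```python
# A minimal sketch, under assumed conventions, of placing user representations around a
# common reference point in the shared coordinate system of the 3D environment.
import numpy as np

def seat_transform(common_reference_point: np.ndarray, seat_index: int, num_seats: int,
                   radius: float = 1.2) -> np.ndarray:
    """Return a 4x4 transform placing one user representation around the reference point."""
    angle = 2.0 * np.pi * seat_index / num_seats
    position = common_reference_point + radius * np.array([np.cos(angle), 0.0, np.sin(angle)])
    # Face the common reference point: rotate about the vertical (y) axis.
    yaw = np.arctan2(common_reference_point[0] - position[0],
                     common_reference_point[2] - position[2])
    c, s = np.cos(yaw), np.sin(yaw)
    transform = np.eye(4)
    transform[:3, :3] = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    transform[:3, 3] = position
    return transform
```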
In some implementations, the representation of each user may be realistic or unrealistic and/or may represent the current and/or a previous appearance of the user, and may match the sender's or the viewer's illumination data. For example, a photorealistic representation of the user 1025 or 1060 may be generated based on a combination of the user's live images and live illumination data and previous images (e.g., enrollment data). The previous images may be used to generate portions of the representation for which actual image data is not available (e.g., portions of the user's face that are not in the field of view of a camera or sensor of the electronic device 1010 or 1065, or that may be obscured, for example, by a headset or otherwise). In one example, the electronic devices 1010 and 1065 are head-mounted devices (HMDs), and the live image data of the user's face includes downward-facing camera images of the user's cheeks and mouth and inward-facing camera images of the user's eyes, which may be combined with previous image data of other portions of the user's face, head, and torso that are not currently observable by the device's sensors. The prior data regarding the user's appearance may be obtained at an earlier time during the communication session, during a prior use of the electronic device, during a registration process for obtaining sensor data of the user's appearance from multiple perspectives and/or under multiple conditions, or otherwise.
Some implementations provide a representation of at least a portion of a user within a 3D environment other than the user's physical environment during a communication session and, based on detecting a condition, provide a representation of another object of the user's physical environment to give context. For example, during a communication session, representations of one or more other objects of the physical environment may be displayed in the view. For example, based on determining that user 1025 is interacting with a physical object in physical environment 1002, a representation (e.g., realistic or a proxy) of the object may be displayed in the view to provide context for the user 1025's interaction. For example, if the first user 1025 picks up an object such as a family photo frame to show to another user, the view may include a realistic view of the photo frame (e.g., live video). Thus, in displaying an XR environment, the view may present a virtual object representing the user picking up a generic object, display a virtual object similar to the photo frame, display a previously acquired image of the actual photo frame from an obtained 3D scan, and so forth.
Fig. 11 is a flowchart illustrating an exemplary method 1100. In some implementations, a device (e.g., device 10 of fig. 1 or device 1065 of fig. 10) performs the techniques of method 1100 to generate, from two user representations, a combined 3D representation of a user for multiple moments in a period of time based on an extinguishing/relighting technique, according to some implementations. In some implementations, the techniques of method 1100 are performed on a mobile device, desktop computer, laptop computer, HMD, or server device. In some implementations, the method 1100 is performed on processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the method 1100 is performed on a processor executing code stored in a non-transitory computer-readable medium (e.g., memory).
In some implementations, the method 1100 is implemented at a processor of a device, such as a viewing device, that presents the combined 3D representation (e.g., the device 1010 of fig. 10 presents the 3D representation 1040 (avatar) of the user 1060 based on data obtained over time from the device 1065).
At block 1110, the method 1100 obtains a first user representation of at least a first portion of the user, wherein the first user representation is generated via a first technique based on first image data obtained in a first physical environment via a first set of sensors, and the first user representation represents the user under a first lighting condition. The first technique further includes obtaining a sequence of frame-specific 3D representations corresponding to a plurality of time instants within a time period, each frame-specific 3D representation representing a second portion of the user at a respective one of the plurality of time instants within the time period, and each frame-specific 3D representation being associated with a 3D reference point and including light data associated with the first lighting condition (e.g., lighting information of the user's live environment).
In some implementations, the first user representation includes texture data generated via a machine learning model that is trained using training data obtained in one or more environments having the first lighting conditions via one or more sensors. In some implementations, a first lighting condition is provided in the one or more environments using a plurality of lights positioned in a fixed positional relationship to provide a uniform light distribution over the face of the training subject. In some implementations, the first lighting condition is based on uniformly distributed light. In some implementations, the first lighting condition is a lighting condition of a first physical environment (e.g., current lighting in a live environment of a user).
At block 1120, the method 1100 obtains a second user representation of at least a second portion of the user, the second user representation generated by the steps of blocks 1122, 1124, and 1126.
At block 1122, the method 1100 generates an initial user representation of the user based on second image data obtained via a second set of sensors in a second physical environment having second lighting conditions. For example, as discussed with respect to fig. 8, during the enrollment process a predetermined 3D representation 832 (e.g., the predetermined 3D representation 714 of fig. 7) is acquired in a particular environment (also referred to herein as an "enrollment environment") that includes lighting condition information such as enrollment lighting data 834 (e.g., brightness values and other lighting attributes), which may differ from the live lighting data 814 (e.g., two different physical environments at the time of enrollment and during generation of the avatar).
At block 1124, the method 1100 extinguishes the initial user representation based on the illuminated representation of the second physical environment. For example, as discussed in fig. 8, the extinction module 830 may extinguish the predetermined 3D representation data 832 by removing the registered lighting data 834 from the data set associated with the predetermined 3D representation data 832 to generate an extinguished 3D representation 835.
At block 1126, method 1100 generates the second user representation by re-lighting the extinguished initial user representation based on the first lighting condition. For example, the extinguished initial user representation is re-lit to match the illumination in the first physical environment. For example, as discussed in fig. 8, the relighting module 840 may then re-light the extinguished 3D representation 835 with the live illumination data 814 to generate a re-lit predetermined 3D representation 842. Thus, the re-lit predetermined 3D representation 842 may be used as the re-lit registered user representation data 820, which is combined with the live user representation data 810 to generate the combined user representation data 850.
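For illustration, the following simplified sketch captures the extinguish/re-light idea of blocks 1124 and 1126 under an assumed diffuse (Lambertian) model in which each vertex color is approximately an albedo multiplied by the irradiance for its normal; the irradiance callables stand in for the enrollment and live illumination representations (e.g., IBL cube maps) and are assumptions, not the disclosed algorithm.

```python
# A simplified sketch of extinguishing (removing enrollment lighting) and re-lighting
# (applying live lighting), assuming a Lambertian model: color = albedo * irradiance(normal).
import numpy as np

def extinguish(vertex_colors, vertex_normals, enrollment_irradiance, eps=1e-4):
    """Remove enrollment lighting: recover an approximate per-vertex albedo."""
    irr = np.stack([enrollment_irradiance(n) for n in vertex_normals])  # (N, 3)
    return vertex_colors / np.maximum(irr, eps)

def relight(albedo, vertex_normals, live_irradiance):
    """Re-apply lighting from the live (first) environment to the extinguished representation."""
    irr = np.stack([live_irradiance(n) for n in vertex_normals])  # (N, 3)
    return np.clip(albedo * irr, 0.0, 1.0)

# Usage sketch: relit = relight(extinguish(colors, normals, enroll_irr), normals, live_irr)
```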
In some implementations, the illumination representation of the second physical environment includes an omnidirectional image representation (e.g., an image-based lighting (IBL) cube map) of the second illumination condition of the second physical environment. In some implementations, the omnidirectional image representation of the second lighting condition of the second physical environment is generated by the machine learning model based on the second image data (e.g., sensor data captured during registration).
In some implementations, re-illuminating the second user representation includes matching the lighting attributes of the second lighting condition to the lighting attributes of the first user representation. In some implementations, the first lighting condition of the first physical environment is different from the second lighting condition of the second physical environment. For example, the enrollment data and the live data may come from the same environment under different lighting conditions (e.g., different lighting at the time of enrollment versus at the time of the live session). In some implementations, the first lighting condition may correspond to the equipment lighting used to generate training data for the machine learning model of the first user representation, and the second lighting condition corresponds to the lighting of the enrollment environment.
In some implementations, the second user representation is a predetermined 3D representation of at least the first portion of the user, the predetermined 3D representation being associated with a 3D reference point defined relative to the skeletal representation of the user. For example, the predetermined 3D representation may represent the upper body and head of the user. The predetermined 3D representation may be generated using a pixel alignment implicit function (PIFu) technique that locally aligns pixels of the 2D registered image with the global background to form the predetermined 3D representation (e.g., representation 200 of fig. 2).
In some implementations, the 3D reference point is associated with a 3D location of the atlas joint of the user's skeletal representation. For example, the 3D reference point may be a head/atlas joint, which may be determined by tracking the x-axis aligned with the ear canals and/or the z-axis in the Frankfurt plane. In some implementations, the 3D reference point is associated with a center of the user's eyes, the center defined at a location offset from the atlas joint. For example, during the registration process, an offset of the pupil origin may be determined in the provisioning of the parameterized mesh. In some implementations, the 3D reference point can be a point centered between the eyes of the user based on the skeleton's atlas joint and the user's specific head shape characteristics (e.g., the offset position of the 3D reference point 510 associated with the position of the atlas joint 520 determined based on the offset 515 in figs. 5A and 5B).
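As a small illustrative sketch (with assumed function names and conventions), the user-specific offset could be computed once during enrollment and then added to the tracked atlas joint at frame time to obtain the 3D reference point:

```python
# A minimal sketch, under assumed conventions, of deriving the 3D reference point from the
# atlas (head) joint plus a user-specific offset toward a point centered between the eyes.
import numpy as np

def enrollment_offset(atlas_joint: np.ndarray, eye_centers: np.ndarray) -> np.ndarray:
    """Computed once during enrollment: offset from the atlas joint to the eye midpoint.

    atlas_joint: (3,); eye_centers: (2, 3) left/right eye centers from enrollment data.
    """
    return eye_centers.mean(axis=0) - atlas_joint

def reference_point(current_atlas_joint: np.ndarray, offset: np.ndarray) -> np.ndarray:
    """Frame-time 3D reference point: the tracked atlas joint plus the stored offset."""
    return current_atlas_joint + offset
```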
At block 1130, the method 1100 generates a combined user representation based on the first user representation and the second user representation. In some implementations, the combined user representation is generated for a plurality of moments within the time period and is generated by combining the predetermined 3D representation with the corresponding frame-specific 3D representation based on an alignment, wherein the alignment is based on the 3D reference point. In some implementations, the frame-specific 3D representations may each represent a second portion of the user that is a sub-portion (e.g., only the face) of the first portion (e.g., the represented face), such that a frame-specific 3D representation may be combined with the predetermined 3D representation by simply adjusting the sub-portion (e.g., the face portion) of the predetermined 3D representation.
In some implementations, the method 1100 further includes providing a view of an adjusted combined user representation in a 3D environment, wherein the adjusted combined user representation is generated by adjusting the combined user representation based on at least one of one or more color attributes or one or more light attributes of the 3D environment. For example, the adjusted combined user representation may include an ambient lighting effect that matches the 3D environment being viewed. For example, a red lighting effect may be presented to a viewer in an environment, and the combined user representation of the sender may be adjusted to match the red lighting effect of the viewing environment.
In some implementations, the first user representation is a frame-specific 3D representation that may represent the user's face, so the first and second portions may overlap and represent some common area of the user. In some implementations, the second portion of the user is a sub-portion (e.g., only the face) of the first portion (e.g., the represented face). In some implementations, the first portion of the user includes a face portion and an additional portion of the user, and the second portion of the user includes the face portion without the additional portion of the user.
In some implementations, each frame-specific 3D representation may be generated based on sensor data captured by a sending device during a communication (e.g., coexistence) session. In some implementations, each frame-specific 3D representation may be generated using sensor data from an inward/downward facing camera and using registration data (e.g., images of faces of different expressions, images of facial portions that cannot be captured when the user wears the HMD, or images of portions that cannot be otherwise captured during actual use). In some implementations, each frame-specific 3D representation may represent a user's face using a curved parameterized grid positioned relative to 3D reference points.
In some implementations, the sequence of frame-specific 3D representations corresponding to the plurality of time instants in a time period is based on generating a set of values representing the user based on the sensor data, wherein the set of values (e.g., parameterized values) may include: i) depth values defining 3D positions of portions of the user relative to a plurality of 3D positions of points of a projection surface, and ii) appearance values defining the appearance of the portions of the user. For example, generating a set of values (e.g., RGB values, alpha values, and depth values (RGBDA)) representative of the user based on the sensor data may involve using live sensor data from an inward/downward-facing camera and using enrollment data, e.g., images of the face with different expressions captured without wearing the HMD. In some implementations, generating the set of values may involve using a machine learning model trained to generate the set of values.
The set of values may include depth values defining 3D positions of portions of the user relative to a plurality of 3D positions of points of the projection surface. For example, the depth value of a point may define a depth D1 of a portion of the face behind the position of that point on the surface, e.g., at the depth D1 along a ray (e.g., ray 332 of fig. 3B) that begins at the point. In some implementations, the depth value defines a distance between a portion of the user and the corresponding point of the projection surface, measured along a ray through the corresponding point perpendicular to the projection surface. The techniques described herein use depth values that are different from those in existing RGBDA images, which define the depth of content relative to a single camera position. The appearance values may include values defining the appearance of a portion of the user, such as RGB data and alpha data. For example, the appearance values may include color, texture, opacity, and the like.
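The following geometric sketch illustrates, under assumed grid and cylinder parameters, how a depth value at a grid point on a partially cylindrical projection surface can be converted into a 3D position along the ray perpendicular to the surface at that point; the specific radius, angular extent, and height are illustrative assumptions.

```python
# A geometric sketch, under assumed conventions: each grid point lies on a partially
# cylindrical 2D manifold (axis along y), and the user's surface point sits at the stored
# depth along the inward ray perpendicular to the manifold at that grid point.
import numpy as np

def grid_point_to_3d(u, v, depth, radius=0.25, angular_extent=np.pi / 2, height=0.3):
    """u, v in [0, 1] index the cylindrical grid; depth is measured along the inward normal."""
    theta = (u - 0.5) * angular_extent          # horizontal angle across the partial cylinder
    y = (v - 0.5) * height                      # vertical position along the cylinder axis
    surface_point = np.array([radius * np.sin(theta), y, radius * np.cos(theta)])
    inward_normal = np.array([-np.sin(theta), 0.0, -np.cos(theta)])  # perpendicular to surface
    return surface_point + depth * inward_normal
```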
In some implementations, the term "surface" refers to a 2D manifold that may be planar or non-planar. In some implementations, the points of the surface (e.g., a 2D manifold surface) are spaced at regular intervals along vertical and horizontal lines on the surface. In some implementations, the points are regularly spaced along vertical and horizontal grid lines on a partially cylindrical surface, as shown in figs. 3A and 3B. Alternatively, other planar and non-planar surfaces may be utilized. For example, a planar or cylindrical surface may be oriented/curved about different axes. Additionally or alternatively, the surface may be hemispherical in shape. In some implementations, the points may be unevenly distributed across the 2D manifold surface, e.g., irregularly spaced along vertical and horizontal grid lines on the surface, and may be concentrated on specific areas of the user's face. For example, regions of the facial structure where more detail/motion may be present may have more points, and regions where less detail/motion may be present, such as the forehead (less detail) and the nose (less motion), may have fewer points. For example, a higher density of points may be placed around the eyes and mouth.
In some implementations, the set of values is generated based on the alignment such that a subset of the points on a central region of the surface corresponds to a central portion of the user's face. For example, as shown in fig. 3B, the focal region of the user's nose is at region 330, and the characteristic point of ray 332 is the tip of the person's nose.
In some implementations, generating the set of values is further based on images of the user's face captured in different poses, and/or when the user is expressing a plurality of different facial expressions. For example, the set of values is determined based on registered images of the face when the user is facing the camera, left side of the camera, and right side of the camera, and/or when the user smiles, lifts the eyebrows, swells the cheeks, and the like. In some implementations, the sensor data corresponds to only a first region of the user (e.g., a portion not obscured by a device (such as an HMD)), and the set of image data (e.g., registration data) corresponds to a second region that includes a third region that is different from the first region. For example, the second region may include some of the portions that are obscured by the HMD when the HMD is worn by the user.
In some implementations, determining the user-specific parameterization (e.g., generating the set of values) may be tailored to each particular user. For example, the parameterization may be fixed based on the registered identity (e.g., to better cover the person's head size or nose shape), or the parameterization may be based on the current expression (e.g., the parameterization may become longer when the mouth is open). In an exemplary implementation, the method 1100 may further include: obtaining additional sensor data of the user associated with a second time period; updating the set of values representing the user based on the additional sensor data for the second time period; and providing the updated set of values, wherein the depiction of the user is updated for the second time period based on the updated set of values (e.g., the set of values is updated based on the current expression, such that the parameterization becomes longer when the mouth is open).
In some implementations, generating the set of values representing the user is based on a machine learning model trained to generate the set of values. For example, the process for generating the representation data 934 of the predetermined representation instruction set 930 and/or the parameterized data of the frame-specific representation instruction set 940 is provided by a machine learning model (e.g., a trained neural network) that identifies patterns in textures (or other features) in the registration data 922 and the sensor data 915 (live image data, such as image 916). In addition, machine learning models may be used to match these patterns with learned patterns (such as smiles, frowns, speaking, etc.) corresponding to user 25. For example, when a smile pattern is determined from tooth exposure, other portions of the face (e.g., cheeks, eyebrows, etc.) that also change when the user smiles may be determined as well. In some implementations, the techniques described herein may learn patterns specific to the particular user 25 of fig. 1.
In some implementations, obtaining the frame-specific 3D representation may involve adjusting the positions of some of the vertices of the predetermined 3D representation, and then applying texture/color based on each of the frame-specific 3D representations. This may deform and color the predetermined sub-portion (e.g., face) to correspond to the real-time shape and color of the portion (e.g., face) of the user at each moment.
In some implementations that use a parameterized grid (e.g., a height field map) to define a frame-specific 3D representation, the combination of data may be facilitated by mapping vertices of the predetermined 3D representation to locations on the parameterized grid based on the 3D reference points (e.g., 3D reference points 602 and 624 of fig. 6). Mapping using the 3D reference points enables frame-specific face data specified on the parameterized mesh to be used directly to adjust the positions of the vertices of the predetermined 3D representation. In some implementations, the position of a vertex can be adjusted by blending the predetermined vertex position with its frame-specific data vertex position (e.g., using a specified alpha value). In other words, the predetermined 3D representation vertices may be mapped onto the parameterized mesh, the parameterized mesh may be adjusted using live face data, and the adjusted parameterized mesh represents a combined 3D representation in which the predetermined 3D representation of the user is combined with one of the frame-specific 3D representations. In some implementations, the mapping may be determined during the registration process.
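For illustration, the following sketch shows one way the per-vertex blending described above could be performed, assuming the vertex-to-grid mapping was established during enrollment; the array shapes and names are assumptions, not the disclosed data structures.

```python
# A minimal sketch of blending each mapped vertex of the predetermined 3D representation
# toward its frame-specific position on the parameterized mesh using a per-vertex alpha.
import numpy as np

def blend_vertices(predetermined_vertices, grid_positions_3d, vertex_to_grid, alpha):
    """
    predetermined_vertices: (N, 3) vertex positions of the predetermined 3D representation
    grid_positions_3d:      (H, W, 3) frame-specific 3D positions on the parameterized mesh
    vertex_to_grid:         (N, 2) integer (row, col) grid cell for each mapped vertex
    alpha:                  (N,) blend weights in [0, 1]; 1 = fully frame-specific
    """
    frame_positions = grid_positions_3d[vertex_to_grid[:, 0], vertex_to_grid[:, 1]]  # (N, 3)
    a = alpha[:, None]
    return (1.0 - a) * predetermined_vertices + a * frame_positions
```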
In some implementations, combining the predetermined 3D representation with the corresponding frame-specific 3D representation includes adjusting a sub-portion (e.g., a face portion) of the predetermined 3D representation. In some implementations, adjusting the sub-portion of the predetermined 3D representation includes adjusting a position of a vertex of the predetermined 3D representation and applying the texture based on each of the frame-specific 3D representations. For example, the adjustment may deform and color a predetermined sub-portion (e.g., a face) to correspond to the real-time shape and color of the portion (e.g., face) of the user at each moment in time.
In some implementations, the method 1100 may further include presenting a view of the combined 3D representation. In some implementations, as shown in fig. 10, the presenting occurs during a communication session in which a second device (e.g., device 1065) captures sensor data (e.g., image data of user 1060 and a portion of environment 1050) and provides a sequence of frame-specific 3D representations corresponding to a plurality of moments in time in a period of time based on the sensor data. For example, the second device 1065 provides/transmits the sequence of frame-specific 3D representations to the device 1010, and the device 1010 generates combined 3D representations to display a live 3D video-like facial depiction of the user 1060 (e.g., a realistically moving avatar, such as the representation 1040 of the user 1060). Alternatively, in some implementations, the second device provides a predetermined 3D representation of the user (e.g., representation 1040 of user 1060) during the communication session (e.g., a realistic-looking moving avatar). For example, the combined representation is determined at the device 1065 and sent to the device 1010. In some implementations, the views of the combined 3D representation are displayed on a device (e.g., device 1010) in real time relative to the multiple moments in time. For example, the depiction of the user is displayed in real time and based on live lighting data (e.g., an avatar shown to the second user on a display of the second user's second device).
In some implementations, providing the set of values includes transmitting a sequence of frames of 3D video data during a communication session with a second device, the sequence of frames including frames containing the set of values, wherein the second device presents an animated depiction of the user based on the sequence of frames of 3D video data. For example, the sets of values may be transmitted as frames of 3D video data during a communication session with another device, and the other device uses the sets of values (along with information on how to interpret the depth values) to present a view of the user's face. Additionally or alternatively, successive frames of face data (sets of values representing the 3D shape and appearance of the user's face at different points in time) may be transmitted and used to display a live 3D video-like facial depiction.
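As an illustration only, the following sketch shows one possible (assumed) way to pack and unpack a frame of 3D video data containing the set of values; the wire format (a compressed NumPy archive per frame) is an assumption for illustration and not the disclosed transmission format.

```python
# A sketch of packing one frame of the set of values for transmission during a
# communication session; the wire format here is purely an illustrative assumption.
import io
import numpy as np

def pack_frame(appearance: np.ndarray, depth: np.ndarray, reference_point: np.ndarray,
               timestamp: float) -> bytes:
    buf = io.BytesIO()
    np.savez_compressed(buf, appearance=appearance, depth=depth,
                        reference_point=reference_point, timestamp=np.float64(timestamp))
    return buf.getvalue()

def unpack_frame(payload: bytes) -> dict:
    with np.load(io.BytesIO(payload)) as data:
        return {key: data[key] for key in data.files}
```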
In some implementations, the user's depiction may include enough data to enable a stereoscopic view (e.g., left/right eye view) of the user so that the face may be perceived at a depth. In one implementation, the depiction of the face includes a 3D model of the face, and views of the representation as seen from the left eye position and the right eye position are generated to provide a stereoscopic view of the face.
In some implementations, certain portions of the face (such as the eyes and mouth) that may be important to convey a realistic or faithful appearance may be generated differently than other portions of the face. For example, portions of the face that may be important for conveying a realistic or faithful appearance may be based on current camera data, while other portions of the face may be based on previously obtained (e.g., registered) face data.
In some implementations, the representation of the face is generated with textures, colors, and/or geometries of the various face portions along with per-frame confidence values, based on the depth values and appearance values, that identify how confident the generation technique is that such textures, colors, and/or geometries accurately correspond to the true textures, colors, and/or geometries of those face portions. In some implementations, the depiction is a 3D avatar. For example, the representation is a 3D model representing the user (e.g., user 25 of fig. 1).
In some implementations, the predetermined 3D representation and/or frame specific 3D representation sequence is based on obtaining sensor data of the user. For example, sensor data (e.g., live data, such as video content including light intensity data (RGB) and depth data) is associated with a point in time, such as images from inward/downward facing sensors (e.g., sensors 35a, 35b, 35c shown in fig. 1) are associated with frames when the user wears the HMD. In some implementations, the sensor data includes depth data (e.g., infrared, time of flight, etc.) and light intensity image data obtained during the scanning process.
In some implementations, obtaining sensor data may include obtaining a first set of data (e.g., registration data, such as the registration image data 710 of fig. 7) corresponding to features of the user's face (e.g., texture, muscle activation, shape, depth, etc.) from a device in a plurality of configurations. In some implementations, the first set of data includes unobstructed image data of the face of the user. For example, images of the face may be captured when the user smiles, lifts an eyebrow, swells a cheek, and the like. In some implementations, the registration data may be obtained by having the user remove the device (e.g., HMD) and capture images while the device does not cover the face, or by using another device (e.g., a mobile device) while the user is not wearing the device (e.g., HMD). In some implementations, the registration data (e.g., the first set of data) is acquired from light intensity images (e.g., RGB images). The registration data may include texture, muscle activation, and the like for most, if not all, of the user's face. In some implementations, the enrollment data may be captured while different instructions for acquiring different poses of the user's face are provided to the user. For example, a user interface guide may instruct the user to "lift your eyebrows," "smile," "frown," etc., to provide the system with a range of facial features for the registration process.
In some implementations, obtaining sensor data may include obtaining a second set of data corresponding to one or more partial views of the face from one or more image sensors while the user is using (e.g., wearing) an electronic device (e.g., HMD). For example, the obtained sensor data includes the live image data 720 of fig. 7. In some implementations, the second set of data includes partial images of the user's face and thus may not represent all of the facial features represented in the enrollment data. For example, the second set of images may include images of some of the forehead/eyebrows from an upward-facing sensor (e.g., sensor 35a of fig. 1) (e.g., facial feature characteristic data 40a). Additionally or alternatively, the second set of images may include images of some of the eyes from an inward-facing sensor (e.g., sensor 35a of fig. 1) (e.g., eye gaze characteristic data 40b). Additionally or alternatively, the second set of images may include images of some of the cheeks, mouth, and mandible from a downward-facing sensor (e.g., sensor 35c of fig. 1) (e.g., facial feature characteristic data 40c). In some implementations, the electronic device includes a first sensor (e.g., sensor 35a of fig. 1) and a second sensor (e.g., sensor 35c of fig. 1), wherein the second set of data (e.g., multiple IFC cameras used to capture different viewpoints of the user's face and body motion) is obtained from at least one partial image of the user's face from a first viewpoint (e.g., upper facial characteristic data 40a) from the first sensor and at least one partial image of the user's face from a second viewpoint (e.g., lower facial characteristic data 40c) from the second sensor, the second viewpoint being different from the first viewpoint.
In some implementations, the method 1100 may be repeated for each frame captured during each moment/frame of a live communication session or other experience. For example, for each iteration, while the user is using the device (e.g., wearing an HMD), method 1100 may involve continuously obtaining live sensor data (e.g., eye gaze characteristic data and facial feature data), and for each frame, updating a displayed portion of the representation based on updated parameterized values (e.g., RGBDA values) of the frame-specific 3D representation sequence. For example, for each new frame, the system may update the parameterized value to update the display of the 3D avatar based on the new data.
Fig. 12 is a block diagram of an exemplary device 1200. Device 1200 illustrates an exemplary device configuration of the devices described herein (e.g., device 10, device 1065, etc.). While certain specific features are shown, those of ordinary skill in the art will appreciate from the disclosure that various other features are not shown for brevity and so as not to obscure more pertinent aspects of the implementations disclosed herein. To this end, as a non-limiting example, in some implementations the device 1200 includes one or more processing units 1202 (e.g., microprocessors, ASICs, FPGAs, GPUs, CPUs, processing cores, and the like), one or more input/output (I/O) devices and sensors 1206, one or more communication interfaces 1208 (e.g., USB, FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE 802.11x, IEEE 802.16x, GSM, CDMA, TDMA, GPS, IR, BLUETOOTH, ZIGBEE, SPI, I2C, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 1210, one or more displays 1212, one or more inward- and/or outward-facing image sensor systems 1214, memory 1220, and one or more communication buses 1204 for interconnecting these and various other components.
In some implementations, one or more of the communication buses 1204 include circuitry that interconnects and controls communications between system components. In some implementations, the one or more I/O devices and sensors 1206 include at least one of: an Inertial Measurement Unit (IMU), accelerometer, magnetometer, gyroscope, thermometer, one or more physiological sensors (e.g., blood pressure monitor, heart rate monitor, blood oxygen sensor, blood glucose sensor, etc.), one or more microphones, one or more speakers, a haptic engine, or one or more depth sensors (e.g., structured light, time of flight, etc.), and the like.
In some implementations, the one or more displays 1212 are configured to present a view of the physical environment or graphical environment to a user. In some implementations, the one or more displays 1212 correspond to holographic, digital Light Processing (DLP), liquid Crystal Displays (LCD), liquid crystal on silicon (LCoS), organic light emitting field effect transistors (OLET), organic Light Emitting Diodes (OLED), surface conduction electron emitter displays (SED), field Emission Displays (FED), quantum dot light emitting diodes (QD-LED), microelectromechanical systems (MEMS), and/or similar display types. In some implementations, the one or more displays 1212 correspond to diffractive, reflective, polarizing, holographic, etc. waveguide displays. For example, the device 10 includes a single display. As another example, the device 10 includes a display for each eye of the user.
In some implementations, the one or more image sensor systems 1214 are configured to obtain image data corresponding to at least a portion of the physical environment 105. For example, the one or more image sensor systems 1214 include one or more RGB cameras (e.g., with Complementary Metal Oxide Semiconductor (CMOS) image sensors or Charge Coupled Device (CCD) image sensors), monochrome cameras, IR cameras, depth cameras, event based cameras, and the like. In various implementations, the one or more image sensor systems 1214 also include an illumination source, such as a flash, that emits light. In various implementations, the one or more image sensor systems 1214 also include an on-camera Image Signal Processor (ISP) configured to perform a plurality of processing operations on the image data.
Memory 1220 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices. In some implementations, the memory 1220 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 1220 optionally includes one or more storage devices remotely located from the one or more processing units 1202. Memory 1220 includes a non-transitory computer-readable storage medium.
In some implementations, memory 1220 or a non-transitory computer-readable storage medium of memory 1220 stores an optional operating system 1230 and one or more instruction sets 1240. Operating system 1230 includes processes for handling various basic system services and for performing hardware-related tasks. In some implementations, the instruction set 1240 includes executable software defined by binary information stored in a charged form. In some implementations, the instruction set 1240 is software that can be executed by the one or more processing units 1202 to implement one or more of the techniques described herein.
Instruction set 1240 includes a registration instruction set 1242, a predetermined representation instruction set 1244, a frame specific representation instruction set 1246, and a combined representation instruction set 1248. The instruction set 1240 may be embodied as a single software executable or as a plurality of software executable files.
In some implementations, the registration instruction set 1242 is executable by the processing unit 1202 to generate registration data from image data. The registration instruction set 1242 (e.g., registration instruction set 820 of fig. 8) may be configured to provide instructions to the user to gather image information to generate a registration avatar (e.g., registration image data 824) and to determine whether additional image information is needed to generate an accurate registration avatar to be used by the avatar display process. For these purposes, in various implementations, the instructions include instructions and/or logic for the instructions as well as heuristics and metadata for the heuristics.
In some implementations, the predetermined presentation instruction set 1244 (e.g., the predetermined presentation instruction set 930 of fig. 9) can be executed by the processing unit 1202 to generate a 3D representation (e.g., extinguished PIFu information) of the user based on the enrollment data by one or more of the techniques discussed herein or another potentially suitable technique. For these purposes, in various implementations, the instructions include instructions and/or logic for the instructions as well as heuristics and metadata for the heuristics.
In some implementations, the frame-specific representation instruction set 1246 (e.g., the frame-specific representation instruction set 940 of fig. 9) can be executed by the processing unit 1202 to parameterize the facial features and eye gaze characteristics of the user (e.g., generate appearance values and depth values) using one or more of the techniques discussed herein or other techniques that may be appropriate. For these purposes, in various implementations, the instructions include instructions and/or logic for the instructions as well as heuristics and metadata for the heuristics.
In some implementations, the combined representation instruction set 1248 (e.g., the combined representation instruction set 950 of fig. 9) can be executed by the processing unit 1202 to generate and display a combined representation (e.g., a 3D avatar) of the user's face based on the predetermined representation (e.g., PIFu data) and the second data set (e.g., parameterized data). For these purposes, in various implementations, the instructions include instructions and/or logic for the instructions as well as heuristics and metadata for the heuristics.
While instruction set 1240 is shown as residing on a single device, it should be understood that in other implementations, any combination of elements may be located in separate computing devices. In addition, FIG. 12 is intended to serve more as a functional description of the various features that may be present in a particular implementation, as opposed to the structural schematic of the implementations described herein. As will be appreciated by one of ordinary skill in the art, the individually displayed items may be combined and some items may be separated. The actual number of instruction sets, and how features are distributed among them, will vary depending upon the particular implementation, and may depend in part on the particular combination of hardware, software, and/or firmware selected for the particular implementation.
Fig. 13 illustrates a block diagram of an exemplary head mounted device 1300 in accordance with some implementations. The head mounted device 1300 includes a housing 1301 (or shell) that houses the various components of the head mounted device 1300. Housing 1301 includes (or is coupled to) an eye pad (not shown) disposed at the proximal (user 25) end of housing 1301. In various implementations, the eye pad is a plastic or rubber piece that comfortably and snugly holds the head-mounted device 1300 in place on the face of the user 25 (e.g., around the eyes of the user 25).
Housing 1301 houses a display 1310 that displays an image, emitting light toward or onto the eyes of user 25. In various implementations, display 1310 emits light through an eyepiece having one or more optical elements 1305 that refract the light emitted by display 1310, causing the display to appear to user 25 to be at a virtual distance greater than the actual distance from the eye to display 1310. For example, the optical element 1305 may include one or more lenses, waveguides, other diffractive optical elements (DOEs), and the like. To enable user 25 to focus on display 1310, in various implementations, the virtual distance is at least greater than a minimum focal distance of the eye (e.g., 7 cm). Furthermore, in order to provide a better user experience, in various implementations, the virtual distance is greater than 1 meter.
Housing 1301 also houses a tracking system that includes one or more light sources 1322, a camera 1324, a camera 1332, a camera 1334, and a controller 1380. The one or more light sources 1322 emit light onto the eyes of user 25 that is reflected as a pattern of light (e.g., glints) detectable by the camera 1324. Based on the light pattern, controller 1380 may determine eye-tracking characteristics of user 25. For example, controller 1380 may determine a gaze direction and/or a blink state (open or closed) of user 25. As another example, controller 1380 may determine a pupil center, a pupil size, or a point of interest. Thus, in various implementations, light is emitted by the one or more light sources 1322, reflects from the eyes of user 25, and is detected by camera 1324. In various implementations, light from the eyes of user 25 is reflected from a hot mirror or passes through an eyepiece before reaching camera 1324.
The display 1310 emits light in a first wavelength range, and the one or more light sources 1322 emit light in a second wavelength range. Similarly, the camera 1324 detects light in the second wavelength range. In various implementations, the first wavelength range is a visible wavelength range (e.g., a wavelength range of approximately 400 nm-700 nm within the visible spectrum) and the second wavelength range is a near-infrared wavelength range (e.g., a wavelength range of approximately 700 nm-1400 nm within the near-infrared spectrum).
In various implementations, eye tracking (or, in particular, the determined gaze direction) is used to enable user interaction (e.g., user 25 selects an option on display 1310 by looking at it), to provide foveated rendering (e.g., presenting a higher resolution in the area of display 1310 that user 25 is looking at and a lower resolution elsewhere on display 1310), or to correct distortions (e.g., for images to be provided on display 1310). In various implementations, the one or more light sources 1322 emit light toward the eyes of user 25, which is reflected in the form of a plurality of glints.
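A gaze-driven foveated-rendering policy of the kind described above can be sketched as follows; the resolution tiers and pixel radii are assumptions for illustration, not values taken from this disclosure.

    def resolution_scale_for_pixel(px, py, gaze_x, gaze_y,
                                   fovea_radius=200, periphery_radius=600):
        # Return a render-resolution scale factor for a display pixel based on its
        # distance (in pixels) from the current gaze point.
        dist = ((px - gaze_x) ** 2 + (py - gaze_y) ** 2) ** 0.5
        if dist <= fovea_radius:
            return 1.0   # full resolution where the user is looking
        if dist <= periphery_radius:
            return 0.5   # reduced resolution in the intermediate ring
        return 0.25      # lowest resolution in the far periphery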
In various implementations, the camera 1324 is a frame/shutter based camera that generates, at a frame rate, images of the eyes of user 25 at a particular point in time or at multiple points in time. Each image comprises a matrix of pixel values corresponding to pixels of the image, which correspond to locations of the light sensor matrix of the camera. In various implementations, each image is used to measure or track pupil dilation by measuring a change in pixel intensities associated with one or both of the user's pupils.
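One simple way to derive such a measure, sketched here with an assumed darkness threshold (the disclosure does not specify how the measurement is performed), is to count dark pupil pixels per frame and normalize against a baseline frame.

    import numpy as np

    def pupil_area_pixels(eye_frame, pupil_threshold=40):
        # Count dark pixels in an IR eye image as a proxy for pupil area.
        return int(np.count_nonzero(eye_frame <= pupil_threshold))

    def relative_dilation(frames, pupil_threshold=40):
        # Relative pupil dilation over a sequence of frames, normalized to the
        # first frame: values above 1.0 indicate dilation, below 1.0 constriction.
        areas = [pupil_area_pixels(f, pupil_threshold) for f in frames]
        baseline = max(areas[0], 1)
        return [a / baseline for a in areas]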
In various implementations, the camera 1324 is an event camera that includes a plurality of light sensors (e.g., a matrix of light sensors) at a plurality of respective locations and that, in response to a particular light sensor detecting a change in light intensity, generates an event message indicating the particular location of the particular light sensor.
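Such an event message can be modeled minimally as follows; the field names are assumptions chosen for illustration.

    from dataclasses import dataclass

    @dataclass
    class EventMessage:
        # Reports that the light sensor at (x, y) in the sensor matrix observed a
        # change in light intensity at the given time.
        x: int             # column of the light sensor
        y: int             # row of the light sensor
        polarity: int      # +1 for an intensity increase, -1 for a decrease
        timestamp_us: int  # timestamp of the change, in microseconds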
In various implementations, the cameras 1332 and 1334 are frame/shutter based cameras that may generate images of the face of the user 25 at a particular point in time or points in time at a frame rate. For example, camera 1332 captures an image of the user's face below the eyes, and camera 1334 captures an image of the user's face above the eyes. The images captured by the cameras 1332 and 1334 may include light intensity images (e.g., RGB) and/or depth image data (e.g., time of flight, infrared, etc.).
It should be understood that the implementations described above are cited by way of example, and that the present disclosure is not limited to what has been particularly shown and described hereinabove. Rather, the scope includes both combinations and subcombinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.
As described above, one aspect of the present technology is to collect and use physiological data to improve the user's electronic device experience in interacting with electronic content. The present disclosure contemplates that in some cases, the collected data may include personal information data that uniquely identifies a particular person or that may be used to identify an interest, characteristic, or predisposition of a particular person. Such personal information data may include physiological data, demographic data, location-based data, telephone numbers, email addresses, home addresses, device characteristics of personal devices, or any other personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be used to benefit users. For example, the personal information data may be used to improve the interaction and control capabilities of the electronic device. Accordingly, the use of such personal information data enables calculated control of the electronic device. In addition, the present disclosure contemplates other uses of personal information data that are beneficial to the user.
The present disclosure also contemplates that entities responsible for the collection, analysis, disclosure, transmission, storage, or other use of such personal information and/or physiological data will adhere to established privacy policies and/or privacy practices. In particular, such entities should exercise and adhere to privacy policies and practices that are recognized as meeting or exceeding industry or government requirements for maintaining the privacy and security of personal information data. For example, personal information from a user should be collected for legal and legitimate uses of an entity and not shared or sold outside of those legal uses. In addition, such collection should be done only after the user's informed consent. In addition, such entities should take any required steps to secure and protect access to such personal information data and to ensure that other people who are able to access the personal information data adhere to their privacy policies and procedures. In addition, such entities may subject themselves to third party evaluations to prove compliance with widely accepted privacy policies and practices.
Regardless of the foregoing, the present disclosure also contemplates implementations in which a user selectively blocks the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, with respect to content delivery services customized for a user, the present technology may be configured to allow the user to select to "opt in" or "opt out" of participation in the collection of personal information data during registration for services. In another example, the user may choose not to provide personal information data for targeted content delivery services. In yet another example, the user may choose not to provide personal information, but permit the transfer of anonymous information for the purpose of improving the functioning of the device.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that the various embodiments can be implemented without accessing such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data. For example, content may be selected and delivered to the user by inferring preferences or settings based on non-personal information data or a bare minimum of personal information, such as the content requested by a device associated with the user, other non-personal information available to the content delivery service, or publicly available information.
In some embodiments, the data is stored using a public/private key system that only allows the owner of the data to decrypt the stored data. In some other implementations, the data may be stored anonymously (e.g., without identifying and/or personal information about the user, such as legal name, user name, time and location data, etc.). Thus, other users, hackers, or third parties cannot determine the identity of the user associated with the stored data. In some implementations, a user may access stored data from a user device other than the user device used to upload the stored data. In these cases, the user may need to provide login credentials to access their stored data.
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, it will be understood by those skilled in the art that the claimed subject matter may be practiced without these specific details. In other instances, methods, devices, or systems known by those of ordinary skill have not been described in detail so as not to obscure the claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," or "identifying" or the like, refer to the action or processes of a computing device, such as one or more computers or similar electronic computing devices, that manipulate or transform data represented as physical, electronic, or magnetic quantities within the computing platform's memory, registers, or other information storage device, transmission device, or display device.
The one or more systems discussed herein are not limited to any particular hardware architecture or configuration. The computing device may include any suitable arrangement of components that provide results conditioned on one or more inputs. Suitable computing devices include a multi-purpose microprocessor-based computer system that accesses stored software that programs or configures the computing system from a general-purpose computing device to a special-purpose computing device that implements one or more implementations of the subject invention. The teachings contained herein may be implemented in software for programming or configuring a computing device using any suitable programming, scripting, or other type of language or combination of languages.
Implementations of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the above examples may be varied, e.g., the blocks may be reordered, combined, or divided into sub-blocks. Some blocks or processes may be performed in parallel.
The use of "adapted" or "configured to" herein is meant to be an open and inclusive language that does not exclude devices adapted or configured to perform additional tasks or steps. In addition, the use of "based on" is intended to be open and inclusive in that a process, step, calculation, or other action "based on" one or more of the stated conditions or values may be based on additional conditions or beyond the stated values in practice. Headings, lists, and numbers included herein are for ease of explanation only and are not intended to be limiting.
It will also be understood that, although the terms "first," "second," etc. may be used herein to describe various objects, these objects should not be limited by these terms. These terms are only used to distinguish one object from another. For example, a first node could be termed a second node, and, similarly, a second node could be termed a first node, without changing the meaning of the description, so long as all occurrences of the "first node" are renamed consistently and all occurrences of the "second node" are renamed consistently. The first node and the second node are both nodes, but they are not the same node.
The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term "if" may be interpreted to mean "when the prerequisite is true" or "in response to a determination" or "upon a determination" or "in response to detecting" that the prerequisite is true, depending on the context. Similarly, the phrase "if it is determined that the prerequisite is true" or "if it is true" or "when it is true" is interpreted to mean "when it is determined that the prerequisite is true" or "in response to a determination" or "upon determination" that the prerequisite is true or "when it is detected that the prerequisite is true" or "in response to detection that the prerequisite is true", depending on the context.
The foregoing description and summary of the invention should be understood to be in every respect illustrative and exemplary, but not limiting, and the scope of the invention disclosed herein is to be determined not by the detailed description of illustrative implementations, but by the full breadth permitted by the patent laws. It is to be understood that the specific implementations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims (20)

1. A method, comprising:
At a processor of a device:
Obtaining a first user representation of at least a first portion of a user, wherein the first user representation is generated via a first technique based on first image data obtained in a first physical environment via a first set of sensors, and the first user representation represents the user under a first lighting condition;
Obtaining a second user representation of at least a second portion of the user, wherein the second user representation is generated by:
Generating an initial user representation of the user based on second image data obtained via a second set of sensors in a second physical environment having a second lighting condition;
de-lighting the initial user representation based on an illumination representation of the second physical environment; and
Generating the second user representation by re-lighting the de-lit initial user representation based on the first lighting condition; and
Generating a combined user representation based on the first user representation and the second user representation.
2. The method of claim 1, wherein the first user representation comprises texture data generated via a machine learning model that is trained using training data obtained via one or more sensors in one or more environments having the first lighting condition.
3. The method of claim 2, wherein the first lighting condition is provided in the one or more environments using a plurality of lights positioned in a fixed positional relationship to provide uniform light distribution over a face of a training subject.
4. The method of claim 1, wherein the illumination representation of the second physical environment comprises an omnidirectional image representation of the second lighting condition of the second physical environment.
5. The method of claim 4, wherein the omnidirectional image representation of the second lighting condition of the second physical environment is generated by a machine learning model based on the second image data obtained via sensor data.
6. The method of claim 1, wherein re-lighting the second user representation comprises matching a lighting attribute of the second lighting condition with a lighting attribute of the first user representation.
7. The method of claim 1, wherein the first lighting condition of the first physical environment is different from the second lighting condition of the second physical environment.
8. The method of claim 1, further comprising:
providing a view of an adjusted combined user representation in a three-dimensional (3D) environment, wherein the adjusted combined user representation is generated by adjusting the combined user representation based on at least one of one or more color attributes or one or more light attributes of the 3D environment.
9. The method of claim 1, wherein the first physical environment is different from the second physical environment.
10. The method of claim 1, wherein the first portion comprises a representation of the face and hair of the user, and the second portion represents other portions of the user that are different from the first portion.
11. The method of claim 1, wherein the combined user representation is a three-dimensional (3D) user representation.
12. An apparatus, comprising:
A non-transitory computer readable storage medium; and
One or more processors coupled to the non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium includes program instructions that, when executed on the one or more processors, cause the one or more processors to perform operations comprising:
Obtaining a first user representation of at least a first portion of a user, wherein the first user representation is generated via a first technique based on first image data obtained in a first physical environment via a first set of sensors, and the first user representation represents the user under a first lighting condition;
Obtaining a second user representation of at least a second portion of the user, wherein the second user representation is generated by:
Generating an initial user representation of the user based on second image data obtained via a second set of sensors in a second physical environment having a second lighting condition;
de-lighting the initial user representation based on an illumination representation of the second physical environment; and
Generating the second user representation by re-lighting the de-lit initial user representation based on the first lighting condition; and
Generating a combined user representation based on the first user representation and the second user representation.
13. The apparatus of claim 12, wherein the first user representation comprises texture data generated via a machine learning model that is trained using training data obtained via one or more sensors in one or more environments having the first lighting conditions.
14. The apparatus of claim 13, wherein the first lighting condition is provided in the one or more environments using a plurality of lights positioned in a fixed positional relationship to provide uniform light distribution over a face of a training subject.
15. The apparatus of claim 12, wherein the illumination representation of the second physical environment comprises an omnidirectional image representation of the second lighting condition of the second physical environment.
16. The apparatus of claim 15, wherein the omnidirectional image representation of the second lighting condition of the second physical environment is generated by a machine learning model based on the second image data obtained via sensor data.
17. The apparatus of claim 12, wherein re-lighting the second user representation comprises matching a lighting attribute of the second lighting condition with a lighting attribute of the first user representation.
18. The apparatus of claim 12, wherein the first lighting condition of the first physical environment is different from the second lighting condition of the second physical environment.
19. The apparatus of claim 12, wherein the non-transitory computer-readable storage medium comprises program instructions that, when executed on the one or more processors, further cause the one or more processors to perform operations comprising:
providing a view of an adjusted combined user representation in a three-dimensional (3D) environment, wherein the adjusted combined user representation is generated by adjusting the combined user representation based on at least one of one or more color attributes or one or more light attributes of the 3D environment.
20. A non-transitory computer-readable storage medium storing program instructions executable on a device to perform operations comprising:
Obtaining a first user representation of at least a first portion of a user, wherein the first user representation is generated via a first technique based on first image data obtained in a first physical environment via a first set of sensors, and the first user representation represents the user under a first lighting condition;
Obtaining a second user representation of at least a second portion of the user, wherein the second user representation is generated by:
Generating an initial user representation of the user based on second image data obtained via a second set of sensors in a second physical environment having a second lighting condition;
de-lighting the initial user representation based on an illumination representation of the second physical environment; and
Generating the second user representation by re-lighting the de-lit initial user representation based on the first lighting condition; and
Generating a combined user representation based on the first user representation and the second user representation.
CN202311824574.6A 2022-12-29 2023-12-28 Combining light normalization in 3D user representations Pending CN118279472A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/435,829 2022-12-29
US18/389,319 2023-11-14
US18/389,319 US20240221292A1 (en) 2022-12-29 2023-11-14 Light normalization in combined 3d user representations

Publications (1)

Publication Number Publication Date
CN118279472A true CN118279472A (en) 2024-07-02

Family

ID=91649105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311824574.6A Pending CN118279472A (en) 2022-12-29 2023-12-28 Combining light normalization in 3D user representations

Country Status (1)

Country Link
CN (1) CN118279472A (en)

Similar Documents

Publication Publication Date Title
US11733769B2 (en) Presenting avatars in three-dimensional environments
KR20160135652A (en) Image processing for Head mounted display devices
CN114175045B (en) Determining user characteristics using brightness variations
US20240077937A1 (en) Devices, methods, and graphical user interfaces for controlling avatars within three-dimensional environments
US20230171484A1 (en) Devices, methods, and graphical user interfaces for generating and displaying a representation of a user
US20220262080A1 (en) Interfaces for presenting avatars in three-dimensional environments
US20230290082A1 (en) Representation of users based on current user appearance
CN114514563A (en) Creating optimal work, learning, and rest environments on electronic devices
US20230290096A1 (en) Progressive body capture of user body for building an avatar of user
US20230384860A1 (en) Devices, methods, and graphical user interfaces for generating and displaying a representation of a user
US20240221292A1 (en) Light normalization in combined 3d user representations
US11908098B1 (en) Aligning user representations
CN118279472A (en) Combining light normalization in 3D user representations
EP4300447A1 (en) User representation using depths relative to multiple surface points
CN117333588A (en) User representation using depth relative to multiple surface points
US20240212343A1 (en) Contextualized visual search
US20230288985A1 (en) Adjusting image content to improve user experience
US20240104859A1 (en) User interfaces for managing live communication sessions
US20240103617A1 (en) User interfaces for gaze tracking enrollment
WO2023096940A2 (en) Devices, methods, and graphical user interfaces for generating and displaying a representation of a user
WO2024054433A2 (en) Devices, methods, and graphical user interfaces for controlling avatars within three-dimensional environments
CN118401910A (en) Apparatus, method and graphical user interface for generating and displaying representations of users
CN116868152A (en) Interface for rendering avatars in a three-dimensional environment
EP4275108A1 (en) Interfaces for presenting avatars in three-dimensional environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination