CN118251699A - Geometry-free neural scene representation for efficient object-centric novel view synthesis - Google Patents

Geometry-free neural scene representation for efficient object-centric novel view synthesis

Info

Publication number
CN118251699A
Authority
CN
China
Prior art keywords
model, computing system, scene, machine learning, computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280075251.9A
Other languages
Chinese (zh)
Inventor
S. M. M. Sajjadi
H. Meyer
E. F. R. Pot
U. M. Bergmann
K. Greff
N. Radwan
S. D-R. Vora
M. Lučić
D. C. Duckworth
T. A. Funkhouser
A. Tagliasacchi
T. Kipf
F. Pavetić
L. J. Guibas
A. Mahendran
S. J. van Steenkiste
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of CN118251699A


Abstract

A machine learning model is provided that generates a geometry-free neural scene representation for efficient object-centric novel view synthesis. In particular, one example aspect of the present disclosure provides a new framework in which an encoder model (e.g., an encoder transformer network) processes one or more RGB images (with or without pose) to produce a fully latent scene representation that can be passed to a decoder model (e.g., a decoder transformer network). Given one or more target poses, the decoder model can synthesize an image in a single forward pass. In some example implementations, because transformers are used instead of convolutional or MLP networks, the encoder can learn an attention model that extracts enough 3D information about the scene from a small set of images to render novel views with correct projection, parallax, occlusion, and even semantics, without explicit geometry.

Description

Geometry-free neural scene representation for efficient object-centric novel view synthesis
Priority claim
The present application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/279,875, filed November 16, 2021, and U.S. Provisional Patent Application No. 63/343,882, filed May 19, 2022, each of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to machine learning. More particularly, the present disclosure relates to training machine learning models to generate geometry-free neural scene representations through novel view synthesis.
Background
A classical problem in computer vision is to infer a three-dimensional (3D) scene representation from one or more images (e.g., so that the scene representation can be used to render novel views at interactive rates). Previous methods reconstruct either explicit 3D representations (e.g., textured meshes) or implicit representations (e.g., radiance fields). However, they typically require input images with accurate camera poses and a long processing time for each new scene.
Conventional approaches have built explicit 3D representations such as colored point clouds, meshes, voxels, octrees, and multi-plane images. While these are efficient for interactive rendering, they often require expensive and fragile reconstruction processes and produce discrete representations of limited resolution.
Recent studies have explored purely implicit scene representations. For example, a neural radiance field (NeRF) trains a multi-layer perceptron (MLP) that produces density and outgoing RGB radiance for any 5D ray, from which novel views can be synthesized by volume rendering. However, NeRF requires a very expensive training and rendering process, because a model is learned independently for each scene and many MLP evaluations are required to volume render each ray.
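For context only, the volume rendering referred to above composites many sampled densities and colors along each ray; a standard formulation from the NeRF literature (not a formula recited by the present disclosure) is

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i, \qquad T_i = \exp\Bigl(-\sum_{j<i}\sigma_j \delta_j\Bigr),$$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted by the MLP at the i-th sample along ray $\mathbf{r}$ and $\delta_i$ is the spacing between adjacent samples. Each rendered pixel thus requires on the order of N MLP evaluations, which is precisely the per-ray cost that the approaches described herein avoid.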
A Light Field Network (LFN) trains an MLP to produce RGB radiance for a 4D ray input and learns a prior over scene representations by training on a database of examples using a meta-learning framework. However, LFNs require accurate camera poses and an expensive auto-decoder optimization procedure for each new scene, and have only been demonstrated on synthetic images of isolated ShapeNet objects.
In particular, the development of an object-centric, geometric understanding of the world is considered a cornerstone of human cognition. Replicating these capabilities in machine learning models has been a major focus in computer vision and related fields. However, the conventional supervised learning paradigm presents several challenges. For example, explicit supervision requires large amounts of carefully annotated data and is hindered by obstacles such as rare or novel object categories. Furthermore, obtaining accurate ground-truth 3D scene and object geometry is extremely challenging. Thus, there is a need for an efficient method of implementing a machine learning model for object-centric novel view synthesis.
Disclosure of Invention
Aspects and advantages of embodiments of the disclosure will be set forth in part in the description which follows, or may be learned by practice of the embodiments.
One example aspect relates to a computer-implemented method for more efficiently generating a new view of a scene. The method includes: obtaining, by a computing system comprising one or more computing devices, one or more input images depicting a scene; generating, by the computing system, one or more image embeddings for the one or more input images, respectively; processing, by the computing system, the one or more image embeddings using a machine-learning encoder model to generate a scene embedding representative of the scene; obtaining, by the computing system, ray data describing one or more ray projections for a predicted image of the scene; processing, by the computing system, the scene embedding and the ray data using a machine-learned decoder model to generate composite image data for the one or more ray projections of the predicted image of the scene; and providing, by the computing system, the predicted image of the scene as an output.
In some implementations, one or both of the machine learning encoder model and the machine learning decoder model include a self-attention model.
In some implementations, the machine-learned encoder model and the machine-learned decoder model have been jointly trained using a shared loss function.
In some implementations, at least the machine learning encoder model has been pre-trained using different images depicting different scenes.
In some implementations, generating, by the computing system, the one or more image embeddings for the one or more input images, respectively, includes processing, by the computing system, the one or more input images using a convolutional neural network to generate the one or more image embeddings, respectively.
In some implementations, generating, by the computing system, the one or more image embeddings for the one or more input images, respectively, includes generating, by the computing system, one or more learned position embeddings for the one or more input images.
In some implementations, the machine learning decoder model includes a self-attention model, and processing, by the computing system, the scene embedding and the ray data using the machine learning decoder model includes: generating key and value data elements from the scene embedding; generating query data elements from the ray data; and processing, by the computing system, the key, value, and query data elements using the machine learning decoder model to generate the composite image data for the one or more ray projections.
In some implementations, the composite image data for each ray projection includes color data for a pixel of the predicted image corresponding to that ray projection.
In some implementations, the one or more input images include a plurality of input images captured at a plurality of different poses, respectively, relative to the scene.
In some implementations, the one or more input images include an unposed image having an unspecified pose relative to the scene.
In some implementations, the method includes: evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray projections to ground-truth image data for the one or more ray projections; and modifying, by the computing system, one or more values of one or more parameters of the machine learning decoder model based at least in part on the loss function.
In some implementations, the method includes: evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray projections to ground-truth image data for the one or more ray projections; and modifying, by the computing system, one or more values of one or more parameters of both the machine-learned decoder model and the machine-learned encoder model based at least in part on the loss function.
Another example aspect relates to a computing system for more efficiently generating scene-specific predicted images, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media collectively storing instructions that, when executed by the one or more processors, cause the computing system to perform operations. These operations include: obtaining a scene embedding generated by the machine-learned encoder model from one or more images of the scene, wherein the scene embedding represents the scene; obtaining, by a computing system, ray data describing one or more ray projections for a predicted image of a scene; processing, by the computing system, the scene embedding and the ray data using the machine-learned decoder model to generate composite image data for one or more ray projections of a predicted image of the scene; and providing, by the computing system, as output, a predicted image of the scene.
In some implementations, one or both of the machine learning encoder model and the machine learning decoder model include a self-attention model.
In some implementations, the machine-learned encoder model and the machine-learned decoder model have been jointly trained using a shared loss function.
In some implementations, at least the machine learning encoder model has been pre-trained using different images depicting different scenes.
Another example aspect relates to one or more non-transitory computer-readable media storing instructions that, when executed by a computing system, cause the computing system to perform operations. These operations include: obtaining, by the computing system, one or more input images depicting a scene; generating, by the computing system, one or more image embeddings for the one or more input images, respectively; processing, by the computing system, the one or more image embeddings using a machine-learning encoder model to generate a scene embedding representative of the scene; obtaining, by the computing system, ray data describing one or more ray projections for a predicted image of the scene; processing, by the computing system, the scene embedding and the ray data using a machine-learned decoder model to generate composite image data for the one or more ray projections of the predicted image of the scene; evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray projections to ground-truth image data for the one or more ray projections; and modifying, by the computing system, one or more values of one or more parameters of the machine learning decoder model based at least in part on the loss function.
In some implementations, the operations further include modifying, by the computing system, one or more values of one or more parameters of the machine-learned encoder model based at least in part on the loss function.
In some implementations, one or both of the machine learning encoder model and the machine learning decoder model include a self-attention model.
In some implementations, at least the machine learning encoder model has been pre-trained using different images depicting different scenes.
Another example aspect of the present disclosure relates to a computer-implemented method for efficient object-centric novel view synthesis. The method includes obtaining, by a computing system including one or more computing devices, a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects. The method includes: for each of a plurality of query ray projections respectively associated with a plurality of pixels, processing, by the computing system, the plurality of latent representation encodings and the respective query ray projection with a transformer sub-model of a machine-learned decoding model to generate a feature embedding. The method includes processing, by the computing system, the feature embedding and a fixed ordering of the plurality of latent representation encodings using a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection. The method includes processing, by the computing system, the respective query ray projection and the weighted average using a rendering sub-model of the machine-learned decoding model to obtain a color prediction for a respective pixel of the plurality of pixels.
Another example aspect of the present disclosure relates to a computing system for efficient object-centric novel view synthesis. The computing system includes one or more processors. The computing system includes one or more non-transitory computer-readable media storing instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations include obtaining a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects. The operations include: for each of a plurality of query ray projections respectively associated with a plurality of pixels, processing the plurality of latent representation encodings and the respective query ray projection with a transformer sub-model of a machine-learned decoding model to generate a feature embedding. The operations include processing the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection. The operations include processing the respective query ray projection and the weighted average using a rendering sub-model of the machine-learned decoding model to obtain a color prediction for a respective pixel of the plurality of pixels.
Another example aspect of the disclosure relates to one or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause a computing system to perform operations. The operations include obtaining a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects. The operations include: for each of a plurality of query ray projections respectively associated with a plurality of pixels, processing the plurality of latent representation encodings and the respective query ray projection with a transformer sub-model of a machine-learned decoding model to generate a feature embedding. The operations include processing the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection. The operations include processing the respective query ray projection and the weighted average using a rendering sub-model of the machine-learned decoding model to obtain a color prediction for a respective pixel of the plurality of pixels.
Another example aspect of the present disclosure relates to a computer-implemented method for efficient object-centric novel view synthesis. The method includes processing, by a computing system including one or more computing devices, one or more input images using a machine-learned encoding model to obtain a scene embedding of a scene depicted by the one or more input images, wherein the scene depicts a plurality of objects. The method includes determining, by the computing system, a plurality of latent representation encodings from the scene embedding using a machine-learned attention model, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects. The method includes processing, by the computing system, the plurality of latent representation encodings and a respective query ray projection using a transformer sub-model of a machine-learned decoding model to generate a feature embedding. The method includes processing, by the computing system, the feature embedding and a fixed ordering of the plurality of latent representation encodings using a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection. The method includes processing, by the computing system, the respective query ray projection and the weighted average using a rendering sub-model of the machine-learned decoding model to obtain a color prediction for a respective pixel of a plurality of pixels.
Other aspects of the disclosure relate to various systems, devices, non-transitory computer-readable media, user interfaces, and electronic apparatus.
These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the disclosure and, together with the description, serve to explain the related principles.
Drawings
A detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification with reference to the accompanying drawings, in which:
FIG. 1 depicts a block diagram of an example machine learning model, according to an example embodiment of the present disclosure.
FIG. 2 depicts a block diagram of an example machine learning model, according to an example embodiment of the present disclosure.
Fig. 3A depicts a block diagram of an example computing system, according to an example embodiment of the present disclosure.
Fig. 3B depicts a block diagram of an example computing device, according to an example embodiment of the present disclosure.
Fig. 3C depicts a block diagram of an example computing device, according to an example embodiment of the present disclosure.
Fig. 4 illustrates a block diagram of an example model architecture, according to some embodiments of the present disclosure.
Fig. 5 is a block diagram illustrating an example decoder model.
Fig. 6 illustrates an example novel view image synthesized in accordance with some embodiments of the present disclosure.
Figs. 7 and 8 illustrate example intermediate object-aware outputs according to some embodiments of the present disclosure.
Fig. 9 depicts a flowchart of an example method for generating a geometry-free neural scene representation through novel view synthesis, according to an example embodiment of the present disclosure.
FIG. 10 depicts a flowchart of an example method for performing object-centric novel view synthesis, in accordance with an example embodiment of the present disclosure.
Repeated reference characters among the figures are intended to identify identical features in various implementations.
Detailed Description
Overview
The present disclosure relates generally to machine learning models that generate geometry-free neural scene representations through novel view synthesis. In particular, one example aspect of the present disclosure provides a new framework in which an encoder model (e.g., an encoder transformer network) processes one or more RGB images (with or without pose) to produce a fully latent scene representation that can be passed to a decoder model (e.g., a decoder transformer network). Given one or more target poses, the decoder model may synthesize an image in a single forward pass. In some example implementations, because transformers are used instead of convolutional or MLP networks, the encoder may learn an attention model that extracts enough 3D information about the scene from a small set of images to render novel views with correct projection, parallax, occlusion, and even semantics, without explicit geometry.
More specifically, one example objective of the present disclosure is to synthesize new images (e.g., RGB images) at interactive rates from one or more input images captured in an environment (e.g., an outdoor environment). This goal is important for virtual exploration of urban spaces such as streets, as well as for other mapping, visualization, and AR/VR applications. The main challenge is to learn a scene representation that encodes enough 3D information to render novel views with correct parallax and occlusions.
The example methods provided herein train an encoder transformer that receives one or more images (optionally with poses) and generates a latent scene representation. The scene representation is input to a decoder transformer along with camera rays, and the decoder produces output image data (e.g., RGB radiance).
The encoder and decoder may be trained jointly on a large database of images. For example, the training images may include multiple sets of images (e.g., image tuples), where all images in a set (e.g., tuple) view overlapping regions of the same scene. Once the encoder and decoder are pre-trained, the encoder may be used to generate a latent scene representation from any set of one or more new images, and the decoder may be used to generate multiple novel images directly from the latent scene representation without further training.
Because of the above approach, inference for novel views is extremely efficient. This is in contrast to previous approaches that require training an entirely new model for each different scene and/or performing expensive rendering computations such as integration and distribution sampling. Thus, the present disclosure is able to render composite images in a more computationally efficient manner, thereby saving computing resources such as processor usage, memory usage, and network bandwidth. This saving of computing resources is a technical effect and advantage and represents an improvement in the computer itself.
One principle of this approach is to learn a prior over scene representations using a large database of image collections. In some implementations, the encoder does not need accurate camera poses for projection, because it learns an attention model that extracts a 3D scene representation that allows novel view synthesis without explicit 3D-to-2D projection. Thus, it can be trained on large multi-view image datasets in which the approximate geographic location is known but the exact camera pose is not (one example is street-level imagery, but any geotagged image collection or autonomous driving dataset can be used). In particular, in some implementations, no camera pose is required at all to render a new scene; the "approximate geographic location" is already implied by the fact that the input images are "nearby" and/or at least partially overlapping.
Furthermore, while example implementations are discussed with respect to generating composite RGB images, the learned latent scene representation also encodes enough information to perform semantic segmentation and other image or scene analysis tasks, even though it is trained only for novel view synthesis.
Further, other aspects of the disclosure generally relate to image synthesis. More particularly, these aspects of the present disclosure relate to object-centric novel view image synthesis. As an example, a computing system may obtain a plurality of latent representation encodings (e.g., "slot" encodings) of one or more input images representing a scene depicting a plurality of objects. The plurality of latent representation encodings may correspond to a plurality of portions of the scene. At least a subset of the plurality of portions may depict the plurality of objects. For example, a first latent representation encoding may correspond to a first portion of the image depicting a first object. A second latent representation encoding may correspond to a second portion of the image depicting a first half of a second object, and a third latent representation encoding may correspond to a third portion of the image depicting the second half of the second object. A fourth latent representation encoding may correspond to a fourth portion of the image that does not depict an object.
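The following is a minimal, non-limiting sketch of one way such latent representation encodings ("slots") could be pooled from a set of scene-embedding tokens, loosely following slot-attention-style iterative attention; the module names, shapes, and hyperparameters (e.g., 32 slots, three iterations) are illustrative assumptions rather than features recited by this disclosure.

```python
import torch
import torch.nn as nn

class SlotPoolingSketch(nn.Module):
    """Pools scene-embedding tokens into a small set of latent 'slot' codes."""
    def __init__(self, n_slots=32, d_model=256, n_iters=3):
        super().__init__()
        self.slots_init = nn.Parameter(0.02 * torch.randn(n_slots, d_model))
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        self.update = nn.GRUCell(d_model, d_model)
        self.n_iters = n_iters

    def forward(self, tokens):                                    # tokens: (N, d) scene-embedding tokens
        slots = self.slots_init                                   # (S, d) initial slot codes
        k, v = self.to_k(tokens), self.to_v(tokens)
        for _ in range(self.n_iters):
            logits = self.to_q(slots) @ k.T / k.shape[-1] ** 0.5  # (S, N) slot-to-token scores
            attn = torch.softmax(logits, dim=0)                   # slots compete for each token
            attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)  # normalize per slot
            updates = attn @ v                                    # (S, d) weighted token averages
            slots = self.update(updates, slots)                   # recurrent slot update
        return slots                                              # (S, d) latent representation encodings
```

Because the softmax is taken over the slot axis, the slots compete for input tokens, which is one way individual slots can come to specialize on individual objects or background regions.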
To synthesize a novel view rendering of a scene, the computing system may use a plurality of query ray projections to determine the colors of a corresponding plurality of pixels. In particular, for each of the plurality of query ray projections, the computing system may generate a feature embedding by processing the plurality of latent representation encodings and the respective query ray projection with a transformer sub-model of a machine-learned decoding model. The computing system may then process the feature embedding and a fixed ordering of the plurality of latent representation encodings (e.g., a matrix describing the fixed ordering) using a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection. More generally, the weights of the weighted average may indicate the relevance of each latent representation encoding to the respective query ray projection.
Once the weighted average is obtained, it may be processed along with the original query ray projection to obtain a color prediction for the corresponding pixel of the plurality of pixels. This may be performed iteratively for each of the plurality of pixels to collectively form image data representing the scene from a perspective different from the original perspective(s) of the one or more input images. In this way, the machine-learned decoding model can be used to determine the predicted color of each pixel in a single "pass," thereby efficiently synthesizing an object-centric novel view of the scene.
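A minimal sketch of such a per-ray decoder is shown below, assuming slot codes such as those produced by the pooling sketch above and a 6-D ray parameterization (3-D origin plus 3-D direction). The division into a transformer sub-model, a weighting sub-model, and a rendering sub-model mirrors the description above, but the specific layers, dot-product weighting, and dimensions are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class SlotMixerDecoderSketch(nn.Module):
    """Per query ray: gather features from all slots, mix the slots, render one color."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ray_proj = nn.Linear(6, d_model)         # 6-D query ray: origin + direction
        # Transformer sub-model: lets each query ray attend over all slot codes.
        self.alloc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Rendering sub-model: maps the mixed slot code plus the ray to an RGB color.
        self.render = nn.Sequential(
            nn.Linear(d_model + 6, d_model), nn.ReLU(),
            nn.Linear(d_model, 3), nn.Sigmoid())

    def forward(self, slots, rays):
        # slots: (1, S, d) latent representation encodings; rays: (1, R, 6) query rays.
        q = self.ray_proj(rays)
        feat, _ = self.alloc(query=q, key=slots, value=slots)      # (1, R, d) feature embedding
        # Weighting sub-model: score each slot against the per-ray feature embedding.
        weights = torch.softmax(
            torch.einsum('brd,bsd->brs', feat, slots), dim=-1)     # (1, R, S) weights over slots
        mixed = torch.einsum('brs,bsd->brd', weights, slots)       # weighted average of the slots
        rgb = self.render(torch.cat([mixed, rays], dim=-1))        # (1, R, 3) color predictions
        return rgb, weights               # the weights can also be visualized (cf. FIGS. 7 and 8)
```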
The systems and methods of the present disclosure provide a variety of technical effects and advantages. As one example technical effect and advantage, conventional novel view synthesis models typically require supervised learning. However, it is extremely difficult to collect annotated three-dimensional image data for such supervised learning. By enabling unsupervised training of machine learning models for object-centric novel view synthesis, embodiments of the present disclosure substantially reduce or eliminate the high costs associated with supervised training.
As another example technical effect and advantage, conventional novel view synthesis techniques generally do not scale as the number of objects and/or the visual complexity of the scene increases. Conventional models decode each object independently, which adds a significant multiplicative factor to an already expensive volume rendering process that typically requires hundreds of decoding steps. The resulting requirement to perform thousands or more decoding operations for each rendered pixel consumes a significant amount of computational and memory resources. In contrast, the machine learning models of the present disclosure can perform pixel value prediction efficiently in a single pass, regardless of the number of objects, thereby greatly reducing the computational resources (e.g., memory, power, compute cycles, storage, etc.) required for object-centric novel view synthesis.
Referring now to the drawings, example embodiments of the present disclosure will be discussed in more detail.
FIG. 1 depicts a block diagram of an example machine learning model, according to an example embodiment of the present disclosure. In particular, FIG. 1 depicts a block diagram of a method for more efficiently generating a new view of a scene. The model arrangement in FIG. 1 encodes a set of images into a set of latent features that form a representation of the scene. Novel views may be rendered at interactive rates by attending to the latent representation with rays (e.g., 6D light field rays).
Referring now to FIG. 1, a computing system may obtain one or more input images 12 depicting a scene. In some implementations, the one or more input images 12 may include a plurality of input images captured at a plurality of different poses, respectively, relative to the scene. In some implementations, one or more of the input images 12 are unposed images having unspecified poses relative to the scene.
The computing system may generate one or more image embeddings 14 for one or more input images 12, respectively. In some implementations, generating, by the computing system, one or more image embeddings 14 for one or more input images 12, respectively, may include processing, by the computing system, the one or more input images 12 using the convolutional neural network 16 to generate the one or more image embeddings 14, respectively.
In some implementations, generating, by the computing system, the one or more image embeddings 14 for the one or more input images 12, respectively, includes generating, by the computing system, one or more learned position embeddings for the one or more input images 12.
The computing system may process the one or more image embeddings 14 using the machine-learning encoder model 18 to generate a scene embedding 20 representative of the scene.
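Purely for illustration, a minimal sketch of one possible instantiation of the pipeline up to this point (a CNN 16 followed by an encoder model 18) is given below; the patch size, image resolution, token counts, and use of standard PyTorch modules are assumptions and are not prescribed by this disclosure.

```python
import torch
import torch.nn as nn

class SceneEncoderSketch(nn.Module):
    """Encodes a set of input images into a set-latent scene embedding."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4, patch=8):
        super().__init__()
        # CNN stand-in: a single strided convolution producing per-patch features.
        self.cnn = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Learned position embedding, one vector per patch (assumes 128x128 inputs -> 16x16 grid).
        self.pos = nn.Parameter(0.02 * torch.randn(16 * 16, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, images):                            # images: (V, 3, 128, 128), V input views
        feats = self.cnn(images)                          # (V, d, 16, 16) patch features
        tokens = feats.flatten(2).transpose(1, 2)         # (V, 256, d) one token per patch
        tokens = tokens + self.pos                        # add learned position embeddings
        tokens = tokens.reshape(1, -1, tokens.shape[-1])  # pool tokens from all views into one set
        return self.transformer(tokens)                   # scene embedding: (1, V*256, d)
```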
The computing system may obtain ray data 22 describing one or more ray projections for a predicted image of the scene. As an example, the ray data 22 may include one or more sets of 5-D or 6-D ray information corresponding to one or more pixels in the predicted image, respectively.
The computing system may process the scene embedding 20 and the ray data 22 using the machine-learning decoder model 24 to generate composite image data for one or more ray projections of a predicted image 26 of the scene. In some implementations, the composite image data for each ray projection may be or include color data for a pixel of the predicted image 26 corresponding to that ray projection. The computing system may provide the predicted image 26 of the scene as an output.
In some implementations, one or both of the machine-learning encoder model 18 and the machine-learning decoder model 24 may be or include a self-attention model, such as, for example, a transformer model.
For example, in some implementations, the machine learning decoder model 24 may include a self-attention model, and processing, by the computing system, the scene embedding 20 and the ray data 22 using the machine learning decoder model 24 may include: generating key and value data elements from the scene embedding 20; generating query data elements from the ray data 22; and processing, by the computing system, the key, value, and query data elements using the machine learning decoder model 24 to generate the composite image data for the one or more ray projections.
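The following sketch illustrates that key/value/query pattern, assuming the scene embedding 20 is a set of tokens (e.g., from the encoder sketch above) and the ray data 22 provides a 6-D origin-plus-direction vector per target pixel; the layer sizes and the sigmoid color head are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class RayDecoderSketch(nn.Module):
    """Cross-attends query rays to the scene embedding and predicts one color per ray."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.ray_proj = nn.Linear(6, d_model)        # query: 6-D ray (3-D origin + 3-D direction)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rgb_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 3), nn.Sigmoid())     # colors in [0, 1]

    def forward(self, scene_embedding, rays):
        # scene_embedding: (1, N, d) tokens from the encoder -> keys and values.
        # rays: (1, R, 6), one ray per pixel of the predicted image -> queries.
        q = self.ray_proj(rays)
        attended, _ = self.attn(query=q, key=scene_embedding, value=scene_embedding)
        return self.rgb_head(attended)               # (1, R, 3) colors in a single forward pass
```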
In some implementations, the machine learning encoder model 18 and the machine learning decoder model 24 have been jointly trained using a shared loss function. In some implementations, at least the machine learning encoder model 18 has been pre-trained using different images depicting different scenes.
In some implementations, during training, the computing system may evaluate a loss function that compares the composite image data for one or more ray projections (e.g., included in the predicted image 26) to ground-truth image data for the one or more ray projections (e.g., included in a ground-truth image of the scene that is not included in the input images 12). For example, the loss function may evaluate the distance in color space between the predicted color of each pixel and the ground-truth color of that pixel.
During training, the computing system may modify one or more values of one or more parameters of the machine-learned decoder model 24 and/or the machine-learned encoder model 18 based at least in part on the loss function. For example, the loss function may be backpropagated through the decoder model 24, then optionally also through the encoder model 18, and optionally also through the feature extraction model (e.g., the CNN 16).
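A minimal training-step sketch, assuming the encoder and decoder sketches above and a standard PyTorch optimizer, could look as follows; it is illustrative only.

```python
import torch

def training_step(encoder, decoder, optimizer, images, rays, target_rgb):
    """One optimization step: render the query rays and regress to ground-truth colors."""
    scene_embedding = encoder(images)                 # (1, N, d) scene embedding
    pred_rgb = decoder(scene_embedding, rays)         # (1, R, 3) predicted colors
    loss = torch.mean((pred_rgb - target_rgb) ** 2)   # per-pixel L2 loss in color space
    optimizer.zero_grad()
    loss.backward()                                   # gradients reach the decoder, encoder, and CNN
    optimizer.step()
    return loss.item()
```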
FIG. 2 depicts a block diagram of an example machine learning model, according to an example embodiment of the present disclosure. In the example model shown in FIG. 2, given optionally posed RGB inputs, a CNN extracts patch features, to which learned embeddings of 2D position and camera ID are added. An encoder transformer performs self-attention over the patch embeddings, thereby generating the scene representation. The decoder attends into the scene representation with a given 6D ray pose, producing the final RGB output.
Fig. 3A depicts a block diagram of an example computing system 100, according to an example embodiment of the disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 communicatively coupled by a network 180.
The user computing device 102 may be any type of computing device, such as, for example, a personal computing device (e.g., a laptop computer or desktop computer), a mobile computing device (e.g., a smart phone or tablet computer), a game console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
The user computing device 102 includes one or more processors 112 and memory 114. The one or more processors 112 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or operatively connected multiple processors. Memory 114 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, disks, and the like, as well as combinations thereof. Memory 114 may store data 116 and instructions 118 that are executed by processor 112 to cause user computing device 102 to perform operations.
In some implementations, the user computing device 102 may store or include one or more machine learning models 120. For example, the machine learning models 120 may be or may otherwise include various machine learning models such as neural networks (e.g., deep neural networks) or other types of machine learning models, including nonlinear models and/or linear models. The neural networks may include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks. Some example machine learning models may utilize an attention mechanism, such as self-attention. For example, some example machine learning models may include a multi-headed self-attention model (e.g., a transformer model).
In some implementations, the one or more machine learning models 120 may be received from the server computing system 130 over the network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 may implement multiple parallel instances of a single machine learning model 120 (e.g., to perform parallel image synthesis or other image processing across multiple instances of an image scene).
Additionally or alternatively, one or more machine learning models 140 (e.g., machine-learned decoding models, etc.) may be included in or otherwise stored and implemented by a server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the machine learning models 140 may be implemented by the server computing system 130 as part of a web service (e.g., an image processing service). Accordingly, one or more models 120 may be stored and implemented at the user computing device 102 and/or one or more models 140 may be stored and implemented at the server computing system 130.
The user computing device 102 may also include one or more user input components 122 that receive user input. For example, the user input component 122 may be a touch-sensitive component (e.g., a touch-sensitive display screen or touchpad) that is sensitive to touch by a user input object (e.g., a finger or stylus). The touch sensitive component may be used to implement a virtual keyboard. Other example user input components include a microphone, a conventional keyboard, or other device through which a user may provide user input.
The server computing system 130 includes one or more processors 132 and memory 134. The one or more processors 132 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or operatively connected multiple processors. Memory 134 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, and the like, as well as combinations thereof. Memory 134 may store data 136 and instructions 138 that are executed by processor 132 to cause server computing system 130 to perform operations.
In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. Where the server computing system 130 includes multiple server computing devices, such server computing devices may operate in accordance with a sequential computing architecture, a parallel computing architecture, or some combination thereof.
As described above, the server computing system 130 may store or otherwise include one or more machine learning models 140. For example, the models 140 may be or may otherwise include various machine learning models. Example machine learning models include neural networks or other multi-layer nonlinear models. Example neural networks include feed-forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Some example machine learning models may utilize an attention mechanism, such as self-attention. For example, some example machine learning models may include a multi-headed self-attention model (e.g., a transformer model).
The user computing device 102 and/or the server computing system 130 may train the models 120 and/or 140 via interactions with a training computing system 150 communicatively coupled through a network 180. The training computing system 150 may be separate from the server computing system 130 or may be part of the server computing system 130.
The training computing system 150 includes one or more processors 152 and memory 154. The one or more processors 152 may be any suitable processing device (e.g., a processor core, microprocessor, ASIC, FPGA, controller, microcontroller, etc.) and may be one processor or operatively connected multiple processors. Memory 154 may include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, disks, and the like, as well as combinations thereof. Memory 154 may store data 156 and instructions 158 that are executed by processor 152 to cause training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
Training computing system 150 may include a model trainer 160 that trains the machine learning models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function may be backpropagated through the model to update one or more parameters of the model (e.g., based on a gradient of the loss function). Various loss functions may be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques may be used to iteratively update the parameters over a number of training iterations.
In some implementations, performing backwards propagation of errors may include performing truncated backpropagation through time. Model trainer 160 may perform a number of generalization techniques (e.g., weight decay, dropout, etc.) to improve the generalization capability of the models being trained.
In particular, model trainer 160 may train machine learning models 120 and/or 140 based on a set of training data 162. In some implementations, if the user has provided consent, the training examples may be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 may be trained by the training computing system 150 based on user-specific data received from the user computing device 102. In some instances, this process may be referred to as personalizing the model.
Model trainer 160 includes computer logic for providing the desired functionality. Model trainer 160 may be implemented in hardware, firmware, and/or software that controls a general purpose processor. For example, in some implementations, model trainer 160 includes program files stored on a storage device, loaded into memory, and executed by one or more processors. In other implementations, model trainer 160 includes one or more sets of computer-executable instructions stored in a tangible computer-readable storage medium (such as RAM, a hard disk, or an optical or magnetic medium).
The network 180 may be any type of communication network, such as a local area network (e.g., an intranet), a wide area network (e.g., the internet), or some combination thereof, and may include any number of wired or wireless links. In general, communication over the network 180 may be performed via any type of wired and/or wireless connection using a variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), coding or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
In some implementations, the input to the machine learning model of the present disclosure can be image data. The machine learning model may process the image data to generate an output. As one example, the machine learning model may process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine learning model may process the image data to generate an image segmentation output. As another example, the machine learning model may process the image data to generate an image classification output. As another example, the machine learning model may process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine learning model may process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine learning model may process the image data to generate an upscaled image data output. As another example, the machine learning model may process the image data to generate a prediction output.
In some implementations, the input of the machine learning model of the present disclosure may be latent encoding data (e.g., a latent space representation of the input, etc.). The machine learning model may process the latent encoding data to generate an output. As one example, the machine learning model may process the latent encoding data to generate a recognition output. As another example, the machine learning model may process the latent encoding data to generate a reconstruction output. As another example, the machine learning model may process the latent encoding data to generate a search output. As another example, the machine learning model may process the latent encoding data to generate a re-clustering output. As another example, the machine learning model may process the latent encoding data to generate a prediction output.
In some cases, the machine learning model may be configured to perform tasks that include encoding input data to enable reliable and/or efficient transmission or storage (and/or corresponding decoding). For example, the task may be an audio compression task. The input may comprise audio data and the output may comprise compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output includes compressed visual data, and the task is a visual data compression task. In another example, a task may include generating an embedding for input data (e.g., input audio or visual data).
In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, an image processing task may be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to that object class. The image processing task may be object detection, wherein the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that the region depicts an object of interest. As another example, the image processing task may be image segmentation, wherein the image processing output defines a respective likelihood for each of a set of predetermined categories for each pixel in the one or more images. For example, the set of categories may be foreground and background. As another example, the set of categories may be object classes. As another example, the image processing task may be depth estimation, wherein the image processing output defines a respective depth value for each pixel in the one or more images. As another example, the image processing task may be motion estimation, wherein the network input comprises a plurality of images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
FIG. 3A illustrates one example computing system that may be used to implement the present disclosure. Other computing systems may also be used. For example, in some implementations, the user computing device 102 may include a model trainer 160 and a training data set 162. In such implementations, the model 120 may be trained and used locally at the user computing device 102. In some such implementations, the user computing device 102 may implement the model trainer 160 to personalize the model 120 based on user-specific data.
Fig. 3B depicts a block diagram of an example computing device 10, performed in accordance with an example embodiment of the present disclosure. The computing device 10 may be a user computing device or a server computing device.
Computing device 10 includes a plurality of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine learning model. For example, each application may include a machine learning model. Example applications include text messaging applications, email applications, dictation applications, virtual keyboard applications, browser applications, and the like.
As shown in fig. 3B, each application may communicate with many other components of the computing device (such as, for example, one or more sensors, a context manager, a device state component, and/or additional components). In some implementations, each application may communicate with each device component using an API (e.g., public API). In some implementations, the API used by each application is specific to that application.
Fig. 3C depicts a block diagram of an example computing device 50, performed in accordance with an example embodiment of the present disclosure. The computing device 50 may be a user computing device or a server computing device.
Computing device 50 includes a plurality of applications (e.g., applications 1 through N). Each application communicates with a central intelligent layer. Example applications include text messaging applications, email applications, dictation applications, virtual keyboard applications, browser applications, and the like. In some implementations, each application may use an API (e.g., a public API across all applications) to communicate with the central intelligence layer (and the models stored therein).
The central intelligence layer includes a plurality of machine learning models. For example, as shown in FIG. 3C, a respective machine learning model may be provided for each application and managed by a central intelligent layer. In other implementations, two or more applications may share a single machine learning model. For example, in some implementations, the central intelligence layer may provide a single model for all applications. In some implementations, the central intelligence layer is included within or otherwise implemented by the operating system of the computing device 50.
The central intelligence layer may communicate with the central device data layer. The central device data layer may be a centralized data store of computing devices 50. As shown in fig. 3C, the central device data layer may be in communication with many other components of the computing device (such as, for example, one or more sensors, a context manager, a device status component, and/or additional components). In some implementations, the central device data layer may communicate with each device component using an API (e.g., a proprietary API).
Fig. 4 illustrates a block diagram of an example model architecture, according to some embodiments of the present disclosure.
FIG. 5 is a block diagram illustrating an example decoder model.
FIG. 6 illustrates an example novel view image synthesized in accordance with some embodiments of the present disclosure.
Figs. 7 and 8 illustrate example intermediate object-aware outputs according to some embodiments of the present disclosure. It should be noted that the latent representation encodings describe not only the objects, but also parts of the background. This is in part because the objects cast shadows on the background of the scene. For example, for each pixel, the corresponding ray may be used to determine the slot with the highest weight in the mixing block (with each slot assigned a different color for visualization). For example, if 32 latent representation encodings are used but there are not 32 objects in the scene, some of the latent representation encodings typically go unused, or some of the objects are split across latent representation encodings.
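As a concrete illustration of the visualization described above, the per-ray mixing weights (e.g., the second, squeezed output of the slot-mixer decoder sketch shown earlier) can be converted into a pseudo-segmentation map by taking, for each ray, the index of the highest-weight slot; the helper below is an illustrative assumption, not part of the disclosed method.

```python
import torch

def slot_segmentation_map(weights, height, width):
    """Converts per-ray slot weights into a pseudo-segmentation map for visualization."""
    # weights: (R, S) mixing weights for R = height * width rays over S slots.
    slot_ids = weights.argmax(dim=-1)          # index of the highest-weight slot per ray
    return slot_ids.reshape(height, width)     # (H, W) map; assign each slot id its own color
```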
Fig. 9 depicts a flowchart of an example method for generating a geometry-free neural scene representation through novel view synthesis, according to an example embodiment of the present disclosure. Although Fig. 9 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particular order or arrangement shown. The various steps of method 900 may be omitted, rearranged, combined, or adjusted in various ways without departing from the scope of the present disclosure.
At 902, a computing system may obtain one or more input images depicting a scene.
At 904, the computing system may generate one or more image embeddings for the one or more input images, respectively.
At 906, the computing system may process the one or more image embeddings using a machine-learned encoder model to generate a scene embedding representative of the scene.
At 908, the computing system may obtain ray data describing one or more ray casting of a predicted image for a scene.
At 910, the computing system may process the scene embedding and the ray data using a machine-learned decoder model to generate composite image data for the one or more ray projections and may provide a predicted image of the scene as an output.
FIG. 10 depicts a flowchart of an example method for performing object-centric novel view synthesis, in accordance with an example embodiment of the present disclosure. Although FIG. 10 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particular order or arrangement shown. The various steps of method 1000 may be omitted, rearranged, combined, or adjusted in various ways without departing from the scope of the present disclosure.
At 1002, a computing system may obtain a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects.
At 1004, for each of a plurality of query ray projections respectively associated with a plurality of pixels, the computing system may process the plurality of latent representation encodings and the respective query ray projection with a transformer sub-model of a machine-learned decoding model to generate a feature embedding.
At 1006, for each of the plurality of query ray projections, the computing system may process the feature embedding and a fixed ordering of the plurality of latent representation encodings using a weighting sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray projection.
At 1008, for each of the plurality of query ray projections, the computing system may process the respective query ray projection and the weighted average using a rendering sub-model of the machine-learned decoding model to obtain a color prediction for the respective pixel of the plurality of pixels.
Additional disclosure
The technology discussed herein relates to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a wide variety of possible configurations, combinations, and divisions of tasks and functions between components. For example, the processes discussed herein may be implemented using a single device or component or multiple devices or components working in combination. The database and application may be implemented on a single system or distributed across multiple systems. Distributed components may operate sequentially or in parallel.
While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation and not limitation of the present disclosure. Modifications, variations and equivalents of such embodiments may readily occur to those skilled in the art upon review of the foregoing description. Accordingly, the present disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For example, features illustrated or described as part of one embodiment can be used with another embodiment to yield still a further embodiment. Accordingly, the present disclosure is intended to cover such alterations, modifications, and equivalents.

Claims (39)

1. A computer-implemented method for more efficiently generating a new view of a scene, the method comprising:
Obtaining, by a computing system comprising one or more computing devices, one or more input images depicting a scene;
generating, by the computing system, one or more image embeddings for the one or more input images, respectively;
Processing, by the computing system, the one or more image embeddings with a machine-learning encoder model to generate a scene embedding representative of the scene;
obtaining, by the computing system, ray data describing one or more ray castings for a predicted image of the scene;
processing, by the computing system, the scene embedding and the ray data using a machine-learning decoder model to generate composite image data for the one or more ray castings of the predicted image of the scene; and
providing, by the computing system, the predicted image of the scene as an output.
2. The computer-implemented method of any preceding claim, wherein one or both of the machine learning encoder model and the machine learning decoder model comprise a self-attention model.
3. The computer-implemented method of any preceding claim, wherein the machine learning encoder model and the machine learning decoder model have been jointly trained using a shared loss function.
4. The computer-implemented method of any preceding claim, wherein at least the machine learning encoder model has been pre-trained using different images depicting different scenes.
5. The computer-implemented method of any preceding claim, wherein generating, by the computing system, the one or more image embeddings for the one or more input images, respectively, comprises processing, by the computing system, the one or more input images with a convolutional neural network to generate the one or more image embeddings, respectively.
6. The computer-implemented method of any preceding claim, wherein generating, by the computing system, the one or more image embeddings for the one or more input images, respectively, comprises generating, by the computing system, one or more learned position embeddings for the one or more input images.
7. The computer-implemented method of any preceding claim, wherein:
the machine learning decoder model includes a self-attention model; and
Processing, by the computing system, the scene embedding and the ray data using the machine learning decoder model includes:
generating key and value data elements from the scene embedding;
Generating a query data element from the ray data; and
processing, by the computing system, the key, value, and query data elements using the machine learning decoder model to generate the composite image data for the one or more ray castings.
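The sketch below makes the key, value, and query generation of the preceding claim explicit: keys and values are derived from the scene embedding, the query is derived from the ray data, and scaled dot-product attention produces the attended features. The projection sizes and the six-number ray encoding are illustrative assumptions, not a prescribed implementation.

```python
import math
import torch
import torch.nn as nn

class RayCrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.to_key = nn.Linear(dim, dim)     # key data elements from the scene embedding
        self.to_value = nn.Linear(dim, dim)   # value data elements from the scene embedding
        self.to_query = nn.Linear(6, dim)     # query data elements from the ray data

    def forward(self, scene_embedding, ray_data):
        k = self.to_key(scene_embedding)      # (B, N, D)
        v = self.to_value(scene_embedding)    # (B, N, D)
        q = self.to_query(ray_data)           # (B, R, D)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v                       # attended features, one row per ray casting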
8. The computer-implemented method of any preceding claim, wherein the composite image data for each ray casting includes color data for a pixel of the predicted image corresponding to the ray casting.
9. The computer-implemented method of any preceding claim, wherein the one or more input images comprise a plurality of input images captured in a plurality of different poses, respectively, relative to the scene.
10. The computer-implemented method of any preceding claim, wherein the one or more input images comprise an unposed image having an unspecified pose with respect to the scene.
11. The computer-implemented method of any preceding claim, further comprising:
Evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray castings to ground-truth image data for the one or more ray castings; and
Modifying, by the computing system, one or more values of one or more parameters of the machine learning decoder model based at least in part on the loss function.
12. The computer-implemented method of any preceding claim, further comprising:
Evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray castings to ground-truth image data for the one or more ray castings; and
Modifying, by the computing system, one or more values of one or more parameters of both the machine-learned decoder model and the machine-learned encoder model based at least in part on the loss function.
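A hedged sketch of the training behavior recited in claims 11 and 12 follows: composite image data is rendered for a set of rays, compared against ground-truth pixel values, and the resulting loss drives parameter updates. The mean-squared-error loss and the optimizer interface are assumptions; registering only the decoder parameters with the optimizer mirrors claim 11, while also registering the encoder parameters mirrors claim 12. The encoder and decoder are assumed to follow the interfaces of the earlier sketches.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, optimizer, images, rays, target_pixels):
    """One gradient update on a batch of scenes and query rays."""
    scene_embedding = encoder(images)                  # scene embedding from input images
    predicted = decoder(scene_embedding, rays)         # composite image data, one color per ray
    loss = F.mse_loss(predicted, target_pixels)        # compare to ground-truth image data
    optimizer.zero_grad()
    loss.backward()                                    # gradients flow to all registered parameters
    optimizer.step()
    return loss.item()
```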
13. A computing system for more efficiently generating scene-specific predictive images, the computing system comprising:
One or more processors; and
One or more non-transitory computer-readable media collectively storing instructions that, when executed by the one or more processors, cause the computing system to perform operations comprising:
obtaining a scene embedding generated by a machine-learned encoder model from one or more images of a scene, wherein the scene embedding represents the scene;
obtaining, by the computing system, ray data describing one or more ray castings for a predicted image of the scene;
processing, by the computing system, the scene embedding and the ray data using a machine-learning decoder model to generate composite image data for the one or more ray castings of the predicted image of the scene; and
providing, by the computing system, the predicted image of the scene as an output.
14. The computing system of claim 13, wherein one or both of the machine learning encoder model and the machine learning decoder model comprise a self-attention model.
15. The computing system of claim 13 or 14, wherein the machine learning encoder model and the machine learning decoder model have been jointly trained using a shared loss function.
16. The computing system of claim 13, 14 or 15, wherein at least the machine learning encoder model has been pre-trained using different images depicting different scenes.
17. One or more non-transitory computer-readable media storing instructions that, when executed by a computing system, cause the computing system to perform operations comprising:
Obtaining, by the computing system, one or more input images depicting a scene;
generating, by the computing system, one or more image embeddings for the one or more input images, respectively;
Processing, by the computing system, the one or more image embeddings with a machine-learning encoder model to generate a scene embedding representative of the scene;
obtaining, by the computing system, ray data describing one or more ray castings for a predicted image of the scene;
processing, by the computing system, the scene embedding and the ray data using a machine-learning decoder model to generate composite image data for the one or more ray castings of the predicted image of the scene;
Evaluating, by the computing system, a loss function that compares the composite image data for the one or more ray castings to ground-truth image data for the one or more ray castings; and
Modifying, by the computing system, one or more values of one or more parameters of the machine learning decoder model based at least in part on the loss function.
18. The one or more non-transitory computer-readable media of claim 17, wherein the operations further comprise modifying, by the computing system, one or more values of one or more parameters of the machine-learned encoder model based at least in part on the loss function.
19. The one or more non-transitory computer-readable media of claim 17 or 18, wherein one or both of the machine-learned encoder model and the machine-learned decoder model comprise a self-attention model.
20. The one or more non-transitory computer-readable media of claim 17, 18, or 19, wherein at least the machine-learning encoder model has been pre-trained using different images depicting different scenes.
21. A computer-implemented method for performing object-centric efficient new view synthesis, comprising:
obtaining, by a computing system comprising one or more computing devices, a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of parts of the scene, wherein at least a subset of the plurality of parts depict the plurality of objects;
for each of a plurality of query ray casts respectively associated with a plurality of pixels:
Processing, by the computing system, the plurality of latent representation encodings and the corresponding query ray cast using a transformer sub-model of a machine-learning decoding model to generate a feature embedding;
processing, by the computing system, the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighted sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray cast; and
Processing, by the computing system, the respective query ray cast and the weighted average using a rendering sub-model of the machine learning decoding model to obtain a color prediction for the respective pixel of the plurality of pixels.
22. The computer-implemented method of claim 21, wherein:
the one or more input images represent the scene from a first perspective; and
the plurality of pixels collectively form composite image data representing the scene from a second perspective different from the first perspective.
23. The computer-implemented method of claim 22, wherein the method further comprises:
Evaluating, by the computing system, a loss function that compares the composite image data to ground-truth image data representing the scene from the second perspective; and
Modifying, by the computing system, one or more values of one or more parameters of the machine learning decoding model based at least in part on the loss function.
24. The computer-implemented method of claim 21, wherein prior to obtaining the plurality of latent representation encodings, the method comprises:
determining, by the computing system, the plurality of latent representation encodings from a scene embedding representing the scene using a machine-learned attention model.
25. The computer-implemented method of claim 24, wherein prior to determining the plurality of latent representation encodings, the method comprises:
processing, by the computing system, the one or more input images using a machine learning coding model to obtain the scene embedding.
26. The computer-implemented method of claim 25, wherein processing the one or more input images comprises:
determining, by the computing system, a set of tokens based on the one or more images;
processing, by the computing system, the set of tokens with a self-attention portion of the machine learning coding model to obtain the scene embedding.
27. The computer-implemented method of claim 25, wherein one or more of the machine learning coding model, the machine learning attention model, or the machine learning decoding model has been jointly trained using a shared loss function.
28. The computer-implemented method of any of claims 21 to 27, wherein the machine learning coding model comprises a self-attention model.
29. The computer-implemented method of any of claims 21 to 28, wherein the machine learning decoding model comprises a self-attention model.
30. The computer-implemented method of any of claims 21 to 29, wherein the feature embedding indicates a relevance of each of the plurality of latent representation encodings to the respective query ray cast.
31. The computer-implemented method of any of claims 21 to 30, wherein processing the feature embedding and the fixed ordering of the plurality of latent representation encodings with the weighted sub-model of the machine-learned decoding model comprises:
Determining, by the computing system, a corresponding plurality of scalar weight values for the plurality of latent representation encodings using the weighted sub-model of the machine learning decoding model; and
Determining, by the computing system, the weighted average based on the respective plurality of scalar weight values.
32. The computer-implemented method of any of claims 21 to 31, wherein the rendering sub-model includes one or more multi-layer perceptrons.
33. The computer-implemented method of any of claims 21 to 32, wherein the plurality of query ray casts represents a six-dimensional light field parameterization of the scene.
34. The computer-implemented method of any of claims 21 to 33, wherein each query ray cast corresponds to a camera position and a normalized ray direction pointing from the camera through a center of the pixel associated with the query ray cast.
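The sketch below shows one conventional way to form the query ray casts recited in claims 33 and 34: each pixel is assigned the camera position plus a normalized direction through its center, yielding a six-dimensional light-field coordinate per pixel. The pinhole-camera intrinsics and the half-pixel center offset are assumptions made for illustration.

```python
import torch

def pixel_rays(camera_to_world, fx, fy, cx, cy, height, width):
    """Return an (H, W, 6) tensor of query rays: (origin, unit direction) per pixel."""
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing='ij')
    # direction through each pixel center in camera coordinates
    dirs = torch.stack([(xs + 0.5 - cx) / fx,
                        (ys + 0.5 - cy) / fy,
                        torch.ones_like(xs)], dim=-1)            # (H, W, 3)
    rot, origin = camera_to_world[:3, :3], camera_to_world[:3, 3]
    dirs = dirs @ rot.T                                          # rotate into the world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                # normalized ray direction
    origins = origin.expand(height, width, 3)                    # camera position per pixel
    return torch.cat([origins, dirs], dim=-1)                    # (H, W, 6) query ray casts
```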
35. A computing system for performing object-centric efficient new view synthesis, comprising:
one or more processors;
One or more non-transitory computer-readable media storing instructions that, when executed by the one or more processors, cause the computing system to perform operations comprising:
Obtaining a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of parts of the scene, wherein at least a subset of the plurality of parts depicts the plurality of objects;
for each of a plurality of query ray casts respectively associated with a plurality of pixels:
Processing the plurality of latent representation encodings and the corresponding query ray cast with a transformer sub-model of a machine learning decoding model to generate a feature embedding;
processing the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighted sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray cast; and
Processing the respective query ray cast and the weighted average with a rendering sub-model of the machine learning decoding model to obtain a color prediction for the respective pixel of the plurality of pixels.
36. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause a computing system to perform operations comprising:
Obtaining a plurality of latent representation encodings of one or more input images representing a scene depicting a plurality of objects, wherein the plurality of latent representation encodings correspond to a plurality of parts of the scene, wherein at least a subset of the plurality of parts depicts the plurality of objects;
for each of a plurality of query ray casts respectively associated with a plurality of pixels:
Processing the plurality of latent representation encodings and the corresponding query ray cast with a transformer sub-model of a machine learning decoding model to generate a feature embedding;
processing the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighted sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray cast; and
Processing the respective query ray cast and the weighted average with a rendering sub-model of the machine learning decoding model to obtain a color prediction for the respective pixel of the plurality of pixels.
37. A computer-implemented method for performing object-centric efficient new view synthesis, comprising:
Processing, by a computing system comprising one or more computing devices, one or more input images using a machine learning coding model to obtain a scene embedding of a scene depicted by the one or more input images, wherein the scene depicts a plurality of objects;
Determining, by the computing system, a plurality of latent representation encodings from the scene embedding using a machine learning attention model, wherein the plurality of latent representation encodings correspond to a plurality of portions of the scene, wherein at least a subset of the plurality of portions depict the plurality of objects; and
for each of a plurality of query ray casts respectively associated with a plurality of pixels:
Processing, by the computing system, the plurality of latent representation encodings and the corresponding query ray cast using a transformer sub-model of a machine learning decoding model to generate a feature embedding;
Processing, by the computing system, the feature embedding and a fixed ordering of the plurality of latent representation encodings with a weighted sub-model of the machine-learned decoding model to obtain a weighted average of the plurality of latent representation encodings for the respective query ray cast; and
Processing, by the computing system, the respective query ray cast and the weighted average using a rendering sub-model of the machine learning decoding model to obtain a color prediction for the respective pixel of the plurality of pixels.
38. The computer-implemented method of claim 37, wherein:
the one or more input images represent the scene from a first perspective; and
the plurality of pixels collectively form composite image data representing the scene from a second perspective different from the first perspective.
39. The method of claim 38, wherein the method further comprises:
Evaluating, by the computing system, a loss function that compares the composite image data to ground-truth image data representing the scene from the second perspective; and
Modifying, by the computing system, one or more values of one or more parameters of the machine learning coding model, the machine learning attention model, and/or the machine learning decoding model based at least in part on the loss function.

Applications Claiming Priority (2)

US 63/279,875 (priority date 2021-11-16)
US 63/343,882 (priority date 2022-05-19)

Publications (1)

CN118251699A (published 2024-06-25)



Legal Events

PB01: Publication