CN113255713A - Machine learning for digital image selection across object variations - Google Patents

Machine learning for digital image selection across object variations

Info

Publication number
CN113255713A
Authority
CN
China
Prior art keywords
digital
digital image
image
machine learning
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011260870.4A
Other languages
Chinese (zh)
Inventor
A·亚恩
S·塔格拉
S·索尼
R·T·罗齐克
N·普里
J·S·罗德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Adobe Inc
Original Assignee
Adobe Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Adobe Systems Inc
Publication of CN113255713A

Classifications

    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/535 Filtering based on additional data, e.g. user or group profiles
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06N20/00 Machine learning
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06Q30/0254 Targeted advertisements based on statistics
    • G06Q30/0255 Targeted advertisements based on user history
    • G06Q30/0261 Targeted advertisements based on user location
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0621 Item configuration or customization
    • G06Q30/0641 Shopping interfaces
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G06V2201/10 Recognition assisted with metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure relate to machine learning for digital image selection across variations in the manner in which an object is depicted. Digital image selection techniques are described that employ machine learning to select a digital image of an object from a plurality of digital images of the object. The plurality of digital images each capture the object for inclusion as part of generated digital content, such as a web page, a thumbnail for representing a digital video, and the like. As a result, the service provider system may select a digital image of the object from among the multiple digital images with an increased likelihood of obtaining a desired result, and may handle the many different ways in which the object may be presented to a user.

Description

Machine learning for digital image selection across object variations
Technical Field
Embodiments of the present disclosure relate to depicting objects in digital images, and more particularly to machine learning for digital image selection across variations in the manner in which an object is depicted.
Background
The manner in which an object is depicted in a digital image is one of the main drivers of a user's interest in the object. For example, digital images may be configured to follow popular style trends, the subject matter of popular television shows, and so forth. In such instances, the characteristics of the object itself may remain the same (e.g., color, shape), but the manner in which the object is depicted in different digital images changes. Thus, one challenge for service provider systems in determining a digital image of potential interest relates to the manner in which the object is depicted in the digital image.
This challenge is exacerbated because user preferences for the manner in which objects are depicted may vary significantly. In fact, it has been shown that each user has his or her own tastes and affinities for the way in which objects are depicted. For example, a first user may have a preference related to a favorite television show (e.g., showing an object in a mid-century modern setting), while a second user may prefer to view the object in a neutral setting, e.g., on a white background for clarity of the object's color. Thus, it would be difficult, if not impossible, for humans to determine which preferences are associated with each user, especially in the face of potentially millions of users who may access digital content that includes digital images (e.g., web pages served in real time). Conventional service provider systems also fail to address how objects are depicted in digital images. This is because conventional techniques rely on identifiers of digital images as a whole, and therefore cannot address the actual visual characteristics of the depicted object, nor the ways in which those characteristics relate to other digital images. As a result, conventional service provider systems may be inaccurate and may result in inefficient use of computing and hardware resources when recommending digital images of interest.
Disclosure of Invention
Digital image selection techniques are described that employ machine learning to select a digital image of an object from a plurality of digital images that vary in the manner in which the object is depicted. For example, multiple digital images may capture the same object but differ in how the object is depicted, such as the same piece of clothing worn by different models, shown against different backgrounds, and so forth. The likelihood of a desired result (e.g., a conversion for the goods or services represented by the object) is then increased by processing user preferences relating to these variations to select a digital image for inclusion in the digital content.
In one example, a digital image selection technique is described that employs machine learning to select a digital image of an object from a plurality of digital images that vary in how the object is depicted. First, a user ID is received by a service provider system as part of a request to obtain digital content (e.g., a web page). A user profile is then obtained by the service provider system based on the user ID. The service provider system also selects a digital image from the plurality of digital images having variations of the object for inclusion as part of the digital content.
To this end, image metadata comprising features extracted from digital images (e.g., using a convolutional neural network) is used along with a user profile to generate a prediction score for each of a plurality of digital images having variations. The digital image indicated as most likely to produce the desired result (e.g., conversion) is then selected by the system for inclusion as part of the digital content (e.g., web page). As a result, the service provider system may select a digital image of the object from among multiple digital images of the object with an increased likelihood of obtaining a desired result, and may handle many different ways in which the object may be presented to the user.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. Thus, this summary is not intended to identify key features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
Specific embodiments are described with reference to the accompanying drawings. Entities appearing in the figures may refer to one or more entities and thus reference may be made interchangeably to the singular or plural form of an entity in discussion.
FIG. 1 is an illustration of an environment in an example implementation that is operable to employ machine learning digital image selection techniques described herein.
FIG. 2 depicts a system in an example implementation that generates training data to train a machine learning model.
Fig. 3 is a flow diagram depicting a procedure in an example implementation in which an exploration/exploitation technique is used to generate training data and a machine learning model is trained to select a digital image from a plurality of digital images of an object to be used to generate digital content.
FIG. 4 depicts a system in an example implementation in which the training data of FIG. 2 is used to train a machine learning model.
FIG. 5 is a flow diagram depicting a procedure in an example implementation in which training data formed of interaction events is used, each interaction event comprising: a user profile; image metadata, the image metadata including image features extracted from the digital image; and result data for training the machine learning model.
FIG. 6 depicts a system in an example implementation of selecting a digital image from a plurality of digital images of an object using the machine learning model trained in FIG. 4.
FIG. 7 is a flow diagram depicting a procedure in an example implementation in which digital content having digital images selected by a machine learning model is generated based on a user profile and image metadata.
Fig. 8 illustrates an example system including various components of an example device that may be implemented as any type of computing device as described with reference to fig. 1-7 and/or for implementing embodiments of the techniques described herein.
Detailed Description
Overview
It has been observed that in real-world scenarios, even in instances where the visual characteristics of the object itself remain unchanged, each user has a different affinity for different aspects of the variation in the way the object is depicted in a digital image. However, conventional techniques and systems for selecting a digital image for inclusion as part of digital content do not address such variations, such as a particular view of an object to be included in a web page, a thumbnail to be used to represent a digital video, a model wearing a piece of clothing, and so forth. Instead, conventional techniques rely on a generic approach when processing a particular object. In addition, conventional techniques typically train a dedicated machine learning model for each digital image, and thus cannot take advantage of visual similarity to other digital images and encounter the "cold start" problem described below. This may result in inefficient use of computing and network resources for providing and receiving digital content (e.g., web pages, web-enabled applications, etc.) that includes these digital images.
Thus, digital image selection techniques are described that employ machine learning to select a digital image of an object from a plurality of digital images of the object. The plurality of digital images each capture the object for inclusion as part of generated digital content, such as a web page, a thumbnail for representing a digital video, and the like. However, the plurality of digital images includes variations in the manner in which the object is depicted, and user affinity for these variations may differ from user to user. In one example, a user ID is received by a service provider system as part of a request to obtain digital content (e.g., a web page). A user profile is then obtained by the service provider system based on the user ID. For example, the user profile may describe user interactions with digital content items and digital images, user demographic information, a location from which the digital content request originated, and so forth.
The service provider system then selects a digital image from the plurality of digital images of the object for inclusion as part of the digital content. For example, the plurality of digital images may be located based on an object ID associated with the requested digital content. In this example, the plurality of digital images each capture the object of interest to be represented in the digital content, but differ in at least one visual characteristic, i.e., in the manner in which the object is depicted. For example, the object may be a piece of clothing in a particular color, but worn by different mannequins. Other examples of variations are also contemplated, including background characteristics of the scene capturing the object, different angles, arrangements, orientations, lighting, and so forth.
In one example of digital image selection, the service provider system determines whether to explore or exploit user behavior associated with a user ID in response to a request as part of selecting a digital image. A determination to explore user behavior involves selecting a digital image in order to learn more about user behavior with respect to objects depicted in the digital image, i.e., user preferences with respect to different depictions of objects. A determination to exploit user behavior, on the other hand, is made in order to maximize the likelihood of obtaining a desired result when the object is exposed via a digital image, such as recommending an item of interest, a conversion, and the like.
Thus, when a determination is made to explore user behavior, the service provider system randomly selects a digital image from the plurality of digital images depicting the object. When a determination is made to exploit user behavior, the service provider system selects a digital image from the plurality of digital images based on the user profile using a machine learning model, such as a neural network. In either instance, however, training data is generated based on user interaction with the selected digital images to train and/or update the training of the machine learning model, for example to capture current trends.
For example, the training data may be formed into a plurality of interaction events, at least a portion of which correspond to requests made for digital content. Each interaction event may include the user ID that initiated the request, a user profile associated with the user ID, an image ID of the digital image selected in response to the request, image metadata associated with the digital image, and result data describing the result of exposing the digital image as part of the digital content. For example, the result may describe a conversion, such as whether the digital image was selected (e.g., as a thumbnail that initiates a digital video), whether a purchase resulted for the good or service corresponding to the object depicted in the digital image, and so forth.
Image metadata used as part of the training data and/or for selecting digital images for subsequent requests may support increased functionality over conventional techniques. In conventional systems, because a single model is trained for each image, the image ID is used only to identify the correspondence between the digital image, the user ID, and the resulting result. Thus, the image ID does not describe the visual characteristics extracted directly from the digital image or variations in those visual characteristics, and thus cannot support a determination of the similarity between one digital image and another. Consequently, conventional techniques suffer from the "cold start" problem, and predictions about a digital image are not accurate enough until a sufficient amount of training data is received (typically over a period of weeks). Collecting this data is resource and computationally intensive, and the inaccurate predictions in the meantime lead to user frustration.
However, in the techniques described herein, the image metadata used to train the machine learning model utilizes image features extracted from the corresponding digital images, e.g., as vectors generated by feature extraction using a neural network. In this manner, the image metadata describes content that is visually contained within the digital image with increased accuracy over other techniques, such as image tagging that relies on a user's ability to express and manually tag the content contained in the digital image. By mapping features extracted from a digital image to a feature space using a machine learning model, the visual similarity of the digital image to other digital images can be determined and used to avoid the cold start problem of conventional techniques and to handle variations in how the object is depicted. This improves the operation of, and increases the accuracy of, a computing device implementing these techniques.
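By way of illustration only, and not as part of the described embodiments, the following is a minimal sketch of this kind of feature extraction, assuming Python with PyTorch/torchvision and a pre-trained ResNet-50 backbone; the function and variable names are hypothetical rather than taken from the disclosure. Proximity of the resulting vectors (here, cosine similarity) serves as the measure of visual similarity discussed above.

```python
# Hypothetical sketch: extract image features with a pre-trained CNN and
# compare digital images by proximity in the resulting embedding space.
# Assumes PyTorch and torchvision; all names are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained convolutional neural network with its final classification
# layer removed, so its output is an embedding vector rather than a class.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path: str) -> torch.Tensor:
    """Map a digital image to a fixed-length feature vector (image metadata)."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        return backbone(preprocess(image).unsqueeze(0)).squeeze(0)

def visual_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity: higher values indicate more similar depictions."""
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```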
To map the features and user profiles to the feature space, the machine learning model is trained and updated using the user profiles and image metadata included in the corresponding interaction events in the training data (e.g., image features extracted from the digital image using a neural network). For example, the service provider system may process the user profile and image metadata as part of machine learning along with a loss function using the corresponding result data. In this manner, a single machine learning model is trained to generate a prediction score for each subsequent combination of a user profile and image data extracted from a corresponding digital image of the object. This overcomes the limitations of conventional techniques that generate a dedicated machine learning model for each individual digital image, which do not support similarity between digital images and are thus subject to the cold start problem described previously.
Continuing with the above example, to select a digital image of the object from the plurality of digital images of the object in response to an exploitation determination, the service provider system generates a prediction score for each digital image (e.g., located based on the object ID) using the associated image metadata and the user profile corresponding to the user ID associated with the request. The digital image indicated as most likely to produce the desired result (e.g., conversion) is then selected by the system for inclusion as part of the digital content (e.g., web page).
Digital content (e.g., a web page) is then generated by the service provider system using the digital image (whether randomly selected as part of exploration or selected by the machine learning model as part of exploitation) and is transmitted back to the originator of the request. As a result, the service provider system may select a digital image of the object from among the multiple digital images of the object with an increased likelihood of obtaining a desired result, and may handle the many different ways in which the object may be presented to the user. This cannot be performed by humans alone due to the many differences in user affinity for visual characteristics that humans cannot detect. Further discussion of these and other examples is included in the following sections and shown in the corresponding figures.
Term examples
"digital content" includes any type of data that can be rendered by a computing device. Examples of digital content include web pages, digital video, digital media, digital audio, digital images, user interfaces, and the like.
A "neural network" typically includes a series of layers modeled as connections between neurons having nodes (i.e., neurons) and processing data to obtain outputs, such as classifying inputs as exhibiting or not exhibiting particular characteristics. One example of a neural network is a convolutional neural network.
A "loss function" is a function that maps the values of one or more interpretation variables (e.g., features) to real numbers representing costs associated with events, and in an optimization design, the loss function is minimized in order to train a machine learning model. In classification, for example, the loss function is a penalty for incorrect classification, e.g., whether the result described in the output data does or does not occur.
"exploration/development" is utilized to determine whether to explore or develop user behavior. The determination to explore user behavior involves selecting a digital image to learn more about user behavior with respect to objects depicted in the digital image, e.g., user preferences for the manner in which the objects are depicted. On the other hand, a determination is made to develop user behavior in order to maximize the likelihood of obtaining a desired result when exposed to an object via a digital image, such as recommending an item of interest, a conversion, and the like.
"transformation" may correspond to various actions. Examples of such actions include whether an interaction (e.g., a hover or "click") occurred with the digital image, whether a corresponding product (e.g., object) or service was added to the shopping cart, whether a corresponding product or service was purchased, a selection to launch a thumbnail of a digital video or digital audio, and so forth.
In the following discussion, an example environment is described in which the techniques described herein may be employed. Example processes that can be performed in the example environment, as well as other environments, are also described. Thus, execution of the example processes is not limited to the example environment, and the example environment is not limited to execution of the example processes.
Example Environment
FIG. 1 is an illustration of a digital media environment 100 in an example implementation that is operable to employ machine learning and digital image selection techniques described herein that support variations in the manner in which objects are depicted within digital images. The illustrated environment 100 includes a service provider system 102 and a client device 104 communicatively coupled via a network 106 (e.g., the internet). The computing devices implementing service provider system 102 and client device 104 may be configured in various ways.
For example, the computing device may be configured as a desktop computer, a laptop computer, a mobile device (e.g., assuming a handheld configuration such as a tablet or mobile phone as illustrated for client device 104), and so forth. Thus, computing devices may range from full resource devices with substantial memory and processor resources (e.g., personal computers, game consoles) to low-resource devices with limited memory and/or processing resources (e.g., mobile devices). Additionally, although a single computing device is shown and described in some instances, the computing device may represent multiple different devices, such as multiple servers used by an enterprise to perform "on the cloud" operations as described in fig. 8.
The client device 104 includes a communication module 108 (e.g., a web browser, a web-enabled application, etc.) configured to formulate a request 110 for transmission to a digital content generation system 112 via the network 106. In response, the digital content generation system 112 generates the digital content 114 for transmission to the user interface 116 via the display device 118 of the client device 104 and presentation in the user interface 116. Digital content 114 may be configured in various ways, such as a web page, user interface screen, digital video, and so forth.
As part of generating digital content 114, the digital content generation system employs a digital image selection module 120 to select a digital image for inclusion as part of digital content 114. For example, digital image 122(N) may be selected from a plurality of digital images 122(1), 122(2), …, 122(N) stored by storage device 124, each of the plurality of digital images depicting an object but having a visual difference in the manner in which the object is depicted.
In the illustrated example, digital images 122(1) through 122(N) each include a pair of shoes, but are captured from different perspectives, have different arrangements, and so on. Thus, in this example, the visual characteristics of the object itself remain the same (e.g., color), but the manner in which the object is depicted is different. Other examples are also contemplated, such as differences in the background of the digital image, differences in the object itself, a human model displayed with the object, and so forth. Although the digital image selection module 120 is illustrated as being implemented at the service provider system 102, the functionality of the digital image selection module 120 may also be implemented in whole or in part at the client device 104.
To select digital images, the digital image selection module 120 employs a machine learning module 126 that implements a machine learning model 128 to select the digital image that is most likely to achieve a desired result. For example, the result may include a conversion, such as whether a corresponding good or service associated with the object depicted in the digital image was purchased, whether the digital image was selected (e.g., "clicked" to initiate output of corresponding digital content in a digital video scenario), and so forth.
For example, digital image selection module 120 may receive request 110 for digital content 114 and obtain digital images 122(1) through 122(N) associated with object ID 130 associated with digital content 114. The machine learning model 128 then calculates a probability score. A probability score is calculated for each of the plurality of digital images 122(1) through 122(N) based on the visual characteristics of the digital image and the user profile associated with the user ID received as part of the request. The probability score indicates the probability of obtaining the desired result.
To determine which visual characteristics are included in a respective digital image, image features are extracted from the corresponding digital image. For example, an embedding layer of a neural network (e.g., a convolutional neural network) may be used to extract image features that map a digital image to a lower-dimensional embedding space. In this manner, a single machine learning model 128 may be used for multiple different digital images, and thus the visual similarity between these digital images is taken into account. This is not possible in conventional techniques that employ a separate dedicated model for each digital image.
As a result, the techniques described herein overcome the challenges of conventional techniques and improve the operation of computing devices that implement these techniques. A first such example is known as the counterfactual problem. Data received indicating a user selection (e.g., conversion) of one digital image does not indicate how user interaction would have occurred with other digital images. To avoid this problem, conventional techniques use a separate dedicated machine learning model for each digital image. A problem with this approach is that it is not possible to learn patterns across the digital images, since each digital image is associated with a single machine learning model. This adversely affects the accuracy of the machine learning models and, therefore, the operation of the corresponding computing devices.
However, in the techniques described herein, a system configured to process multiple digital images of an object using a single machine learning model is described. As part of this, image features are extracted from the corresponding digital image using an embedding layer of a neural network to map the image as a vector to a lower dimensional space. In this way, the proximity of the vectors within the embedding space is a measure of the visual similarity between these digital images, and thus, the machine learning model 128 may implement a decision strategy on the digital images of the object, which results in an increased accuracy in making the predictions.
Furthermore, because conventional techniques train a separate machine learning model for each digital image, these conventional techniques are not themselves based on image content, but are based only on image IDs. In contrast, the techniques described herein may be used to train a single machine learning model 128 that learns image features extracted from the plurality of digital images 122(1) through 122 (N). This may be performed, for example, by using a pre-trained convolutional neural network to extract image features as an embedding learned by the last layer of the network. This enables the machine learning model 128 to learn patterns in digital images, which is not possible in conventional techniques.
In addition, conventional techniques suffer from cold start problems. In conventional techniques, when adding new digital images to support personalization of digital content, it may take weeks until the digital images are ready for accurate personalization. This is because a separate machine learning model is trained for each digital image, and thus, when a new digital image is added, it takes approximately two weeks in a real-world implementation to collect enough training data (e.g., "click" data) through the exploration technique for the digital image. This is a significant challenge even when the new digital image is a minor change to an existing digital image.
However, in the techniques described herein, this problem is solved in a number of ways. First, using the embedding layer, the machine learning model 128 maps the image identifier to a low-dimensional vector space. Thus, when a new digital image is added, a few training examples are sufficient to map the digital image to a vector in the embedding space of the machine learning model 128. The patterns learned for digital images that map to vectors close to the vector of the new digital image can then be used by the machine learning model 128 to control distribution, for example as part of digital content, without waiting the weeks required by conventional techniques. Further, using the image metadata, the machine learning model 128 may utilize patterns learned from other digital images having similar image metadata (e.g., shape, color, etc.). Further discussion of these and other examples is included in the following sections and shown in the corresponding figures.
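A hypothetical sketch of this cold-start behavior follows; it builds on the feature-extraction sketch above, again assumes Python with PyTorch, and all names are illustrative assumptions rather than elements of the disclosure. A newly added digital image is given an initial score by borrowing from its nearest neighbors in the embedding space.

```python
# Hypothetical sketch: estimate a prediction score for a newly added digital
# image from the images closest to it in the embedding space, weighting each
# neighbor's known score by its cosine similarity.
from typing import Dict
import torch

def bootstrap_score(new_image_vec: torch.Tensor,
                    known_vecs: Dict[str, torch.Tensor],
                    known_scores: Dict[str, float],
                    k: int = 3) -> float:
    sims = {
        image_id: torch.nn.functional.cosine_similarity(
            new_image_vec, vec, dim=0).item()
        for image_id, vec in known_vecs.items()
    }
    # Keep the k most visually similar digital images.
    top = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(sim for _, sim in top) or 1.0
    return sum(known_scores[image_id] * sim for image_id, sim in top) / total
```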
In general, the functionality, features, and concepts described with respect to the above examples and the following examples may be employed in the context of the example processes described in this section. In addition, the functionality, features, and concepts described with respect to the various figures and examples in this document may be interchanged with one another and are not limited to implementation in the context of a particular figure or process. Moreover, blocks associated with different representative processes and corresponding figures herein may be applied together and/or combined in different ways. Thus, individual functionality, features, and concepts described with respect to different example environments, devices, components, figures, and processes herein may be used in any suitable combination and are not limited to the specific combinations represented by the examples listed in this specification.
Training data generation
Fig. 2 depicts a system 200 in an example implementation that generates training data to train the machine learning model 128. Fig. 3 depicts a procedure 300 in an example implementation, the procedure 300 using exploration/exploitation techniques to generate training data and to train the machine learning model 128 to select a digital image from a plurality of digital images depicting variations of an object to be used to generate digital content.
The following discussion describes techniques that may be implemented with the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, software, or a combination thereof. The process is illustrated as a set of blocks that specify operations performed by one or more devices and is not necessarily limited to the orders shown for performing the operations by the respective blocks. In the sections of the following discussion, reference will be made to fig. 1 to 3.
First, in this example, a request 110 for digital content 114 (e.g., a web page) is received by the digital content generation system 112 of FIG. 1. The communication module 108 is configured as a web browser of the client device 104 to generate a request 110 for a particular web page. The request 110 includes a user ID 202 associated with the user of the client device 104.
The user ID 202 is then passed as input to the profile collection module 204 of the digital image selection module 120. The profile collection module 204 is configured to obtain a user profile 206 from the storage device 208 based on the user ID 202 associated with the request 110 for digital content that includes a digital image of an object (block 302). The storage device 208 may be maintained locally at the service provider system 102 and/or remotely by a third-party system or client device 104. The user profile 206 is configured to describe user interactions with digital images, including which digital images were exposed to a corresponding user ID and the results of the exposure, e.g., conversion. The user profile 206 may also describe characteristics of the corresponding user, such as demographic data (e.g., age, gender) and other information related to the user ID 202, e.g., corresponding geographic location, IP address, etc.
The exploration/exploitation determination module 210 is then utilized to determine whether to explore or exploit the user behavior associated with the user ID 202 in response to the request 110 (block 304). A determination to explore user behavior involves selecting a digital image to learn more about user behavior with respect to objects depicted in the digital image, e.g., user preferences for the manner in which the objects are depicted. A determination to exploit user behavior, on the other hand, is made in order to maximize the likelihood of obtaining a desired result when the object is exposed via a digital image, such as recommending an item of interest, a conversion, and the like.
To make the determination in the illustrated example, the Epsilon-Greedy module 212 is employed by the exploration/exploitation determination module 210. For example, Epsilon may be defined as a value between zero and one, e.g., 0.1. The value indicates the percentage of user IDs and associated user behaviors to explore and, thus, the remaining percentage of user IDs and associated user behaviors to exploit. The value of Epsilon may be user specified, specified automatically without user intervention based on heuristics, and so on. The tradeoff between exploration and exploitation allows training data to be generated by the digital image selection module 120 that captures new trends in user behavior and thus remains accurate and up-to-date. Other techniques may also be employed by the exploration/exploitation determination module 210 to make the determination.
Accordingly, in response to an exploration determination, the exploration module 214 employs the random image selection module 216 to randomly select a digital image from a plurality of digital images that differ from one another in the manner in which the object is portrayed (block 306). In response to an exploitation determination, on the other hand, the exploitation module 218 is operable to select a digital image, using the machine learning model 128, from the plurality of digital images having variations in the manner in which the object is depicted (block 308). Further discussion of the operation of the machine learning model 128 for selecting digital images of objects is described in the following discussion with reference to FIGS. 6 and 7.
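The following is a minimal, hypothetical Python sketch of the Epsilon-Greedy decision described above; the names are illustrative assumptions rather than elements of the disclosure, and the epsilon value of 0.1 mirrors the example given earlier.

```python
# Hypothetical sketch: choose between exploration (random selection) and
# exploitation (model-based selection) for a single request.
import random
from typing import Callable, Sequence

def select_image(candidates: Sequence[str],
                 predict_score: Callable[[str], float],
                 epsilon: float = 0.1) -> str:
    """Return the image ID to include in the generated digital content."""
    if random.random() < epsilon:
        # Explore: learn more about user preferences for how the object is depicted.
        return random.choice(list(candidates))
    # Exploit: maximize the likelihood of the desired result (e.g., conversion).
    return max(candidates, key=predict_score)
```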
The selected digital image is included as part of the digital content, and the result of user interaction with the digital image is communicated to the training data generation module 220. For example, the selected digital image may capture an object that is an item for sale and be included as part of a web page. The result is thus whether a conversion occurred, which is communicated to the training data generation module 220. As previously described, a conversion can correspond to various actions, such as whether an interaction (e.g., a hover or "click") occurred with the digital image, whether a corresponding product (e.g., object) or service was added to a shopping cart, whether a corresponding product or service was purchased, and so forth. Results other than conversions are contemplated without departing from the spirit and scope of the present subject matter, such as initiating a corresponding digital video by selecting the digital image that represents the video.
Training data 222 (illustrated as being stored in storage 224) that can be used to train machine learning model 128 is then generated using training data generation module 220. To this end, training data generation module 220 generates interaction event 226 to correspond to request 110. The interaction event 226 includes the user ID 202, the user profile 206 associated with the user ID 202, the image ID 228 of the selected digital image, image metadata 230, and result data 232 describing the result of including the selected digital image as part of the digital content (block 310). For example, the result data 232 may describe whether the result did or did not occur, such as a conversion or other action.
As previously described, the image metadata 230 may include features 234 extracted from the selected digital image using machine learning. This may be stored as part of training data 222 at the time of generation, or later as part of training data 222 by taking a digital image corresponding to image ID 228 and processing the image using feature extraction as described above. The image metadata 230 may also include object metadata 236. The object metadata 236 includes information about the objects captured by the digital images (e.g., product category, description, color, size, image tag, etc.), which may be obtained from text (e.g., title, tag, description) or elsewhere associated with the respective digital images. As further described in the following sections, the training data 222 is then used to generate a machine learning model (block 312).
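For illustration, a single interaction event 226 might be represented along the following lines; this is a hypothetical Python sketch, and the field names are assumptions rather than terms used in the disclosure.

```python
# Hypothetical sketch of one interaction event in the training data: user ID,
# user profile, image ID, image metadata (extracted features plus object
# metadata), and result data describing whether the desired result occurred.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InteractionEvent:
    user_id: str
    user_profile: Dict[str, str]     # e.g., demographics, location
    image_id: str
    image_features: List[float]      # features extracted via a CNN embedding
    object_metadata: Dict[str, str]  # e.g., product category, color, size
    converted: bool                  # result data: did the result occur?

# A collection of such events forms the training data for the single model.
training_data: List[InteractionEvent] = []
```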
Machine learning model training
FIG. 4 depicts a system 400 in an example implementation for training the machine learning model 128 using the training data of FIG. 2. FIG. 5 depicts a procedure 500 in an example implementation in which training data with interactivity events is used, the training data for interactivity events comprising: a user profile; image metadata, the image metadata including image features extracted from the digital image; and result data for training the machine learning model 128.
The following discussion describes techniques that may be implemented with the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, software, or a combination thereof. The process is illustrated as a set of blocks that specify operations performed by one or more devices and is not necessarily limited to the orders shown for performing the operations by the respective blocks. In the sections of the following discussion, reference will be made to fig. 4 to 5.
The example discussion continues from the previous section, thus beginning with the receipt of training data 222 by the machine learning module 126 of the digital image selection module 120. The image metadata 230 in the training data 222 may also include object metadata 236 having information about the object captured by the digital image, such as product category, description, color, size, image label, and the like.
Based on the observed user interaction with the digital images of the object, training data 222 as described in the previous example is collected as a plurality of interaction events 226. As part thereof, each of the interaction events 226 in the training data 222 includes the user profile 206 associated with the user ID 202, image metadata 230 having image features 234 extracted from respective ones of the digital images using machine learning, and result data 232 describing a result of including the selected digital image as part of the digital content (block 502). The object metadata 236 may thus be specific to the interaction event 226 (e.g., how the object is captured in the digital image) and/or common to multiple digital images, such as a color that is shared between digital images.
The machine learning module 126 is then employed to generate a machine learning model 128, the machine learning model 128 being trained based on the object metadata 236, the user profile 206, the image metadata 230, and a loss function 402 based on the result data 232 (block 504). The loss function 402 is a function that maps values of one or more interpretation variables (e.g., features) to real numbers representing costs associated with events, and in an optimization design, the loss function 402 is minimized in order to train the machine learning model 128. In classification, for example, the loss function 402 is a penalty for incorrect classification, such as whether the result described in the result data 232 did or did not occur.
Accordingly, the object metadata 236, the user profile 206, and the image features 234 are processed by the embedding layer 404 of the machine learning model 128 to generate a training prediction 406, for example, for each of the interaction events 226. The training prediction is used as part of the loss function 402 along with the result data 232 to back propagate the results of the comparison of the training prediction 406 with the result data 232 to set parameter values (e.g., neurons and corresponding connections within a neural network) within the machine learning model 128 to train the machine learning model 128.
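The following is a hypothetical PyTorch sketch of such a training step, assuming a binary result (e.g., conversion or no conversion) and a binary cross-entropy loss; the architecture, dimensions, and names are illustrative assumptions rather than elements of the disclosure.

```python
# Hypothetical sketch: embed profile and image features, produce a training
# prediction, compare it with the result data via a loss function, and
# back-propagate to update the model parameters.
import torch
import torch.nn as nn

class SelectionModel(nn.Module):
    def __init__(self, profile_dim: int, image_dim: int, hidden: int = 64):
        super().__init__()
        self.profile_embed = nn.Linear(profile_dim, hidden)
        self.image_embed = nn.Linear(image_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, profile, image_features):
        z = torch.cat([self.profile_embed(profile),
                       self.image_embed(image_features)], dim=-1)
        return self.head(z).squeeze(-1)  # prediction score in [0, 1]

model = SelectionModel(profile_dim=16, image_dim=2048)
loss_fn = nn.BCELoss()  # loss over the observed result data
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(profile, image_features, result):
    """profile/image_features: float tensors; result: float tensor of 0s and 1s."""
    optimizer.zero_grad()
    prediction = model(profile, image_features)  # training prediction
    loss = loss_fn(prediction, result)           # penalty for incorrect prediction
    loss.backward()                              # back-propagate the comparison
    optimizer.step()
    return loss.item()
```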
In this manner, the machine learning model 128 learns an embedding space for different images of the object, which can be used to determine similarities between digital images, thus addressing conventional cold start and counterfactual challenges as previously described. This training may be performed to initially generate the machine learning model 128 and to generate updated versions of the machine learning model 128, for example to capture trends in changes in user behavior with respect to the manner in which objects are represented in the digital images. The digital images may then be selected using the generated machine learning model 128, as further described in the following sections.
Digital image selection using machine learning models
FIG. 6 depicts a system 600 in an example implementation for selecting a digital image from a plurality of digital images depicting changes in an object using the machine learning model 128 trained in FIG. 4. Fig. 7 depicts a procedure 700 in an example implementation of generating digital content 114 having digital images selected by the machine learning model 128 based on the user profiles 206 and the image metadata 230.
The following discussion describes techniques that may be implemented with the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, software, or a combination thereof. The process is illustrated as a set of blocks that specify operations performed by one or more devices and is not necessarily limited to the orders shown for performing the operations by the respective blocks. In the sections of the following discussion, reference will be made to FIGS. 6 to 7.
In this example, the digital content 114 is generated using a machine learning model 128 that is trained as described in the previous section. First, a user ID 202 associated with a request 110 for digital content 114 including a digital image of an object is received (block 702). For example, the digital content may be configured as a web page and the digital image included as part of the web page. Other examples are also contemplated, such as thumbnails for representing digital videos.
In response, the user profile 206 associated with the user ID 202 is obtained by the profile collection module 204 from a storage device 208 (block 704), which may be local to the service provider system 102 or remote from the service provider system 102. The user profile 206 describes various characteristics associated with the user ID 202. These may include characteristics of the associated user, such as demographic information (e.g., age and gender), characteristics of how access is obtained by the user ID 202 (e.g., device type, network connection), location, and so forth. The user profile 206 may also describe past user interactions with corresponding digital images, such as the results of interactions with the digital images.
A plurality of digital images associated with the object ID 130 is also obtained (block 706), the digital images differing from one another in the manner in which the object is depicted. For example, the image collection module 602 may locate the object ID 130, the object ID 130 corresponding to an item of digital content to be generated. With continued reference to the previous example, the digital content may be configured as a web page having a portion for depicting an object, such as a product or service for sale in an e-commerce website. Thus, the web page includes an object ID 130 that is associated with the digital images 604 depicting the object. A selection is then made as to which digital image of the plurality of digital images 604 is to be included in the web page. In this way, in this example, the selection is made based only on the object, and not on the digital content as a whole, and thus the prediction is formed with increased accuracy because it is not skewed by which other content is included in the digital content.
To do so, the user profile 206 and digital images 604 are passed to the machine learning module 126. The machine learning module 126 is then configured to generate a plurality of prediction scores 606 for the plurality of digital images 604. Each prediction score is generated by the machine learning model 128 based on the user profile 206 and features extracted from the respective digital image of the plurality of digital images 604 (block 708). For example, the machine learning model 128 may include an embedding layer 404 to generate image metadata having image features extracted from each of the digital images 604. These image features are processed by the machine learning model 128 using machine learning along with the user profile to generate a prediction score 606 for each of the digital images 604. The prediction score indicates a probability (e.g., between zero and one) that the corresponding result will occur if the selected digital image is included as part of the digital content 114. For example, the prediction score 606 may indicate a likelihood of conversion, e.g., selection of the digital image to initiate a corresponding digital video, initiation of a purchase of a good or service corresponding to the object in the digital image, and so forth.
The prediction score 606 is then passed as input by the machine learning module 126 to the prediction selection module 608. The prediction selection module 608 is configured to select a digital image from the plurality of digital images 604 based on the plurality of prediction scores (block 710). For example, the prediction selection module 608 may select digital images that are most likely to achieve a desired result (e.g., conversion) based on the prediction scores 606. The prediction 610 is then passed to a digital content generation module 612 to generate the digital content 114 with the selected digital image 604(n) including the object (block 712), for example to include the digital image 604(n) as part of a web page.
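By way of illustration, exploitation-time selection might look like the following hypothetical Python sketch, which reuses the model and feature vectors from the earlier sketches; the names are illustrative assumptions.

```python
# Hypothetical sketch: score every candidate digital image of the object for
# the requesting user profile and keep the one most likely to produce the
# desired result.
import torch

def choose_digital_image(model, profile_vec, candidates):
    """candidates: dict mapping image_id -> extracted feature vector."""
    with torch.no_grad():
        scores = {
            image_id: model(profile_vec, features).item()
            for image_id, features in candidates.items()
        }
    # Select the digital image with the highest prediction score.
    return max(scores, key=scores.get)
```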
In this manner, the techniques described herein overcome the challenges, limitations, and computational inefficiencies of conventional techniques. This includes addressing the counterfactual problem, addressing the cold start problem (and thereby avoiding the weeks of data collection and computing resources required by conventional techniques), and processing the image content itself to learn patterns across digital images, which is not possible with conventional techniques.
Example systems and devices
Fig. 8 illustrates an example system, generally at 800, that includes an example computing device 802 that represents one or more computing systems and/or devices that may implement the various techniques described herein. This is illustrated by the inclusion of a digital image selection module 120. Computing device 802 may be, for example, a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computing device or computing system.
The example computing device 802 as illustrated includes a processing system 804, one or more computer-readable media 806, and one or more I/O interfaces 808 communicatively coupled to each other. Although not shown, the computing device 802 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control lines and data lines.
Processing system 804 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 804 is illustrated as including hardware elements 810 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. Hardware elements 810 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In this context, processor-executable instructions may be electronically-executable instructions.
The computer-readable storage medium 806 is illustrated as including memory/storage 812. Memory/storage 812 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage component 812 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage component 812 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 806 may be configured in a variety of other ways, as described further below.
The input/output interface(s) 808 are representative of functionality that allows a user to enter commands and information into the computing device 802, and that also allows information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone, a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touches), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to recognize movement as gestures that do not involve touch), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 802 may be configured in various ways as further described below to support user interaction.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 802. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" may refer to media and/or devices that enable persistent and/or non-transitory storage of information as compared to mere signal transmission, carrier waves, or signals per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware (such as volatile and non-volatile media, removable and non-removable media) and/or storage devices implemented in methods or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture suitable for storing the desired information and which may be accessed by a computer.
"computer-readable signal medium" may refer to a signal-bearing medium configured to transmit instructions to hardware of computing device 802, such as via a network. Signal media may typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, data signal, or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection; and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, the hardware element 810 and the computer-readable medium 806 represent modules, programmable device logic, and/or fixed device logic implemented in hardware that may be employed in some embodiments to implement at least some aspects of the techniques described herein, such as to execute one or more instructions. The hardware may include integrated circuits or systems on a chip, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), components of Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware. In this context, hardware may operate as a processing device that performs program tasks defined by instructions and/or logic embodied by the hardware, as well as hardware utilized to store instructions for execution, e.g., the computer-readable storage media described previously.
Combinations of the foregoing may also be used to implement the various techniques described herein. Thus, software, hardware, or executable modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 810. Computing device 802 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Accordingly, implementation of the modules as software executable by the computing device 802 may be achieved, at least in part, in hardware (e.g., through use of computer-readable storage media and/or hardware elements 810 of the processing system 804). The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 802 and/or processing systems 804) to implement the techniques, modules, and examples described herein.
The techniques described herein may be supported by various configurations of the computing device 802 and are not limited to specific examples of the techniques described herein. The functionality may also be implemented, in whole or in part, through the use of a distributed system, such as through the "cloud" 814 via a platform 816 as described below.
Cloud 814 includes and/or is representative of platform 816 for resources 818. The platform 816 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 814. The resources 818 can include applications and/or data that can be utilized when performing computer processing on servers remote from the computing device 802. The resources 818 may also include services provided over the internet and/or over a subscriber network, such as a cellular network or a Wi-Fi network.
The platform 816 may abstract resources and functionality to connect the computing device 802 with other computing devices. The platform 816 may also be used to abstract resource scaling to provide a corresponding level of scaling to meet the demand for resources 818 that are implemented via the platform 816. Thus, in interconnected device embodiments, implementation of functionality described herein may be distributed throughout the system 800. For example, the functionality may be implemented in part on the computing device 802, as well as via the platform 816 that abstracts the functionality of the cloud 814.
Conclusion
Although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed invention.

Claims (20)

1. In a digital media digital content generation environment that processes object variations as part of digital image selection, a method implemented by at least one computing device, the method comprising:
receiving, by the at least one computing device, a user ID associated with the request for digital content;
obtaining, by the at least one computing device, a plurality of digital images having variations in a manner in which an object is depicted, and a user profile associated with the user ID;
generating, by the at least one computing device, a plurality of prediction scores for the plurality of digital images, each prediction score generated by a machine learning model based on the user profile and image metadata, the image metadata comprising features extracted from respective digital images of the plurality of digital images;
selecting, by the at least one computing device, a digital image of the plurality of digital images based on the plurality of prediction scores; and
generating, by the at least one computing device, the digital content as having the selected digital image depicting the object.
2. The method of claim 1, wherein the generating comprises: extracting the features from the respective digital images as an embedding using a convolutional neural network.
3. The method of claim 1, wherein the image metadata further describes characteristics of the object, including a product category or an object description from text associated with the respective digital image.
4. The method of claim 1, wherein the user profile has user information including demographic information or location information.
5. The method of claim 1, wherein a prediction score of the plurality of prediction scores is indicative of a probability of obtaining a result resulting from including the respective digital image as part of the digital content.
6. The method of claim 5, wherein the result is a conversion.
7. The method of claim 1, wherein the digital content is a web page and the plurality of digital images include respective differences in the manner in which the object is depicted.
8. The method of claim 7, wherein the object is an article of clothing and the respective differences involve a mannequin wearing the article of clothing.
9. The method of claim 1, wherein the digital content is a digital video and the selected digital image is configured as a thumbnail that is selectable to initiate output of the digital video.
10. The method of claim 1, wherein the machine learning model is a single convolutional neural network trained using a plurality of training digital images, and the plurality of prediction scores are generated from the plurality of digital images using the single convolutional neural network.
11. A system in a digital media machine learning model training environment that processes object variations as part of digital image selection, the system comprising:
an exploration/exploitation determination module implemented at least in part in hardware of the computing device to make an exploration or exploitation determination to explore or exploit user behavior associated with a user ID in response to a request for digital content;
an exploration module implemented at least in part in hardware of the computing device to randomly select a digital image from a plurality of digital images in response to the exploration determination, the plurality of digital images depicting variations, one to another, of an object;
an exploitation module implemented at least in part in hardware of the computing device to select a digital image from the plurality of digital images based on a machine learning model in response to the exploitation determination, the plurality of digital images depicting the variations of the object;
a training data generation module implemented at least in part in hardware of the computing device to generate, as part of training data, for each of the requests, an interaction event comprising a user profile associated with the user ID, result data describing a result of including the selected digital image as part of the digital content, and image metadata having features extracted from the selected digital image using machine learning; and
a machine learning module implemented at least in part in hardware of the computing device to generate a machine learning model using the training data.
12. The system of claim 11, wherein the features are extracted from the selected digital image using a convolutional neural network.
13. The system of claim 11, wherein the training data further describes characteristics of the object, the characteristics including a product category or an object description from text associated with the respective digital image.
14. The system of claim 11, further comprising: a profile collection module implemented at least in part in hardware of the computing device to obtain the user profile based on the user ID associated with the request for the digital content.
15. The system of claim 11, further comprising: an image collection module implemented at least in part in hardware of the computing device to obtain the plurality of digital images based on an object ID associated with the digital content.
16. The system of claim 11, wherein the exploration/exploitation determination module employs an epsilon-greedy exploration technique.
17. A system in a digital media machine learning model training environment, the system comprising:
means for receiving training data, the training data comprising:
object metadata describing an object included in the plurality of digital images; and
a plurality of interaction events, each interaction event of the plurality of interaction events comprising result data, a user profile, and image metadata having features extracted from a respective digital image of the plurality of digital images using machine learning; and
means for generating a machine learning model trained using machine learning based on the object metadata, user profile and image metadata and a loss function based on the result data.
18. The system of claim 17, wherein the features are extracted as an embedding from the respective digital image using a convolutional neural network.
19. The system of claim 17, wherein the image metadata further describes characteristics of the object, including a product category or an object description from text associated with the respective digital image.
20. The system of claim 17, wherein the user profile has user information including demographic information or location information.
CN202011260870.4A 2020-01-28 2020-11-12 Machine learning for digital image selection across object variations Pending CN113255713A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/774,681 US11397764B2 (en) 2020-01-28 2020-01-28 Machine learning for digital image selection across object variations
US16/774,681 2020-01-28

Publications (1)

Publication Number Publication Date
CN113255713A true CN113255713A (en) 2021-08-13

Family

ID=74099633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011260870.4A Pending CN113255713A (en) 2020-01-28 2020-11-12 Machine learning for digital image selection across object variations

Country Status (5)

Country Link
US (2) US11397764B2 (en)
CN (1) CN113255713A (en)
AU (1) AU2020273315A1 (en)
DE (1) DE102020007191A1 (en)
GB (1) GB2591583A (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11734592B2 (en) 2014-06-09 2023-08-22 Tecnotree Technologies, Inc. Development environment for cognitive information processing system
US11645620B2 (en) 2019-03-15 2023-05-09 Tecnotree Technologies, Inc. Framework for explainability with recourse of black-box trained classifiers and assessment of fairness and robustness of black-box trained classifiers
CN110647826B (en) * 2019-09-05 2022-04-29 北京百度网讯科技有限公司 Method and device for acquiring commodity training picture, computer equipment and storage medium
US11397764B2 (en) 2020-01-28 2022-07-26 Adobe Inc. Machine learning for digital image selection across object variations
US11715151B2 (en) * 2020-01-31 2023-08-01 Walmart Apollo, Llc Systems and methods for retraining of machine learned systems
US11758243B2 (en) * 2021-11-24 2023-09-12 Disney Enterprises, Inc. Automated generation of personalized content thumbnails
US11922550B1 (en) 2023-03-30 2024-03-05 OpenAI Opco, LLC Systems and methods for hierarchical text-conditional image generation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2798054A1 (en) * 2010-05-04 2011-11-10 Level 3 Communications, Llc Dynamic binding for use in content distribution
US20180157499A1 (en) * 2016-12-05 2018-06-07 Facebook, Inc. Customizing content based on predicted user preferences
US20190026609A1 (en) * 2017-07-24 2019-01-24 Adobe Systems Incorporated Personalized Digital Image Aesthetics in a Digital Medium Environment
US20190114151A1 (en) * 2017-10-16 2019-04-18 Adobe Systems Incorporated Application Digital Content Control using an Embedded Machine Learning Module
WO2019133862A1 (en) * 2017-12-29 2019-07-04 Ebay Inc. Computer vision for unsuccessful queries and iterative search
US20190377987A1 (en) * 2018-06-10 2019-12-12 Adobe Inc. Discriminative Caption Generation

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform
WO2017212459A1 (en) * 2016-06-09 2017-12-14 Sentient Technologies (Barbados) Limited Content embedding using deep metric learning algorithms
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
US11100400B2 (en) 2018-02-15 2021-08-24 Adobe Inc. Generating visually-aware item recommendations using a personalized preference ranking network
US20210134387A1 (en) * 2018-09-11 2021-05-06 Ancestry.Com Dna, Llc Ancestry inference based on convolutional neural network
US20210264161A1 (en) * 2018-12-28 2021-08-26 Vizit Labs, Inc. Systems and methods for image or video performance heat map generation
US20210064965A1 (en) * 2019-08-26 2021-03-04 Nvidia Corporation Content recommendations using one or more neural networks
US11397764B2 (en) 2020-01-28 2022-07-26 Adobe Inc. Machine learning for digital image selection across object variations

Also Published As

Publication number Publication date
GB202018709D0 (en) 2021-01-13
US11921777B2 (en) 2024-03-05
AU2020273315A1 (en) 2021-08-12
US20210232621A1 (en) 2021-07-29
US20220253478A1 (en) 2022-08-11
US11397764B2 (en) 2022-07-26
DE102020007191A1 (en) 2021-07-29
GB2591583A (en) 2021-08-04

Similar Documents

Publication Publication Date Title
US11636524B2 (en) Computer vision, user segment, and missing item determination
CN113255713A (en) Machine learning for digital image selection across object variations
US10755128B2 (en) Scene and user-input context aided visual search
US11100054B2 (en) Digital image suitability determination to generate AR/VR digital content
US11188831B2 (en) Artificial intelligence system for real-time visual feedback-based refinement of query results
US9607010B1 (en) Techniques for shape-based search of content
US10133951B1 (en) Fusion of bounding regions
KR102123780B1 (en) Automatic-guided image capturing and presentation
US20200311126A1 (en) Methods to present search keywords for image-based queries
JP6681342B2 (en) Behavioral event measurement system and related method
US11127074B2 (en) Recommendations based on object detected in an image
US10380461B1 (en) Object recognition
CN108205581B (en) Generating compact video feature representations in a digital media environment
US20200342320A1 (en) Non-binary gender filter
KR102474047B1 (en) Gather attention for potential listings in photos or videos
US11605176B2 (en) Retrieving images that correspond to a target body type and pose
EP4315011A1 (en) Web browser extension for linking images to webpages
US11468675B1 (en) Techniques for identifying objects from video content
US20170013309A1 (en) System and method for product placement
US20230177580A1 (en) Design-Aware Image Search
KR102563125B1 (en) Apparatus and method for providing lowest price information
WO2023062668A1 (en) Information processing device, information processing method, information processing system, and program
EP4220577A1 (en) Prospective object search techniques based on removed objects
CN117217851A (en) Commodity recommendation method and device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination