
Virtual try-all: Visualizing any product in any personal setting


A way for online shoppers to virtually try out products is a sought-after technology that can create a more immersive shopping experience. Examples include realistically draping clothes on an image of the shopper or inserting pieces of furniture into images of the shopper’s living space.


In the clothing category, this problem is traditionally known as virtual try-on; we call the more general problem, which targets any category of product in any personal setting, the virtual try-all problem.

In a paper we recently posted to arXiv, we presented a solution to the virtual-try-all problem called Diffuse-to-Choose (DTC). Diffuse-to-Choose is a novel generative-AI model that allows users to seamlessly insert any product at any location in any scene.

The customer starts with a personal scene image and a product and draws a mask in the scene to tell the model where to insert the object. The model then integrates the item into the scene, with realistic angles, lighting, shadows, and so on. If necessary, the model infers new perspectives on the item, and it preserves the item’s fine-grained visual-identity details.

Diffuse-to-Choose

New “virtual try-all” method works with any product, in any personal setting, and enables precise control of image regions to be modified.

The Diffuse-to-Choose model has a number of characteristics that set it apart from existing work on related problems. First, it is the first model to address the virtual-try-all problem, as opposed to the virtual-try-on problem: it is a single model that works across a wide range of product categories. Second, it doesn’t require 3-D models or multiple views of the product, just a single 2-D reference image. Nor does it require sanitized, white-background, or professional-studio-grade images: it works with “in the wild” images, such as regular cellphone pictures. Finally, it is fast, cost-effective, and scalable, generating an image in approximately 6.4 seconds on a single AWS g5.xlarge instance (an NVIDIA A10G GPU with 24 GB of memory).

Under the hood, Diffuse-to-Choose is an inpainting latent-diffusion model, with architectural enhancements that allow it to preserve products’ fine-grained visual details. A diffusion model is one that’s incrementally trained to denoise increasingly noisy inputs, and a latent-diffusion model is one in which the denoising happens in the model’s representation (latent) space. Inpainting is a technique in which part of an image is masked, and the model is trained to fill in (“inpaint”) the masked region with a realistic reconstruction, sometimes guided by a text prompt or a reference image.
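As a rough illustration of how such a model learns, the sketch below shows one training step of a latent-diffusion inpainting model in PyTorch. The `encoder`, `unet`, and `scheduler` objects are hypothetical stand-ins for a VAE encoder, a denoising U-Net, and a noise scheduler; this is a generic sketch, not Diffuse-to-Choose’s actual training code.

```python
import torch
import torch.nn.functional as F

def inpainting_diffusion_step(encoder, unet, scheduler, image, mask):
    """One illustrative training step for a latent-diffusion inpainting model.

    `encoder`, `unet`, and `scheduler` are hypothetical stand-ins; `mask` is a
    float tensor of shape (B, 1, H, W) marking the region to be filled in.
    """
    # Work in latent space: encode the image once.
    latents = encoder(image)                                    # (B, C, h, w)
    latent_mask = F.interpolate(mask, size=latents.shape[-2:])  # resize mask to latent grid

    # Forward (noising) process: add noise at a random timestep.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_timesteps, (latents.shape[0],),
                      device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # Condition the U-Net on the masked latents and the mask itself,
    # so it learns to fill in ("inpaint") only the masked region.
    masked_latents = latents * (1 - latent_mask)
    unet_input = torch.cat([noisy_latents, masked_latents, latent_mask], dim=1)

    # The U-Net predicts the added noise; the training loss is a simple MSE.
    noise_pred = unet(unet_input, t)
    return F.mse_loss(noise_pred, noise)
```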

Diffuse-to-Choose allows customers to control virtual-try-on features such as sleeve length and whether shirts are worn tucked or untucked, simply by specifying the region of the image to be modified.

Like most inpainting models, DTC uses an encoder-decoder model known as a U-Net to do the diffusion modeling. The U-Net’s encoder consists of a convolutional neural network, which divides the input image into small blocks of pixels and applies a battery of filters to each block, looking for particular image features. Each layer of the encoder steps down the resolution of the image representation; the decoder steps the resolution back up. (The U-shaped curve describing the resolution of the representation over successive layers gives the network its name.)
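The toy PyTorch module below illustrates the shape described above: each convolutional encoder stage halves the spatial resolution, the decoder steps it back up, and skip connections join stages of matching resolution. It is a minimal sketch for intuition, not the production architecture.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net: downsampling encoder, upsampling decoder, skip connections."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(ch * 4, ch * 2, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(ch * 4, ch * 2, 3, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(ch, 3, 1)

    def forward(self, x):
        e1 = self.enc1(x)   # full resolution
        e2 = self.enc2(e1)  # 1/2 resolution
        e3 = self.enc3(e2)  # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # back to 1/2, with skip
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # back to full, with skip
        return self.out(d1)
```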


Our main innovation is to introduce a secondary U-Net encoder into the diffusion process. The input to this encoder is a rough copy-paste collage in which the product image, resized to match the scale of the background scene, has been inserted into the mask created by the customer. It’s a very crude approximation of the desired output, but the idea is that the encoding will preserve fine-grained details of the product image, which the final image reconstruction will incorporate.
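As a sketch of the collage idea, the snippet below pastes a resized product image into the masked region of the scene using PIL, assuming for simplicity that the mask is given as a bounding box; the actual preprocessing pipeline may differ.

```python
from PIL import Image

def make_hint_collage(scene: Image.Image, product: Image.Image,
                      mask_box: tuple[int, int, int, int]) -> Image.Image:
    """Paste a resized product reference into the customer-drawn mask region.

    `mask_box` is assumed to be (left, top, right, bottom) in scene coordinates;
    this is an illustrative approximation, not the paper's exact preprocessing.
    """
    left, top, right, bottom = mask_box
    collage = scene.copy()
    # Rescale the product reference to roughly match the masked region's size.
    resized = product.resize((right - left, bottom - top))
    collage.paste(resized, (left, top))
    return collage  # crude approximation that preserves fine-grained product details
```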

We call the secondary encoder’s output a “hint signal”. Both it and the output of the primary U-Net’s encoder pass to a feature-wise linear-modulation (FiLM) module, which aligns the features of the two encodings. Then the encodings pass to the U-Net decoder.
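Feature-wise linear modulation conditions one feature map on another by scaling and shifting each channel with parameters predicted from the conditioning signal. The minimal PyTorch module below is an illustration of that idea, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift each channel of the main
    features using parameters predicted from the conditioning (hint) features."""
    def __init__(self, main_channels: int, hint_channels: int):
        super().__init__()
        # Predict a per-channel (gamma, beta) pair from the hint signal.
        self.to_scale_shift = nn.Conv2d(hint_channels, 2 * main_channels, kernel_size=1)

    def forward(self, main_feats: torch.Tensor, hint_feats: torch.Tensor) -> torch.Tensor:
        # hint_feats is assumed to be spatially aligned with main_feats.
        gamma, beta = self.to_scale_shift(hint_feats).chunk(2, dim=1)
        return gamma * main_feats + beta  # modulated features, passed on to the decoder
```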

The Diffuse-to-Choose (DTC) architecture, with sample input and output. The main difference between DTC and a typical inpainting diffusion model is the second U-Net encoder that produces a “hint signal” that carries additional information about details of the product image.

We trained Diffuse-to-Choose on AWS p4d.24xlarge instances (with NVIDIA A100 40GB GPUs), with a dataset of a few million pairs of public images. In experiments, we compared its performance on the virtual-try-all task to those of four different versions of a traditional image-conditioned inpainting model, and we compared it to the state-of-the-art model on the more-specialized virtual-try-on task.


In addition to human-based qualitative evaluation of similarity and semantic blending, we used two quantitative metrics to assess performance: CLIP (contrastive language-image pretraining) score and the Fréchet inception distance (FID), which measures the realism and diversity of generated images. On the virtual-try-all task, DTC outperformed all four image-conditioned inpainting baselines on both metrics, with a margin of 9% in FID over the best-performing baseline.
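For readers who want to reproduce the FID side of such an evaluation, the snippet below is a minimal sketch using torchmetrics’ FrechetInceptionDistance (assuming torchmetrics is installed with its image extras); it is not the paper’s evaluation code, and the random tensors merely stand in for large sets of real and generated images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Minimal FID sketch: lower scores mean generated images are statistically
# closer to real ones. In a real evaluation, thousands of real and generated
# images would replace these random placeholder tensors.
fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)       # placeholder
generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # placeholder

fid.update(real_images, real=True)        # accumulate Inception features of real images
fid.update(generated_images, real=False)  # accumulate features of generated images
print(f"FID: {fid.compute().item():.2f}")
```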

On the virtual-try-on task, DTC was comparable to the baseline: slightly higher in CLIP score (90.14 vs. 90.11) but also slightly higher in FID, where lower is better (5.39 vs. 5.28). Given DTC’s generality, however, performing comparably to a special-purpose model on its specialized task is a substantial achievement. Finally, we demonstrated that DTC’s results are comparable in quality to those of methods that are an order of magnitude more expensive and require few-shot fine-tuning for every product, such as our previous DreamPaint method.


