KDD’19: Learning a Unified Embedding for Visual Search at Pinterest
Recommendation system paper challenge (22/50)
Engineering blog related to this paper: Unifying visual embeddings for visual search at Pinterest
What problem do they solve?
They want to learn a single unified image embedding that can power Pinterest's multiple visual search products.
How do other people solve this problem?
In academia, the best model for graph-based recommendation systems is the Graph Convolutional Network (GCN).
Challenges
Different Applications Have Different Objectives
retrieval (pin and image queries)
ranking (text, pin, user, and image queries)
classification or regression (e.g., near-dup classification, click-through-rate prediction, image type)
upstream multi-modal embedding models (e.g., PinSAGE [19])
Domain shift of camera to Pinterest images:
Flashlight optimizes for browsing relevance within Pinterest catalog images, while Lens optimizes for browsing Pinterest catalog images from camera photos.
Finally, Shop-the-Look optimizes for finding the exact product matching an object in a scene, for shopping.
Embedding Maintenance/Cost/Deprecation
Each improvement to an embedding requires a full back-fill for deployment, which is expensive, particularly because so many downstream systems depend on these embeddings.
Effective Usage of Datasets
Each embedding task used to require a human to carefully curate its own dataset.
Through multi-task learning, they want to minimize the amount of costly human curation while leveraging as much engagement data as possible.
Scalable and Efficient Representation
Pinterest requires an image representation that is cheap to store and computationally efficient for common operations, such as computing distances for image similarity.
Model Architecture
All tasks share a common base network up to the point where the embedding is generated; each task then splits into its own branch. Each task branch is simply a fully connected layer (whose weights are the proxies, with no bias term) followed by a softmax cross-entropy loss.
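A minimal PyTorch sketch of this shared-trunk design, assuming a ResNet-50 backbone; the class name UnifiedEmbeddingModel and the per-task class counts are illustrative, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class UnifiedEmbeddingModel(nn.Module):
    def __init__(self, embed_dim=256, num_classes_per_task=(1000, 5000, 20000)):
        super().__init__()
        # Assumed backbone; the paper's exact base network may differ.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.base = backbone  # common base network shared by all tasks
        # One branch per task: a bias-free FC layer whose weight rows are the proxies.
        self.task_branches = nn.ModuleList(
            nn.Linear(embed_dim, n, bias=False) for n in num_classes_per_task
        )

    def forward(self, images):
        embedding = self.base(images)  # the unified embedding
        return embedding, [branch(embedding) for branch in self.task_branches]

model = UnifiedEmbeddingModel()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 1000, (4,))
embedding, logits = model(images)
loss_task0 = F.cross_entropy(logits[0], labels)  # softmax cross-entropy for one task
```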
Subsampling Module
Given N images in a batch and M target proxies, each with embedding dimension D, the proxy bank is an M×D matrix.
This matrix may not fit in GPU memory as M scales to millions of classes, so they store it in CPU RAM and subsample classes for each batch.
The sampled subset is guaranteed to contain all the ground-truth class labels of the images in the training batch (the label indices are remapped, however, so that every index stays within the bounds of the number of sampled classes). A sketch of this subsampling follows below.
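A rough sketch of the subsampling step, assuming the proxy bank lives in CPU RAM; the helper name subsample_proxies and the uniform sampling of negatives are my assumptions:

```python
import torch

def subsample_proxies(proxy_bank, batch_labels, num_sampled):
    """Sample a subset of proxy rows that always contains the batch's
    ground-truth classes, and remap labels into the sampled index space.
    proxy_bank: (M, D) CPU tensor; batch_labels: (N,) global class ids."""
    positives = batch_labels.unique()
    # Sample the remaining slots uniformly from all other classes.
    mask = torch.ones(proxy_bank.size(0), dtype=torch.bool)
    mask[positives] = False
    negatives = mask.nonzero(as_tuple=True)[0]
    perm = negatives[torch.randperm(negatives.numel())]
    sampled = torch.cat([positives, perm[: num_sampled - positives.numel()]])
    # Remap each global label to its slot within the sampled subset.
    remap = torch.full((proxy_bank.size(0),), -1, dtype=torch.long)
    remap[sampled] = torch.arange(sampled.numel())
    return proxy_bank[sampled], remap[batch_labels]

proxies = torch.randn(1_000_000, 256)        # M x D proxy bank in CPU RAM
labels = torch.randint(0, 1_000_000, (32,))
sampled_proxies, local_labels = subsample_proxies(proxies, labels, num_sampled=4096)
```

Only the sampled rows (a 4096×D slice here) need to be moved to the GPU for the batch's forward and backward pass.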
Binarization Module
They need efficient representations in order to:
(1) decrease cold storage costs (e.g. cost of AWS S3)
(2) bandwidth for downstream consumers (e.g. I/O costs to fully ingest the embeddings into map reduce jobs)
(3) improve the latency of real-time scoring
They find that embeddings learned with the classification-based metric learning approach can be binarized by thresholding at zero with little drop in performance, as sketched below.
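For illustration, binarization amounts to thresholding each embedding dimension at zero and packing the bits: a 256-d fp32 embedding shrinks from 1 KB to 32 bytes, and similarity reduces to a Hamming distance (the dimensions here are illustrative):

```python
import numpy as np
import torch

embeddings = torch.randn(4, 256)        # float embeddings from the model
codes = (embeddings > 0).numpy()        # threshold at zero: one bit per dimension
packed = np.packbits(codes, axis=1)     # 256 bits -> 32 bytes per image

# Similarity between binary codes is a cheap Hamming distance:
# XOR the packed bytes and count the differing bits.
hamming = np.unpackbits(packed[0] ^ packed[1]).sum()
```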
Model Training
They train their model in a multi-tower distributed setup, sharing parameters throughout the network except for the sparse proxy parameters.
Each of the eight nodes has its own full copy of the CPU embedding bank, since sparse parameters cannot currently be distributed by the PyTorch framework.
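A heavily simplified sketch of that setup (launched with e.g. torchrun, one process per GPU), reusing the UnifiedEmbeddingModel sketch above: only the dense base network is wrapped in DistributedDataParallel, while the sparse proxy bank stays an ordinary node-local module.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # one process per GPU
base = UnifiedEmbeddingModel().cuda()        # dense parameters, synchronized
base = DDP(base, device_ids=[torch.cuda.current_device()])
# Full per-node CPU copy of the proxy bank; its sparse parameters are
# deliberately kept outside DDP, since PyTorch cannot distribute them.
proxy_bank = nn.Embedding(1_000_000, 256, sparse=True)
```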
Mini-Batch and Loss for Multi-task
For every mini-batch, they balance a uniform mix of the datasets by sampling a fixed number of images from each; an epoch is defined by the number of iterations needed to pass through the largest dataset.
Each dataset has its own independent tasks, so an image's gradient contributions are ignored for tasks it has no labels for.
The losses from all the tasks are assigned equal weights and summed for backpropagation; with three tasks, for example, each of the three losses gets the same weight. A sketch follows below.
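A small sketch of this equal-weight multi-task loss, assuming a label of -1 marks images with no annotation for a given task; the helper name and the -1 convention are illustrative:

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits_per_task, labels_per_task):
    # Sum equally weighted softmax losses, skipping images that have no
    # label for a task so they contribute no gradient to that branch.
    total = torch.zeros(())
    for logits, labels in zip(logits_per_task, labels_per_task):
        valid = labels >= 0
        if valid.any():
            total = total + F.cross_entropy(logits[valid], labels[valid])
    return total

logits_a = torch.randn(4, 10)
logits_b = torch.randn(4, 7)
labels_a = torch.tensor([3, -1, 7, -1])   # two images lack labels for task A
labels_b = torch.tensor([0, 5, -1, 2])
loss = multitask_loss([logits_a, logits_b], [labels_a, labels_b])
```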
Sparse Tensor Optimization
They represent the sparsely subsampled proxy banks as sparse tensors, which avoids computing expensive dense gradients for all the proxies during backpropagation.
Instead of using momentum, they simply use a higher learning rate: momentum would be very expensive here, because the sparse gradient tensors would be aggregated into dense gradient updates. This decreases their training time by 40% while retaining comparable performance. A sketch follows below.
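A minimal sketch of this optimization, assuming PyTorch's sparse-gradient support in nn.Embedding with plain SGD (momentum must be zero for sparse gradients; the learning rate here is arbitrary):

```python
import torch
import torch.nn as nn

# With sparse=True, only the proxy rows touched in the forward pass receive
# (sparse) gradients; momentum is disabled because maintaining momentum
# buffers would densify those sparse updates.
proxy_bank = nn.Embedding(1_000_000, 256, sparse=True)
optimizer = torch.optim.SGD(proxy_bank.parameters(), lr=1.0, momentum=0.0)

rows = torch.tensor([5, 42, 9000])       # subsampled proxy ids in this batch
loss = proxy_bank(rows).pow(2).sum()     # dummy loss touching a few rows
loss.backward()                          # proxy_bank.weight.grad is sparse
optimizer.step()                         # updates only the touched rows
```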
Results
Across offline metrics, user studies, and online A/B experiments, the proposed unified embedding improves both relevance and engagement of Pinterest's visual search products, for both browsing and searching, compared to the existing specialized embeddings.
Moreover, deploying the unified embedding drastically reduces the operational and engineering costs of maintaining multiple embeddings.
Other related blogs:
BMVC’19: Classification is a Strong Baseline for Deep Metric Learning
KDD’18: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
WWW’17: Visual Discovery at Pinterest
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
RecSys’18: Causal Embeddings for Recommendation
Other references:
ICCV’17: No Fuss Distance Metric Learning using Proxies
ICCV’17: Sampling Matters in Deep Embedding Learning
ECCV’18: Deep Metric Learning with Hierarchical Triplet Loss
Best paper in RecSys:
https://recsys.acm.org/best-papers/