KDD’19: Learning a Unified Embedding for Visual Search at Pinterest
Recommendation system paper challenge (22/50)
Engineering blog related to this paper: Unifying visual embeddings for visual search at Pinterest
What problem do they solve?
They want to learn a single unified image embedding that can power Pinterest's multiple visual search products.
How do other people solve this problem?
In academia, the best model for graph-based recommendation systems is the Graph Convolutional Network (GCN).
Challenges
Different Applications Have Different Objectives
retrieval (pin and image queries)
ranking (text, pin, user, and image queries)
classification or regression (e.g., near-dup classification, click-through-rate prediction, image type)
upstream multi-modal embedding models (e.g., PinSAGE [19])
Domain shift of camera to Pinterest images:
Flashlight optimizes for browsing relevance within Pinterest catalog images, while Lens optimizes for browsing Pinterest catalog images from camera photos.
Finally, Shop-the-Look optimizes for finding the exact product matching an object in a scene, for shopping.
Embedding Maintenance/Cost/Deprecation
Each improvement to an embedding requires a full back-fill for deployment, which is expensive, particularly because so many downstream systems depend on these embeddings.
Effective Usage of Datasets
Each embedding task used to require a human to carefully curate its own dataset.
Through multi-task learning, they want to minimize the amount of costly human curation while leveraging as much engagement data as possible.
Scalable and Efficient Representation
Pinterest requires an image representation that is cheap to store and computationally efficient for common operations, such as computing distances for image similarity.
Model Architecture
All tasks share a common base network up to the point where the embedding is generated; each task then splits into its own branch. Each task branch is simply a fully connected layer (whose weights are the proxies, with no bias term) followed by a softmax cross-entropy loss.
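A minimal PyTorch sketch of this shared-trunk design, assuming a ResNet-50 backbone; the class name UnifiedEmbeddingModel and the per-task class counts are illustrative, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class UnifiedEmbeddingModel(nn.Module):
    def __init__(self, embed_dim=256, num_classes_per_task=(1000, 5000, 20000)):
        super().__init__()
        # Assumed backbone; the paper's exact base network may differ.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.base = backbone  # common base network shared by all tasks
        # One branch per task: a bias-free FC layer whose weight rows are the proxies.
        self.task_branches = nn.ModuleList(
            nn.Linear(embed_dim, n, bias=False) for n in num_classes_per_task
        )

    def forward(self, images):
        embedding = self.base(images)  # the unified embedding
        return embedding, [branch(embedding) for branch in self.task_branches]

model = UnifiedEmbeddingModel()
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 1000, (4,))
embedding, logits = model(images)
loss_task0 = F.cross_entropy(logits[0], labels)  # softmax cross-entropy for one task
```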
Subsampling Module
Given N images in a batch and M target proxies, each with embedding dimension D, the proxy bank is an M×D matrix.
This matrix may not fit in GPU memory as M scales to millions of classes, so they store it in CPU RAM and subsample classes for each batch.
The sampled subset is guaranteed to contain all the ground-truth class labels of the images in the training batch (the label indices are remapped, however, so that every index stays within the bounds of the number of sampled classes). A sketch of this subsampling follows below.
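A rough sketch of the subsampling step, assuming the proxy bank lives in CPU RAM; the helper name subsample_proxies and the uniform sampling of negatives are my assumptions:

```python
import torch

def subsample_proxies(proxy_bank, batch_labels, num_sampled):
    """Sample a subset of proxy rows that always contains the batch's
    ground-truth classes, and remap labels into the sampled index space.
    proxy_bank: (M, D) CPU tensor; batch_labels: (N,) global class ids."""
    positives = batch_labels.unique()
    # Sample the remaining slots uniformly from all other classes.
    mask = torch.ones(proxy_bank.size(0), dtype=torch.bool)
    mask[positives] = False
    negatives = mask.nonzero(as_tuple=True)[0]
    perm = negatives[torch.randperm(negatives.numel())]
    sampled = torch.cat([positives, perm[: num_sampled - positives.numel()]])
    # Remap each global label to its slot within the sampled subset.
    remap = torch.full((proxy_bank.size(0),), -1, dtype=torch.long)
    remap[sampled] = torch.arange(sampled.numel())
    return proxy_bank[sampled], remap[batch_labels]

proxies = torch.randn(1_000_000, 256)        # M x D proxy bank in CPU RAM
labels = torch.randint(0, 1_000_000, (32,))
sampled_proxies, local_labels = subsample_proxies(proxies, labels, num_sampled=4096)
```

Only the sampled rows (a 4096×D slice here) need to be moved to the GPU for the batch's forward and backward pass.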
Binarization Module
They need efficient representations in order to:
(1) decrease cold storage costs (e.g. cost of AWS S3)
(2) bandwidth for downstream consumers (e.g. I/O costs to fully ingest the embeddings into map reduce jobs)
(3) improve the latency of real-time scoring
They find that embeddings learned with the classification-based metric learning approach can be binarized by thresholding at zero with little drop in performance, as sketched below.
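For illustration, binarization amounts to thresholding each embedding dimension at zero and packing the bits: a 256-d fp32 embedding shrinks from 1 KB to 32 bytes, and similarity reduces to a Hamming distance (the dimensions here are illustrative):

```python
import numpy as np
import torch

embeddings = torch.randn(4, 256)        # float embeddings from the model
codes = (embeddings > 0).numpy()        # threshold at zero: one bit per dimension
packed = np.packbits(codes, axis=1)     # 256 bits -> 32 bytes per image

# Similarity between binary codes is a cheap Hamming distance:
# XOR the packed bytes and count the differing bits.
hamming = np.unpackbits(packed[0] ^ packed[1]).sum()
```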
Model Training
They train their model in a multi-tower distributed setup, sharing parameters throughout the network except for the sparse proxy parameters.
Each of the eight nodes has its own full copy of the CPU embedding bank, since sparse parameters cannot currently be distributed by the PyTorch framework.
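A heavily simplified sketch of that setup (launched with e.g. torchrun, one process per GPU), reusing the UnifiedEmbeddingModel sketch above: only the dense base network is wrapped in DistributedDataParallel, while the sparse proxy bank stays an ordinary node-local module.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # one process per GPU
base = UnifiedEmbeddingModel().cuda()        # dense parameters, synchronized
base = DDP(base, device_ids=[torch.cuda.current_device()])
# Full per-node CPU copy of the proxy bank; its sparse parameters are
# deliberately kept outside DDP, since PyTorch cannot distribute them.
proxy_bank = nn.Embedding(1_000_000, 256, sparse=True)
```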
Mini-Batch and Loss for Multi-task
For every mini-batch, they balance a uniform mix of the datasets by sampling a fixed number of images from each; an epoch is defined by the number of iterations needed to pass through the largest dataset.
Each dataset has its own independent tasks, so an image's gradient contributions are ignored for tasks it has no labels for.
The losses from all the tasks are assigned equal weights and summed for backpropagation; with three tasks, for example, each of the three losses gets the same weight. A sketch follows below.
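A small sketch of this equal-weight multi-task loss, assuming a label of -1 marks images with no annotation for a given task; the helper name and the -1 convention are illustrative:

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits_per_task, labels_per_task):
    # Sum equally weighted softmax losses, skipping images that have no
    # label for a task so they contribute no gradient to that branch.
    total = torch.zeros(())
    for logits, labels in zip(logits_per_task, labels_per_task):
        valid = labels >= 0
        if valid.any():
            total = total + F.cross_entropy(logits[valid], labels[valid])
    return total

logits_a = torch.randn(4, 10)
logits_b = torch.randn(4, 7)
labels_a = torch.tensor([3, -1, 7, -1])   # two images lack labels for task A
labels_b = torch.tensor([0, 5, -1, 2])
loss = multitask_loss([logits_a, logits_b], [labels_a, labels_b])
```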
Sparse Tensor Optimization
They represent the sparsely subsampled proxy banks as sparse tensors, which avoids computing expensive dense gradients for all the proxies during backpropagation.
Instead of using momentum, they simply use a higher learning rate: momentum would be very expensive here, because the sparse gradient tensors would be aggregated into dense gradient updates. This decreases their training time by 40% while retaining comparable performance. A sketch follows below.
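A minimal sketch of this optimization, assuming PyTorch's sparse-gradient support in nn.Embedding with plain SGD (momentum must be zero for sparse gradients; the learning rate here is arbitrary):

```python
import torch
import torch.nn as nn

# With sparse=True, only the proxy rows touched in the forward pass receive
# (sparse) gradients; momentum is disabled because maintaining momentum
# buffers would densify those sparse updates.
proxy_bank = nn.Embedding(1_000_000, 256, sparse=True)
optimizer = torch.optim.SGD(proxy_bank.parameters(), lr=1.0, momentum=0.0)

rows = torch.tensor([5, 42, 9000])       # subsampled proxy ids in this batch
loss = proxy_bank(rows).pow(2).sum()     # dummy loss touching a few rows
loss.backward()                          # proxy_bank.weight.grad is sparse
optimizer.step()                         # updates only the touched rows
```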
Results
Across offline metrics, user studies, and online A/B experiments, the proposed unified embedding improves both relevance and engagement of Pinterest's visual search products, for both browsing and searching, compared to the existing specialized embeddings.
Moreover, deploying the unified embedding drastically reduces the operational and engineering costs of maintaining multiple embeddings.
Other related blogs:
BMVC’19: Classification is a Strong Baseline for Deep Metric Learning
KDD’18: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
WWW’17: Visual Discovery at Pinterest
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
RecSys’18: Causal Embeddings for Recommendation
Other references:
ICCV’17: No Fuss Distance Metric Learning using Proxies
ICCV’17: Sampling Matters in Deep Embedding Learning
ECCV’18: Deep Metric Learning with Hierarchical Triplet Loss
Best paper in RecSys:
https://recsys.acm.org/best-papers/