CVPR'19: Complete the Look: Scene-based Complementary Product Recommendation
Computer Vision paper challenge (2/30)
🤔 What problem do they solve?
Given a scene image that does not contain the product itself, recommend a fashion product that is compatible with the scene.
🤔 How do other researchers solve this problem?
Prior work mostly focuses on product-to-product recommendation.
For example, given a pair of jeans, recommend shoes whose style is compatible with them.
🤔 What is their data? (labeled by humans)
At Pinterest, they have two Shop the Look (STL) datasets, containing various scene images and shoppable products from partners. STL-Fashion contains fashion images and products, while STL-Home includes interior design and home decor items.
Both datasets have scene-product pairs, bounding boxes for products, and product category information, all of which are labeled by internal workers.
Unlike the Exact Street2Shop dataset, where users only provide exact product matches, here workers also label products that share a similar style with the observed product and are compatible with the scene.
Starting from the STL datasets, Pinterest applies a heuristic rule that crops the product out of each scene image, generating the Complete the Look (CTL) datasets.
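I can't reproduce the exact heuristic, but a minimal sketch of the idea (the side-selection rule below is my assumption, not Pinterest's actual rule) could look like this:

```python
# Hedged sketch: crop the product out of a scene image so the model must
# recommend it from surrounding context alone.
from PIL import Image

def crop_out_product(scene: Image.Image, bbox: tuple) -> Image.Image:
    """bbox = (left, top, right, bottom) of the product in the scene."""
    w, h = scene.size
    left, top, right, bottom = bbox
    # Candidate crops: everything above, below, left of, or right of the box.
    candidates = [
        (0, 0, w, top),      # above the product
        (0, bottom, w, h),   # below the product
        (0, 0, left, h),     # left of the product
        (right, 0, w, h),    # right of the product
    ]
    # Keep the largest region so the remaining scene stays informative
    # (assumption: largest-area selection).
    best = max(candidates, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    return scene.crop(best)
```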
🤔 What is their model?
They consider two sources of information: global (the whole scene) and local (scene patches).
Global Compatibility
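Global compatibility is measured as the ℓ2 distance between the whole scene and the product after both are embedded into a shared style space. A minimal sketch, assuming ResNet features as input; the linear projection heads and dimensions below are my assumptions:

```python
import torch.nn as nn

class GlobalCompatibility(nn.Module):
    """d_global(s, p) = || f(s) - g(p) ||_2 in a shared style space."""
    def __init__(self, feat_dim=2048, style_dim=128):
        super().__init__()
        # Separate projection heads for scene and product features
        # (the head structure is an assumption).
        self.scene_head = nn.Linear(feat_dim, style_dim)
        self.product_head = nn.Linear(feat_dim, style_dim)

    def forward(self, scene_feat, product_feat):
        s = self.scene_head(scene_feat)       # f(s)
        p = self.product_head(product_feat)   # g(p)
        return (s - p).pow(2).sum(-1).sqrt()  # l2 distance per pair
```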
🤔 Is global compatibility alone enough?
NO.
Considering only global compatibility may overlook key details in the scene.
Hence, we need local compatibility! 😁
Local Compatibility
They first measure the compatibility between every scene patch and the product, and then adopt category-aware attention to assign weights over all regions (see the sketch after this list):
- Compute the distance between every scene patch and the product to obtain d1, d2, …
- Apply category-aware attention to obtain a weight for each patch
- Combine each patch distance with its weight to get the final local distance
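A minimal sketch of those three steps (the dot-product attention score between patch and category embeddings is my assumption; the paper's exact scoring head may differ):

```python
import torch.nn.functional as F

def local_distance(patch_emb, product_emb, category_emb):
    """
    Steps from the list above:
      d_i     = || f(s_i) - g(p) ||_2       distance of patch i to the product
      a_i     = softmax_i(score(s_i, c))    category-aware attention weight
      d_local = sum_i a_i * d_i             weighted local distance

    patch_emb:    (num_patches, style_dim)  embedded scene patches
    product_emb:  (style_dim,)              embedded product
    category_emb: (style_dim,)              embedding of the product's category
    """
    # Per-patch distances d_1, d_2, ...
    d = (patch_emb - product_emb).pow(2).sum(-1).sqrt()
    # Category-aware attention weights over all patches
    a = F.softmax(patch_emb @ category_emb, dim=0)
    # Weighted sum gives the final local distance
    return (a * d).sum()
```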
😎 Combine the local distance and the global distance into the final distance
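If I read the paper correctly, the combination is simply an average of the two distances:

```
d(s, p) = ( d_global(s, p) + d_local(s, p) ) / 2
```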
🙃 Training
They apply the hinge loss to learn style embeddings by considering triplets (scene, positive product, negative product).
Why hinge loss?
Once the positive product is closer than the negative one by the margin, the loss is zero and the model stops pushing embeddings further apart, which acts as a form of regularization and helps avoid overfitting.
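A minimal sketch of the triplet hinge loss (the margin value 0.2 is my assumption; the paper tunes its own):

```python
import torch

def hinge_triplet_loss(d_pos, d_neg, margin=0.2):
    """
    d_pos: distance between a scene and a compatible (positive) product
    d_neg: distance between the same scene and a negative product
    The loss is zero once the positive is closer than the negative by at
    least `margin`, so satisfied triplets stop contributing gradients.
    """
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```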
🤨 Experiments
Baseline models
Popularity: A simple baseline which recommends products based on their popularity (i.e., the number of associated (scene, product) pairs).
ImageNet Features: Directly use visual features from a ResNet pre-trained on ImageNet, which have shown strong performance at retrieving visually similar images. Similarity is measured via the ℓ2 distance between embeddings (see the sketch after this list of baselines).
IBR [28]: Image-based recommendation (IBR) measures product compatibility via a learned Mahalanobis distance between visual embeddings. Essentially IBR learns a linear transformation to convert visual features into a style space.
Siamese Nets: Veit et al. [40] adopt Siamese CNNs [7] to learn style embeddings from product images, and measure their compatibility using an ℓ2 distance. At Pinterest, they fine-tune the network based on a pre-trained model.
BPR-DAE [36]: This method uses autoencoders to extract representations from clothing images and textual descriptions, and incorporates them into the BPR recommendation framework. Due to the absence of textual information in the STL/CTL datasets, only its visual module is used.
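A tiny sketch of the ImageNet-features baseline above (the function name and use of NumPy are mine; any off-the-shelf ResNet embedding would do):

```python
import numpy as np

def retrieve_by_l2(scene_feat, product_feats, k=10):
    """
    Rank candidate products by the l2 distance between off-the-shelf
    ResNet embeddings, with no training at all.
    scene_feat:    (d,)    feature of the query scene
    product_feats: (n, d)  features of candidate products
    """
    dists = np.linalg.norm(product_feats - scene_feat, axis=1)
    return np.argsort(dists)[:k]  # indices of the k closest products
```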
From the results, cropping the product out of the scene brings a clear win for the model.
Besides that, adding local compatibility also earns some gains.
From the figure above, the category-aware attention mechanism focuses on regions that matter for the product category, rather than always attending to human faces, as models pre-trained on huge image datasets tend to do.
🥳 Qualitative results of the model
🧐 Reference:
[28] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, 2015.
[36] X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma. Neurostylist: Neural compatibility modeling for clothing matching. In MM, 2017.
[40] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. J. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In ICCV, 2015.
🙃 Other related blogs:
COLING’14: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
NIPS’17: Attention Is All You Need (Transformer)
NIPS’13: Distributed Representations of Words and Phrases and their Compositionality
KDD’19: Learning a Unified Embedding for Visual Search at Pinterest
BMVC’19: Classification is a Strong Baseline for Deep Metric Learning
KDD’18: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
WWW’17: Visual Discovery at Pinterest
🧐 Conferences
ICCV: International Conference on Computer Vision
http://iccv2019.thecvf.com/submission/timeline
CVPR: Conference on Computer Vision and Pattern Recognition
ECCV: European Conference on Computer Vision
Top Conference Paper Challenge:
https://medium.com/@arthurlee_73761/top-conference-paper-challenge-2d7ca24115c6
My Website: