RecSys ’18: Causal Embeddings for Recommendation
Recommendation system paper challenge (17/50)
What problem do they solve?
Most real-world recommendation systems train their models on biased user feedback data.
For example, items with a long exposure history receive far more impressions than other items, so new items are seldom shown to users.
What is the model they propose?
They propose the Causal Embeddings (CausE) model, which requires only small modifications to a standard matrix factorization (MF) model.
First they prepare two datasets: one collected from the current recommendation system (large, biased, the control sample Sc) and one collected under a fully randomized exposure policy (small, unbiased, the treatment sample St).
They then jointly train the two factorizations with SGD (stochastic gradient descent), tying the treatment and control product representations together with a regularizer.
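The joint training can be sketched roughly as below. This is an illustrative toy, not the authors' code: the sample sizes, learning rate, and regularization weight are made up, and the loss is a plain squared error. The key idea it shows is two product-embedding tables, one fit on the biased control sample Sc and one on the small randomized sample St, with each product's treatment embedding pulled toward its control embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 50, 20, 8
U = rng.normal(scale=0.1, size=(n_users, dim))    # shared user embeddings
V_c = rng.normal(scale=0.1, size=(n_items, dim))  # control product embeddings
V_t = rng.normal(scale=0.1, size=(n_items, dim))  # treatment product embeddings

def sgd_step(u, i, y, V, lr=0.05, reg_pair=None, lam=0.1):
    """One SGD step on squared error; optionally pull V[i] toward reg_pair[i]."""
    err = U[u] @ V[i] - y
    grad_u, grad_v = err * V[i], err * U[u]
    if reg_pair is not None:
        # Discrepancy regularizer tying treatment and control embeddings.
        grad_v = grad_v + lam * (V[i] - reg_pair[i])
    U[u] -= lr * grad_u
    V[i] -= lr * grad_v

# Toy samples of (user, item, feedback): Sc is large and concentrated on a
# few popular items (biased); St is small but uniform over all items.
S_c = [(rng.integers(n_users), rng.integers(5), 1.0) for _ in range(2000)]
S_t = [(rng.integers(n_users), rng.integers(n_items), 1.0) for _ in range(200)]

for epoch in range(5):
    for u, i, y in S_c:
        sgd_step(u, i, y, V_c)                # control task
    for u, i, y in S_t:
        sgd_step(u, i, y, V_t, reg_pair=V_c)  # treatment task, tied to control
```

Because both tasks share the user embeddings and the regularizer couples the two product tables, information from the large biased sample still flows into the treatment representations.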
Data?
MovieLens10M
Netflix datasets
Metric?
AUC: area under ROC curve
MSE Lift
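As a reminder of what the AUC metric measures: it is the probability that a randomly chosen positive item is scored above a randomly chosen negative one. A minimal pure-Python version (ties counted as half a win):

```python
def auc(labels, scores):
    """AUC: probability a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

result = auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.5])  # 5/6 ~ 0.833
```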
Baselines model?
Matrix Factorization Baselines
Bayesian Personalized Ranking (BPR)
Supervised-Prod2Vec (SP2V):
Causal Inference Baselines:
Weighted-SupervisedP2V (WSP2V):
They apply the SP2V algorithm to propensity-weighted data to test the performance of propensity-based causal inference.
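The idea behind propensity weighting can be shown with a small inverse-propensity-scoring (IPS) sketch. This is not the WSP2V implementation; the logging propensities and rewards here are synthetic. Each logged interaction is weighted by (target probability) / (logging probability), so over-exposed popular items are down-weighted and rarely shown items are up-weighted, recovering an unbiased estimate of how a uniform policy would perform.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 10

# Logging policy: popular items are shown more often (item 9 is 10x item 0).
p = np.arange(1, n_items + 1, dtype=float)
p /= p.sum()

r = rng.random(n_items)            # fixed per-item reward (e.g. click rate)
true_value = r.mean()              # value of a uniform exposure policy

# Logged data: items drawn according to the biased logging policy.
logged = rng.choice(n_items, size=20000, p=p)

# IPS weights toward a uniform target policy (1/n_items per item).
w = (1.0 / n_items) / p[logged]
ips_estimate = np.mean(w * r[logged])
```

A naive average of `r[logged]` would over-represent popular items; the IPS estimate should land close to `true_value` instead.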
BanditNet (BN):
They employ SP2V as the target policy πw and model the behavior of the existing recommendation system as a popularity-based logging policy πc.
How do they leverage the exploration sample St?
- No adaptation (no) — It is trained only on the Sc sample.
- Blended adaptation (blend) — the algorithm is trained on the union of the Sc and St samples.
- Test-only adaptation (test) — the model is trained only on the St samples.
- Average test adaptation (avg) — the algorithm constructs an average treatment product by pooling all of the points from the St sample into a single vector (it applies only to CausE).
- Product-level adaptation (prod) — the algorithm constructs a separate treatment embedding for each product based on the St sample (it applies only to CausE). For the final prediction we can use either the control (denoted CausE-prod-C) or the treatment product representation (denoted CausE-prod-T).
Result
My review:
Training on biased data is universal in real-world systems.
Although there are many approaches that relax this problem, for example using time-decay features or downsampling so that new items are shown with higher probability, the algorithm they propose is still a novel idea to me.
I feel the main cost is creating the randomized set (St) and maintaining two models for future improvements. For example, when we need to add one feature to the model, we have to retrain one model for Sc and another for St, and also modify the training pipeline; I am not sure this approach gives a big gain compared with other, simpler methods.
Other related blogs:
Beyond Clicks: Dwell Time for Personalization
RecSys16: Adaptive, Personalized Diversity for Visual Discovery
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
RecSys ’17: Translation-based Recommendation
RecSys ’18: HOP-Rec: High-Order Proximity for Implicit Recommendation
RecSys ’18: Impact of item consumption on assessment of recommendations in user studies
RecSys ’18: Generation Meets Recommendation: Proposing Novel Items for Groups of Users
Best paper in RecSys:
https://recsys.acm.org/best-papers/
My Website: