RecSys ’18: Causal Embeddings for Recommendation
Recommendation system paper challenge (17/50)
What problem do they solve?
Most real-world recommendation systems train their models on biased user feedback data.
For example, items with a long exposure history receive far more impressions than other items, so new items are seldom shown to users.
What is the model they propose?
They propose the Causal Embeddings (CausE) model, which requires only small modifications to a standard matrix factorization (MF) model.
First they prepare two datasets: one collected from the current recommendation system (large, biased, the control sample Sc) and one collected under a fully randomized exposure policy (small, unbiased, the treatment sample St).
They then jointly train the two factorizations with SGD (stochastic gradient descent), tying the treatment and control product representations together with a regularizer.
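The joint training can be sketched roughly as below. This is an illustrative toy, not the authors' code: the sample sizes, learning rate, and regularization weight are made up, and the loss is a plain squared error. The key idea it shows is two product-embedding tables, one fit on the biased control sample Sc and one on the small randomized sample St, with each product's treatment embedding pulled toward its control embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 50, 20, 8
U = rng.normal(scale=0.1, size=(n_users, dim))    # shared user embeddings
V_c = rng.normal(scale=0.1, size=(n_items, dim))  # control product embeddings
V_t = rng.normal(scale=0.1, size=(n_items, dim))  # treatment product embeddings

def sgd_step(u, i, y, V, lr=0.05, reg_pair=None, lam=0.1):
    """One SGD step on squared error; optionally pull V[i] toward reg_pair[i]."""
    err = U[u] @ V[i] - y
    grad_u, grad_v = err * V[i], err * U[u]
    if reg_pair is not None:
        # Discrepancy regularizer tying treatment and control embeddings.
        grad_v = grad_v + lam * (V[i] - reg_pair[i])
    U[u] -= lr * grad_u
    V[i] -= lr * grad_v

# Toy samples of (user, item, feedback): Sc is large and concentrated on a
# few popular items (biased); St is small but uniform over all items.
S_c = [(rng.integers(n_users), rng.integers(5), 1.0) for _ in range(2000)]
S_t = [(rng.integers(n_users), rng.integers(n_items), 1.0) for _ in range(200)]

for epoch in range(5):
    for u, i, y in S_c:
        sgd_step(u, i, y, V_c)                # control task
    for u, i, y in S_t:
        sgd_step(u, i, y, V_t, reg_pair=V_c)  # treatment task, tied to control
```

Because both tasks share the user embeddings and the regularizer couples the two product tables, information from the large biased sample still flows into the treatment representations.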
Data?
MovieLens10M
Netflix datasets
Metric?
AUC: area under ROC curve
MSE Lift
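As a reminder of what the AUC metric measures: it is the probability that a randomly chosen positive item is scored above a randomly chosen negative one. A minimal pure-Python version (ties counted as half a win):

```python
def auc(labels, scores):
    """AUC: probability a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

result = auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.5])  # 5/6 ~ 0.833
```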
Baselines model?
Matrix Factorization Baselines
Bayesian Personalized Ranking (BPR)
Supervised-Prod2Vec (SP2V):
Causal Inference Baselines:
Weighted-SupervisedP2V (WSP2V):
They apply the SP2V algorithm to propensity-weighted data to test the performance of propensity-based causal inference.
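The idea behind propensity weighting can be shown with a small inverse-propensity-scoring (IPS) sketch. This is not the WSP2V implementation; the logging propensities and rewards here are synthetic. Each logged interaction is weighted by (target probability) / (logging probability), so over-exposed popular items are down-weighted and rarely shown items are up-weighted, recovering an unbiased estimate of how a uniform policy would perform.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 10

# Logging policy: popular items are shown more often (item 9 is 10x item 0).
p = np.arange(1, n_items + 1, dtype=float)
p /= p.sum()

r = rng.random(n_items)            # fixed per-item reward (e.g. click rate)
true_value = r.mean()              # value of a uniform exposure policy

# Logged data: items drawn according to the biased logging policy.
logged = rng.choice(n_items, size=20000, p=p)

# IPS weights toward a uniform target policy (1/n_items per item).
w = (1.0 / n_items) / p[logged]
ips_estimate = np.mean(w * r[logged])
```

A naive average of `r[logged]` would over-represent popular items; the IPS estimate should land close to `true_value` instead.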
BanditNet (BN):
They employ SP2V as the target policy πw and model the behavior of the existing recommendation system as a popularity-based logging policy πc.
How do they leverage the exploration sample St?
- No adaptation (no) — It is trained only on the Sc sample.
- Blended adaptation (blend) — the algorithm is trained on the union of the Sc and St samples.
- Test-only adaptation (test) — the model is trained only on the St samples.
- Average test adaptation (avg) — the algorithm constructs an average treatment product by pooling all of the points from the St sample into a single vector (it applies only to CausE).
- Product-level adaptation (prod) — the algorithm constructs a separate treatment embedding for each product based on the St sample (it applies only to CausE). For the final prediction we can use either the control (denoted CausE-prod-C) or the treatment product representation (denoted CausE-prod-T).
Result
My review:
Training on biased data is universal in real-world systems.
Although there are many approaches that relax this problem, for example using time-decay features or downsampling so that new items are shown with higher probability, the algorithm they propose is still a novel idea to me.
I feel the main cost is creating the randomized set (St) and maintaining two models for future improvements. For example, when we need to add one feature to the model, we have to retrain one model for Sc and another for St, and also modify the training pipeline; I am not sure this approach gives a big gain compared with other, simpler methods.
Other related blogs:
Beyond Clicks: Dwell Time for Personalization
RecSys16: Adaptive, Personalized Diversity for Visual Discovery
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
RecSys ’17: Translation-based Recommendation
RecSys ’18: HOP-Rec: High-Order Proximity for Implicit Recommendation
RecSys ’18: Impact of item consumption on assessment of recommendations in user studies
RecSys ’18: Generation Meets Recommendation: Proposing Novel Items for Groups of Users
Best paper in RecSys:
https://recsys.acm.org/best-papers/
My Website: