KDD’18: Graph Convolutional Neural Networks for Web-Scale Recommender Systems
Recommendation system paper challenge (21/50)
Engineering blog related to this paper (PinSage: A new graph convolutional neural network for web-scale recommender systems)
My review link for the engineering blog
What problem do they solve?
They want to recommend images (pins) to users based not only on image features but also on the graph structure.
How do other people solve this problem?
In academia, the best models for graph-based recommendation are Graph Convolutional Networks (GCNs).
What challenges do they face when applying Graph Convolutional Networks?
At Pinterest, the graph is much bigger than the graphs used in academic work, so many of the standard assumptions behind GCNs are violated.
For example, operating on the full graph Laplacian during training is infeasible for a graph with billions of nodes whose structure is constantly evolving.
What are their ideas for scaling up the model?
On-the-fly convolutions
Instead of multiplying feature matrices by powers of the full graph Laplacian, Pinterest samples the neighborhood around a node and dynamically constructs a computation graph from this sampled neighborhood.
This hugely reduces the computational cost.
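Below is a minimal Python sketch of this idea on a toy adjacency-list graph; the helper names (sample_neighbors, localized_convolve), the mean aggregator, and the concatenate-then-ReLU step are my own simplification of the paper's convolve operation, not its exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(adj, node, num_samples):
    """Sample a fixed-size neighborhood instead of touching the full graph."""
    neighbors = adj[node]
    if len(neighbors) <= num_samples:
        return list(neighbors)
    return list(rng.choice(neighbors, size=num_samples, replace=False))

def localized_convolve(adj, features, node, W_self, W_neigh, num_samples=3):
    """One on-the-fly convolution: aggregate sampled neighbors, combine with self."""
    neigh = sample_neighbors(adj, node, num_samples)
    neigh_agg = features[neigh].mean(axis=0)           # simple mean aggregation
    h = np.concatenate([features[node] @ W_self, neigh_agg @ W_neigh])
    return np.maximum(h, 0.0)                           # ReLU

# Toy graph: adjacency lists and random node features.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
features = rng.normal(size=(4, 8))
W_self = rng.normal(size=(8, 4))
W_neigh = rng.normal(size=(8, 4))

print(localized_convolve(adj, features, node=0, W_self=W_self, W_neigh=W_neigh))
```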
Producer-consumer minibatch construction
To maximize GPU utilization during model training, Pinterest builds a producer-consumer architecture for constructing mini-batches (a minimal sketch follows the two roles below).
Producer (CPU): samples nodes and fetches their features to define the local convolutions.
Consumer (GPU): consumes these pre-defined computation graphs and runs stochastic gradient descent.
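A hedged sketch of this pattern with a bounded Python queue standing in for the real pipeline; the sentinel-based shutdown and the placeholder minibatch contents are my own illustration, not the production system.

```python
import queue
import threading

batch_queue = queue.Queue(maxsize=4)  # bounded so the CPU stays a few steps ahead

def producer(num_batches):
    """CPU side: sample nodes, fetch features, build the computation graph."""
    for step in range(num_batches):
        minibatch = {"step": step, "nodes": [...], "features": [...]}  # placeholders
        batch_queue.put(minibatch)
    batch_queue.put(None)  # sentinel: no more batches

def consumer():
    """GPU side: consume pre-built minibatches and run one SGD step each."""
    while True:
        minibatch = batch_queue.get()
        if minibatch is None:
            break
        # sgd_step(model, minibatch) would go here
        print("trained on minibatch", minibatch["step"])

threading.Thread(target=producer, args=(8,)).start()
consumer()
```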
Efficient MapReduce inference
Given a fully trained GCN model, Pinterest distributes it with a MapReduce pipeline to generate embeddings for all nodes while minimizing repeated computation.
Constructing convolutions via random walks
Performing convolutions on the full neighborhoods of nodes is too expensive, so Pinterest samples neighborhoods with short random walks. In addition, each neighbor gets an importance score, which is also used in the pooling/aggregation step.
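A minimal sketch of how short random walks could produce normalized visit counts for a node's neighborhood; the walk length, number of walks, and unweighted-graph assumption are illustrative choices, not the paper's production settings.

```python
import random
from collections import Counter

def random_walk_visit_counts(adj, start, num_walks=100, walk_length=3, seed=0):
    """Run short random walks from `start` and count how often each node is visited."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            node = rng.choice(adj[node])
            counts[node] += 1
    counts.pop(start, None)  # the start node is not its own neighbor
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}  # normalized visit counts

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(random_walk_visit_counts(adj, start=0))
```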
Importance pooling
Instead of aggregating features with equal weights, they use random-walk similarity measures to weight the importance of node features in the aggregation.
There are two steps (a sketch follows the list):
- pick a fixed number of neighboring nodes -> controls memory consumption
- use the normalized random-walk visit counts as the importance measure
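Building on the visit counts above, here is a hedged sketch of importance pooling; the top-T selection and the weighted average stand in for the paper's importance-based neighborhood and aggregation, with toy numbers of my own.

```python
import numpy as np

def importance_pool(features, visit_counts, top_t=2):
    """Keep the top-T most-visited neighbors and average their features,
    weighted by the normalized visit counts."""
    top = sorted(visit_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_t]
    nodes = [n for n, _ in top]
    weights = np.array([w for _, w in top])
    weights = weights / weights.sum()                   # re-normalize over the top-T
    return (features[nodes] * weights[:, None]).sum(axis=0)

features = np.arange(16, dtype=float).reshape(4, 4)     # toy node features
visit_counts = {1: 0.5, 2: 0.3, 3: 0.2}                 # e.g., from the sketch above
print(importance_pool(features, visit_counts, top_t=2))
```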
Training the model
Loss function
Pinterest applies a max-margin ranking loss, so that the output embeddings of related (positive) pairs are close to each other while negative items score lower than the positive item by at least a margin.
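Roughly, for a query embedding z_q, a positive item z_i, and a negative item z_n, the loss penalizes max(0, z_q·z_n - z_q·z_i + margin). A minimal sketch, where the margin value and the averaging over negatives are my own choices:

```python
import numpy as np

def max_margin_loss(z_q, z_pos, z_negs, margin=0.1):
    """Hinge loss: the positive pair's score should beat every negative by `margin`."""
    pos_score = z_q @ z_pos                     # similarity of the related pair
    neg_scores = z_negs @ z_q                   # similarities to the negative items
    return np.maximum(0.0, neg_scores - pos_score + margin).mean()

rng = np.random.default_rng(0)
z_q, z_pos = rng.normal(size=16), rng.normal(size=16)
z_negs = rng.normal(size=(500, 16))             # e.g., the 500 shared negatives
print(max_margin_loss(z_q, z_pos, z_negs))
```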
Producer-consumer mini-batch construction
Because CPU (host) memory is large, the feature matrix for billions of nodes is stored in CPU memory.
However, during the convolve step the GPU would have to fetch this data from the CPU, which is inefficient. They apply a re-indexing technique to create a sub-graph containing only the nodes of the current mini-batch and their neighborhoods. The adjacency list of this sub-graph and the correspondingly small feature matrix are fed to the GPUs at the start of each mini-batch iteration, so no CPU-GPU communication is needed during the convolve step.
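A hedged sketch of the re-indexing idea: collect only the mini-batch nodes and their neighbors, assign them compact new ids, and slice out the small feature matrix that gets shipped to the GPU. The function and variable names here are mine, and only 1-hop neighborhoods are gathered for brevity.

```python
import numpy as np

def build_subgraph(adj, full_features, batch_nodes):
    """Re-index the mini-batch nodes and their neighbors into a compact sub-graph."""
    needed = set(batch_nodes)
    for n in batch_nodes:
        needed.update(adj[n])                    # 1-hop neighborhood in this sketch
    old_to_new = {old: new for new, old in enumerate(sorted(needed))}
    sub_adj = {
        old_to_new[n]: [old_to_new[m] for m in adj[n] if m in needed]
        for n in needed
    }
    sub_features = full_features[sorted(needed)]  # small matrix shipped to the GPU
    return sub_adj, sub_features, old_to_new

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: [5], 5: [4]}
full_features = np.random.default_rng(0).normal(size=(6, 8))
sub_adj, sub_feats, mapping = build_subgraph(adj, full_features, batch_nodes=[0, 1])
print(mapping, sub_feats.shape)
```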
Sampling negative items
In each mini-batch, a single set of 500 negative items is shared across all training examples, which reduces the number of embeddings to compute.
Is naive negative sampling good enough? No!
With billions of items, randomly sampled negatives are almost always completely unrelated to the query, so learning that the positive item is more related than these easy negatives is too simple a task, and the parameters barely get updated.
Instead of only sampling 500 easy negatives, they add "hard" negative examples for each positive training example. These hard negatives are somewhat related to the query, but less related than the positive item.
How to choose hard negative samples? From the graph!
They rank items by their Personalized PageRank scores with respect to the query q and randomly sample hard negatives from among the highly ranked items.
In this way, the model is forced to learn to distinguish items at a finer granularity.
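A hedged sketch of picking hard negatives from graph proximity, using networkx's personalized PageRank on a toy graph as a stand-in for Pinterest's production random-walk scores; the skip_top cut-off and sample size are illustrative, not the paper's exact settings.

```python
import random
import networkx as nx

def hard_negatives(graph, query, num_negatives=5, skip_top=3, seed=0):
    """Rank items by Personalized PageRank w.r.t. the query, then sample negatives
    from items that are related to the query but not the very closest ones."""
    ppr = nx.pagerank(graph, personalization={query: 1.0})
    ranked = sorted(ppr, key=ppr.get, reverse=True)
    ranked = [n for n in ranked if n != query][skip_top:]  # skip the most related items
    return random.Random(seed).sample(ranked, min(num_negatives, len(ranked)))

G = nx.karate_club_graph()                                  # toy stand-in graph
print(hard_negatives(G, query=0))
```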
The challenge of using hard negative samples: convergence
Because hard negative samples are much closer to the positive sample, training takes longer to converge.
They apply a curriculum training scheme (Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum learning. In ICML). In the first epoch they use only easy negatives, so the model quickly finds a region of parameter space with relatively small loss; in subsequent epochs they add hard negative samples so the model learns to distinguish items at a finer granularity.
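A minimal sketch of such a curriculum, assuming the number of hard negatives simply grows with the epoch; the exact schedule is my own illustration, not the paper's.

```python
def negatives_for_epoch(epoch, easy_pool, hard_pool, num_easy=500):
    """Epoch 1: only easy (random) negatives. Later epochs: mix in progressively
    more hard negatives so the task gets gradually more difficult."""
    num_hard = 0 if epoch == 1 else min(epoch - 1, len(hard_pool))
    return easy_pool[:num_easy] + hard_pool[:num_hard]

easy_pool = [f"easy_{i}" for i in range(1000)]
hard_pool = [f"hard_{i}" for i in range(10)]
for epoch in (1, 2, 5):
    negs = negatives_for_epoch(epoch, easy_pool, hard_pool)
    print(epoch, len(negs))
```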
Node Embeddings via MapReduce
Algorithm 2 (the mini-batch version) works, but running it naively at inference time is inefficient because the embeddings of shared neighbors are recomputed many times. They use MapReduce to remove this duplicate computation.
1. One MapReduce job is used to map all pins to a low-dimensional latent space, where the aggregation operation will be performed (Algorithm 1, Line 1).
2. Another MapReduce job is then used to join the resulting pin representations with the ids of the boards they occur in; the board embedding is computed by pooling the features of its (sampled) neighbors.
The first layer uses these 2 MapReduce jobs, the second layer uses 2 more, and so on: with k layers there are 2k MapReduce jobs in total.
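A toy sketch of the two-job structure in plain Python, with dictionaries standing in for MapReduce key/value streams; project_pin, the ReLU projection, and the mean pooling are simplified placeholders for the trained model's layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_pin(pin_features, W):
    """Job 1 (map): project every pin once into the latent space."""
    return np.maximum(pin_features @ W, 0.0)

def pool_boards(pin_latent, board_to_pins):
    """Job 2 (join + reduce): join pin representations with their board ids
    and pool them into board embeddings."""
    return {b: np.stack([pin_latent[p] for p in pins]).mean(axis=0)
            for b, pins in board_to_pins.items()}

pin_features = {p: rng.normal(size=8) for p in ["p1", "p2", "p3"]}
W = rng.normal(size=(8, 4))
pin_latent = {p: project_pin(x, W) for p, x in pin_features.items()}           # job 1
board_embeddings = pool_boards(pin_latent, {"b1": ["p1", "p2"], "b2": ["p2", "p3"]})  # job 2
print({b: e.shape for b, e in board_embeddings.items()})
```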
Efficient nearest-neighbor lookups
Approximate KNN (A. Andoni and P. Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS) can be performed efficiently via locality-sensitive hashing. After the hash functions are computed, retrieval can be implemented with a two-level retrieval process based on the Weak AND operator.
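A hedged sketch of approximate nearest-neighbor lookup with random-hyperplane LSH; this is a generic scheme for illustration, not Pinterest's production index, and it replaces the Weak AND retrieval step with a simple rerank inside one bucket.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class RandomHyperplaneLSH:
    """Hash vectors by the signs of projections onto random hyperplanes;
    similar vectors tend to collide in the same bucket."""
    def __init__(self, dim, num_bits=8):
        self.planes = rng.normal(size=(num_bits, dim))
        self.buckets = defaultdict(list)

    def key(self, vec):
        return tuple((self.planes @ vec > 0).astype(int))

    def add(self, item_id, vec):
        self.buckets[self.key(vec)].append((item_id, vec))

    def query(self, vec, k=5):
        candidates = self.buckets.get(self.key(vec), [])
        candidates.sort(key=lambda iv: -np.dot(iv[1], vec))  # rerank inside the bucket
        return [item_id for item_id, _ in candidates[:k]]

index = RandomHyperplaneLSH(dim=16)
embeddings = rng.normal(size=(1000, 16))
for i, e in enumerate(embeddings):
    index.add(i, e)
print(index.query(embeddings[0]))
```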
Experiments
Visual features: visual embeddings (4,096 dimensions) from the 6th layer of a VGG-16 network.
Annotation features: 256-dimensional embeddings from a Word2Vec-based model (T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS).
Offline performance compared with baseline models:
Another offline evaluation (based on the embeddings): kurtosis
Intuition: the more spread out the distribution of cosine similarities between random pairs of items, the better the embeddings.
If all items are at about the same distance (i.e., the distances are tightly clustered) then the embedding space does not have enough “resolution” to distinguish between items of different relevance.
PinSage has the most spread out distribution. In particular, the kurtosis of the cosine similarities of PinSage embeddings is 0.43, compared to 2.49 for annotation embeddings and 1.20 for visual embeddings.
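A small sketch of this check, assuming we have an embedding matrix and use scipy's (excess) kurtosis on the cosine similarities of random pairs; the pair count and toy data are my own.

```python
import numpy as np
from scipy.stats import kurtosis

def cosine_similarity_spread(embeddings, num_pairs=10000, seed=0):
    """Sample random item pairs and measure the kurtosis of their cosine similarities."""
    rng = np.random.default_rng(seed)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    i = rng.integers(0, len(normed), size=num_pairs)
    j = rng.integers(0, len(normed), size=num_pairs)
    sims = np.sum(normed[i] * normed[j], axis=1)
    return kurtosis(sims)     # excess kurtosis; lower means a more spread-out distribution

embeddings = np.random.default_rng(1).normal(size=(5000, 64))  # toy embeddings
print(cosine_similarity_spread(embeddings))
```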
User Studies
The visual-features-only model can capture categories and visual similarity, but it does not understand image semantics. For example, it recommends food images for a query image of a plant (top image), and it focuses too much on the background (bottom image).
On the other hand, the Pixie model (graph only) is good at capturing semantics but not at finding the most visually similar images.
Training and Inference Runtime Analysis
Conclusion
Graph Convolutional Networks are useful for graph-based recommendation, yet they are hard to scale to web-scale scenarios.
Pinterest provides a solution based on efficient random walks, importance pooling, and curriculum training to ensure scalability and improve performance.
For offline evaluation, they use MRR (a ranking metric) and the hit rate, both of which are commonly used for recommender systems. In addition, they plot the distribution of cosine similarities to measure the "resolution" of the embeddings.
Other related blogs:
WWW’17: Visual Discovery at Pinterest
KDD’15 Visual Search at Pinterest
RecSys16: Adaptive, Personalized Diversity for Visual Discovery
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
RecSys ’17: Translation-based Recommendation
RecSys ’18: Causal Embeddings for Recommendation
Other reference:
Word2Vec model: T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
Curriculum training: Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum learning. In ICML.
KDD:
Best paper in RecSys:
https://recsys.acm.org/best-papers/
My Website: