KDD’18: Graph Convolutional Neural Networks for Web-Scale Recommender Systems

Arthur Lee
7 min read · Sep 6, 2020

Recommendation system paper challenge (21/50)

paper link

Engineering blog related to this paper (PinSage: A new graph convolutional neural network for web-scale recommender systems)

My review link for the engineering blog

What problem do they solve?

They want to recommend images (pins) to users based not only on image features but also on the graph structure.

How do other people solve this problem?

In academia, the best models for graph-based recommendation are Graph Convolutional Networks (GCNs).

What challenges do they face when applying Graph Convolutional Networks?

At Pinterest, the graph is much bigger than the graphs used in academic work, so many of the standard assumptions are violated.

For example, the assumption that the model can operate on the full graph Laplacian during training breaks down when the graph has billions of nodes and its structure is constantly evolving.

What are their ideas for scaling up the model?

On-the-fly convolutions

Instead of multiplying feature matrices by powers of the full graph Laplacian, Pinterest samples the neighborhood around a node and dynamically constructs a computation graph from this sampled neighborhood.

This hugely reduces the computational cost.
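A minimal sketch of this idea in NumPy (the adjacency-list format, sample size, and mean-pooling here are my own simplifications, not the paper's exact convolve operator):

```python
import numpy as np

def sample_neighbors(adj_list, node, num_samples, rng):
    """Sample a fixed-size neighborhood for one node (with replacement)."""
    neighbors = adj_list[node]
    if len(neighbors) == 0:
        return [node]  # fall back to a self-loop for isolated nodes
    return list(rng.choice(neighbors, size=num_samples, replace=True))

def on_the_fly_convolve(adj_list, features, nodes, num_samples=10, rng=None):
    """One localized 'convolution': aggregate sampled-neighbor features per node."""
    rng = rng or np.random.default_rng(0)
    out = np.zeros((len(nodes), features.shape[1]))
    for i, node in enumerate(nodes):
        nbrs = sample_neighbors(adj_list, node, num_samples, rng)
        # mean-pool the sampled neighbor features; a real model would also
        # apply learned weight matrices and a nonlinearity here
        out[i] = features[nbrs].mean(axis=0)
    return out
```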

Producer-consumer minibatch construction

In order to maximize GPU utilization during model training, Pinterest builds a producer-consumer architecture for constructing mini-batches.

Producer (CPU): samples nodes and fetches their features to define the local convolutions.

Consumer (GPU): consumes these pre-defined computation graphs and runs stochastic gradient descent.
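A toy producer-consumer sketch using Python's queue and threading modules (the batch contents and the train_step stub are placeholders, not Pinterest's actual pipeline): the CPU thread keeps preparing mini-batches while the consumer loop trains on batches that are already built.

```python
import queue
import threading
import numpy as np

batch_queue = queue.Queue(maxsize=4)  # bounded: the producer stays only a few batches ahead

def build_minibatch(step, rng):
    """CPU side (stand-in): sample node ids and fetch their features."""
    node_ids = rng.integers(0, 10_000, size=512)
    features = rng.normal(size=(512, 128)).astype(np.float32)
    return node_ids, features

def train_step(batch):
    """GPU side (stand-in): where the forward/backward pass and SGD update would run."""
    node_ids, features = batch
    return float(features.mean())  # placeholder for a loss value

def producer(num_batches):
    rng = np.random.default_rng(0)
    for step in range(num_batches):
        batch_queue.put(build_minibatch(step, rng))
    batch_queue.put(None)  # sentinel: signal the consumer to stop

def consumer():
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        train_step(batch)

threading.Thread(target=producer, args=(100,), daemon=True).start()
consumer()
```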

Efficient MapReduce inference

Given a fully trained GCN model, Pinterest distributes the trained model to generate embeddings for all items while minimizing repeated computation by using a MapReduce pipeline.

Constructing convolutions via random walks

Since performing convolutions on the full neighborhood of each node is computationally expensive, Pinterest uses short random walks to define the neighborhoods. In addition, each node gets an importance score, which is also used in the pooling/aggregation step.
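A rough sketch of defining a neighborhood with short random walks (the walk length, number of walks, and toy graph are arbitrary choices of mine): the T most-visited nodes become the neighborhood, and their normalized visit counts become importance scores.

```python
import numpy as np
from collections import Counter

def random_walk_neighborhood(adj_list, start, walk_length=3, num_walks=200, top_t=10, seed=0):
    """Run short random walks from `start`; keep the T most-visited nodes
    together with their normalized visit counts."""
    rng = np.random.default_rng(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            neighbors = adj_list.get(node, [])
            if not neighbors:
                break
            node = neighbors[rng.integers(len(neighbors))]
            visits[node] += 1
    top = visits.most_common(top_t)
    total = sum(count for _, count in top) or 1
    return [(node, count / total) for node, count in top]

# toy graph as an adjacency list
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(random_walk_neighborhood(adj, start=0))
```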

Importance pooling

Instead of aggregating features with equal weights, they use random-walk similarity measures to weight the importance of node features in the aggregation.

There are two steps:

  1. pick a fixed number of neighboring nodes -> controls memory consumption
  2. use the normalized visit counts from the random walks as importance weights (see the sketch below)
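A minimal sketch of the aggregation itself, assuming the (node, normalized visit count) pairs come from a random-walk routine like the one sketched above:

```python
import numpy as np

def importance_pool(features, neighborhood):
    """Weighted average of neighbor features, where the weights are the
    normalized random-walk visit counts (importance scores)."""
    ids = [node for node, _ in neighborhood]
    weights = np.array([w for _, w in neighborhood])[:, None]
    return (features[ids] * weights).sum(axis=0)

features = np.random.default_rng(0).normal(size=(4, 8))   # toy feature matrix
neighborhood = [(1, 0.5), (2, 0.3), (3, 0.2)]              # (node, importance)
pooled = importance_pool(features, neighborhood)
```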

Training the model

Loss function

Pinterest applies a max-margin ranking loss so that the output embeddings of related (query, item) pairs are close to each other, while unrelated pairs are pushed apart by at least a margin.
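A sketch of a max-margin (hinge) ranking loss in NumPy; the dot-product similarity and the margin value of 0.1 are illustrative assumptions, and the paper should be consulted for the exact formulation:

```python
import numpy as np

def max_margin_loss(z_query, z_pos, z_negs, margin=0.1):
    """Hinge loss: penalize negatives whose similarity to the query comes
    within `margin` of the positive item's similarity."""
    pos_score = z_query @ z_pos        # similarity to the positive item
    neg_scores = z_negs @ z_query      # similarities to the negative items
    return np.maximum(0.0, neg_scores - pos_score + margin).mean()

rng = np.random.default_rng(0)
d = 16
loss = max_margin_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(500, d)))
```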

Producer-consumer mini-batch construction

Because the CPU has much larger memory, the feature matrix for billions of nodes is stored in CPU memory.

However, in the convolve step the GPU would need to access this data from the CPU, which is inefficient. They apply a re-indexing technique to create a sub-graph containing only the nodes of the current mini-batch and their neighborhoods. The adjacency list of this sub-graph and the corresponding small feature matrix are fed to the GPUs at the start of each mini-batch iteration, so no communication between CPU and GPU is needed during the convolve step.
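A minimal sketch of the re-indexing trick (variable names are hypothetical): global node ids are remapped to consecutive local ids, and only the rows of the feature matrix that the mini-batch needs are extracted for the GPU.

```python
import numpy as np

def reindex_subgraph(global_features, batch_nodes, batch_edges):
    """Build a local id space for one mini-batch so the GPU never touches
    the full (CPU-resident) feature matrix."""
    global_to_local = {g: l for l, g in enumerate(batch_nodes)}
    local_edges = [(global_to_local[u], global_to_local[v]) for u, v in batch_edges]
    local_features = global_features[list(batch_nodes)]  # small matrix sent to the GPU
    return local_features, local_edges

features = np.random.default_rng(0).normal(size=(100_000, 64))  # toy stand-in for the full feature matrix
nodes = [42, 7, 99, 99_999]
edges = [(42, 7), (7, 99), (99, 99_999)]
sub_feats, sub_edges = reindex_subgraph(features, nodes, edges)
```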

Sampling negative items

In each mini-batch, a set of 500 negative items is shared across training examples, which reduces the number of embeddings that need to be computed.

Is naive negative sampling good enough? No!

With random negatives, it is too easy for the model to learn that the positive item is more related to the query than the negatives, so the parameters barely get updated.

Instead of naively sampling 500 negatives from the full item set, they add “hard” negative examples for each positive training pair. These hard negatives are somewhat related to the query, but not as related as the positive item.

How do they choose hard negative samples? From the graph!

They compute Personalized PageRank scores with respect to the query q; among the items ranked by these scores, they randomly pick 500 as hard negative samples.

In this way, the model is forced to learn to distinguish items at a finer granularity.

The challenge of using hard negative samples: convergence

Because hard negative samples are closer to the positive sample, it is harder for training to converge.

They apply a curriculum training scheme (Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum Learning. In ICML). In the first epoch they use only easy (random) negatives, so the model quickly finds a region of the parameter space with relatively small loss; in subsequent epochs they add hard negative samples so the model learns to distinguish items at a finer granularity.
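A rough sketch of a curriculum-style negative sampler (the schedule here, no hard negatives in epoch 1 and then one more per epoch, is an assumption for illustration; the paper's exact schedule may differ):

```python
import numpy as np

def sample_negatives(epoch, easy_pool, hard_pool, num_neg=500, rng=None):
    """Epoch 1: only easy (random) negatives. Later epochs: mix in a growing
    number of hard negatives while keeping the total fixed."""
    rng = rng or np.random.default_rng(epoch)
    num_hard = 0 if epoch <= 1 else min(epoch - 1, num_neg)  # assumed schedule
    hard = rng.choice(hard_pool, size=num_hard, replace=False)
    easy = rng.choice(easy_pool, size=num_neg - num_hard, replace=False)
    return np.concatenate([easy, hard])

easy_pool = np.arange(10_000)        # toy: the full item catalog
hard_pool = np.arange(2_000, 5_000)  # toy: items that Personalized PageRank ranks close to the query
negatives = sample_negatives(epoch=3, easy_pool=easy_pool, hard_pool=hard_pool)
```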

Node Embeddings via MapReduce

Algorithm 2 (mini-batch training) works well but is still inefficient at inference time because of duplicated computation. They use MapReduce to solve this problem.

  1. One MapReduce job maps all pins to a low-dimensional latent space, where the aggregation operation is performed (Algorithm 1, Line 1).
  2. Another MapReduce job joins the resulting pin representations with the ids of the boards they occur in, and each board embedding is computed by pooling the features of its (sampled) neighbors.

The first layer uses these 2 MapReduce jobs, the second layer uses 2 more, and so on: k layers require 2k MapReduce jobs in total.
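A toy, single-process imitation of the two jobs for one layer (dictionaries stand in for distributed key-value pairs; the linear projection and mean-pooling are simplifications of mine):

```python
import numpy as np

def job1_map_pins(pin_features, W_pin):
    """Job 1 (per layer): project every pin into the low-dimensional latent space."""
    return {pin: feats @ W_pin for pin, feats in pin_features.items()}

def job2_pool_boards(pin_latent, board_to_pins):
    """Job 2 (per layer): join pin representations with the boards containing
    them and pool (here: mean) to get board embeddings."""
    return {board: np.mean([pin_latent[p] for p in pins], axis=0)
            for board, pins in board_to_pins.items()}

rng = np.random.default_rng(0)
pin_features = {pin: rng.normal(size=32) for pin in range(6)}
board_to_pins = {"board_a": [0, 1, 2], "board_b": [3, 4, 5]}
W_pin = rng.normal(size=(32, 16))

pin_latent = job1_map_pins(pin_features, W_pin)          # 2 jobs for layer 1 ...
board_emb = job2_pool_boards(pin_latent, board_to_pins)  # ... and 2k jobs for k layers
```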

Efficient nearest-neighbor lookups

Approximate KNN (A. Andoni and P. Indyk. 2006. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. In FOCS) can be computed efficiently via locality-sensitive hashing. After the hash functions are computed, retrieval is implemented as a two-level retrieval process based on the Weak AND operator.
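A minimal locality-sensitive-hashing sketch using random hyperplanes (one standard LSH family for cosine similarity; the Weak AND retrieval layer mentioned above is not shown):

```python
import numpy as np

def lsh_signatures(embeddings, num_bits=16, seed=0):
    """Hash each embedding to a bit signature: the sign of its dot product with
    random hyperplanes. Vectors with high cosine similarity tend to share bits."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(embeddings.shape[1], num_bits))
    return (embeddings @ planes > 0).astype(np.uint8)

def approx_neighbors(query_sig, item_sigs, top_k=5):
    """Rank items by how many signature bits they share with the query."""
    matches = (item_sigs == query_sig).sum(axis=1)
    return np.argsort(-matches)[:top_k]

rng = np.random.default_rng(1)
items = rng.normal(size=(1000, 64))
sigs = lsh_signatures(items)
nearest = approx_neighbors(sigs[0], sigs)  # approximate neighbors of item 0
```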

Experiment

Visual feature: visual embeddings (4,096 dimensions) from the 6th layer of a VGG-16 network.

Annotation feature: 256-dimensional embeddings from a word2vec-based model (T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS).

Offline performance compared with baseline models:

What is MRR?
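MRR is Mean Reciprocal Rank: for each query, take 1 divided by the rank at which the positive item appears in the returned list, then average over all queries. A tiny sketch of plain MRR:

```python
import numpy as np

def mean_reciprocal_rank(ranks):
    """ranks[i] = 1-based position of the positive item for query i."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

print(mean_reciprocal_rank([1, 3, 10]))  # (1 + 1/3 + 1/10) / 3 ≈ 0.478
```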

Another offline evaluation (based on embedding): kurtosis

Intuition: the more spread out the distribution of cosine similarities between random pairs of items, the better the embedding.

If all items are at about the same distance from each other (i.e., the distances are tightly clustered), then the embedding space does not have enough “resolution” to distinguish between items of different relevance.

PinSage has the most spread out distribution. In particular, the kurtosis of the cosine similarities of PinSage embeddings is 0.43, compared to 2.49 for annotation embeddings and 1.20 for visual embeddings.
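A sketch of how this spread can be measured, assuming a matrix of item embeddings: sample random pairs, compute their cosine similarities, and take the kurtosis of that distribution (a lower, less peaked value means the similarities are more spread out).

```python
import numpy as np
from scipy.stats import kurtosis

def cosine_similarity_kurtosis(embeddings, num_pairs=100_000, seed=0):
    """Kurtosis of cosine similarities between random item pairs."""
    rng = np.random.default_rng(seed)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    i = rng.integers(0, len(normed), size=num_pairs)
    j = rng.integers(0, len(normed), size=num_pairs)
    sims = np.einsum("ij,ij->i", normed[i], normed[j])
    return kurtosis(sims)  # excess kurtosis (0 for a normal distribution)

embeddings = np.random.default_rng(1).normal(size=(5000, 64))  # toy embeddings
print(cosine_similarity_kurtosis(embeddings))
```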

User Studies

The visual-features-only model correctly captures categories and visual similarity, but it does not understand image semantics. For example, it recommends food for a plant query (above image), and it focuses too much on the background (below image).

On the other hand, the Pixie model (graph only) is good at capturing semantics but not as good at finding the most visually similar images.

Training and Inference Runtime Analysis

Conclusion

Graph Convolutional Networks are useful for graph-based recommendation, yet they are hard to scale to web-scale scenarios.

Pinterest's solution combines efficient random walks, importance pooling, and curriculum training to ensure scalability and improve performance.

For offline evaluation, they report MRR (ranking order) and hit ratio, both commonly used for recommender systems. In addition, they plot the distribution of cosine similarities to measure the resolution of the embeddings.

Other related blogs:

WWW’17: Visual Discovery at Pinterest

KDD’15 Visual Search at Pinterest

RecSys16: Adaptive, Personalized Diversity for Visual Discovery

NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

RecSys ’17: Translation-based Recommendation

RecSys ’18: Causal Embeddings for Recommendation

Other references:

word2vec model: T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.

curriculum training: Y. Bengio, J. Louradour, R. Collobert, and J. Weston. 2009. Curriculum Learning. In ICML.

KDD:

https://www.kdd.org/kdd2020/

Best paper in RecSys:

https://recsys.acm.org/best-papers/

My Website:

https://light0617.github.io/#/


Written by Arthur Lee

A machine learning engineer in the Bay Area, United States
