NIPS’13: Distributed Representations of Words and Phrases and their Compositionality

Natural Language Processing paper challenge (3/30)

Why this paper?

It established the NLP foundation and make the huge impacts. It also called word2vec model.

What problem do they solve?

Given a word, the model will generate the embedding.

What is the embedding?

Embedding is a real number vector representing the meaning. It is usually dense, low-dimension comparing with the original vector.

There are 2 popular scenario to use embedding.

  1. We have a categorical feature (like city) for modeling. We can transfer the information with 1-hot encoding to a big and sparse vector, but we do not want to. In order to save time, space, we can encode the huge sparse vector to embedding.
  2. We want to compute the similarity of the 2 elements (let’s say city). With 1-hot encoding, every city is independent. However, we can encode to low-dimension vector(embedding) and easily to get the similarity.

What is the benefit for this model?

word meaning

Business embedding


We can treat each user-view-biz session as a sentence (like word2vec). And the model will also capture the single biz information with the nearby viewed biz in the same session. Surely, we can apply more complicated model for it since word2vec does not consider the order.

How does word2vec work?

They have three innovations in this paper.

  1. Skip-gram model -> basic model architecture
  2. Negative Sampling -> improve efficiency
  3. Subsampling of Frequent Words -> improve accuracy and efficiency

Skip-gram model

Consider the following sentence:

I like the dog.

If we set the window size as 4 (previous 2 + later 2)

we have the pairs

(I, like), (I, the)

(like, I), (like, the), (like dog)

(the, I), (the, like), (the dog)

(dog, like), (dog, the)

For each word, we update the weights of neural network, promote the correct pair and penalize to other words (I, sky), (I sun)…

They apply softmax function to get the probability and compare the target value and update the weights.

Negative Sampling

However, if we have 1M words and 100 dimension, we will have 100M neurons in the model, and each time we need to update it to process 1 word.

They proposals for each positive pair, they can use negative sampling.

For example, we have a pair (I, dog)

It will become +(I, dog), -(I, cat), -(I, apple), -(I, sky), -(I, boy), -(I, sun)

It only needs to update these 6 pairs (6 * 100 << 1M * 100)

Subsampling of Frequent Words

They also found the always appears all the time, so they use subsampling approach to discard the words with high frequency.

Discard probability:


t: constant threshold

f(.): frequency of the word

Other related blogs:

NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

Best paper in RecSys:

My Website:



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store