NIPS’13: Distributed Representations of Words and Phrases and their Compositionality
Natural Language Processing paper challenge (3/30)
Why this paper?
It established a foundation for modern NLP and has had a huge impact. The model is also called word2vec.
What problem do they solve?
Given a word, the model generates an embedding for it.
What is the embedding?
An embedding is a vector of real numbers that represents meaning. It is usually dense and low-dimensional compared with the original vector.
There are 2 popular scenarios for using embeddings.
- We have a categorical feature (like city) for modeling. We could turn it into a big, sparse vector with 1-hot encoding, but we do not want to. To save time and space, we can encode that huge sparse vector as an embedding.
- We want to compute the similarity of 2 elements (let's say cities). With 1-hot encoding, every city is independent of the others. If we instead encode them as low-dimensional vectors (embeddings), we can easily compute similarity, as in the sketch below.
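As a toy illustration (the city vectors below are made up for this post, not learned by any model), 1-hot vectors give no notion of similarity, while dense embeddings make it easy to compare:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1-hot encoding: every city is orthogonal to every other city.
sf_onehot  = np.array([1, 0, 0])
la_onehot  = np.array([0, 1, 0])
print(cosine(sf_onehot, la_onehot))   # 0.0 -- no similarity signal at all

# Dense embeddings (hypothetical 3-d vectors): similar cities end up close.
sf_emb  = np.array([0.9, 0.1, 0.3])
la_emb  = np.array([0.8, 0.2, 0.4])
nyc_emb = np.array([0.1, 0.9, 0.7])
print(cosine(sf_emb, la_emb))    # ~0.98 -> similar
print(cosine(sf_emb, nyc_emb))   # ~0.36 -> less similar
```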
What are the benefits of this model?
Word meaning
As in the paper's original idea, it encodes a word as an embedding, assigning a meaningful representation to it.
Business embedding
We can also get business embeddings with the word2vec model.
How?
We can treat each user-view-biz session as a sentence (just like word2vec treats text). The model then captures information about a single business from the nearby businesses viewed in the same session. Of course, we can apply a more complicated model here, since word2vec does not consider order.
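A minimal sketch of this idea, assuming we already have sessions as lists of business IDs and choosing gensim as the library (the session data and hyperparameters below are made up, not from the paper):

```python
from gensim.models import Word2Vec

# Each user-view-biz session is treated as a "sentence" of business IDs.
sessions = [
    ["biz_12", "biz_7", "biz_42", "biz_7"],
    ["biz_42", "biz_99", "biz_12"],
    ["biz_7", "biz_42", "biz_3"],
]

# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling.
model = Word2Vec(sentences=sessions, vector_size=32, window=2,
                 min_count=1, sg=1, negative=5, epochs=50, seed=1)

print(model.wv["biz_42"])               # the learned business embedding
print(model.wv.most_similar("biz_42"))  # businesses viewed in similar contexts
```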
How does word2vec work?
There are three innovations in this paper:
- Skip-gram model -> basic model architecture
- Negative Sampling -> improve efficiency
- Subsampling of Frequent Words -> improve accuracy and efficiency
Skip-gram model
Consider the following sentence:
I like the dog.
If we set the window size to 4 (2 previous + 2 following words),
we have the following pairs:
(I, like), (I, the)
(like, I), (like, the), (like, dog)
(the, I), (the, like), (the, dog)
(dog, like), (dog, the)
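A small helper (my own illustration, not code from the paper) that generates exactly these (center, context) pairs:

```python
def skipgram_pairs(tokens, half_window=2):
    """Yield (center, context) pairs within +/- half_window tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "like", "the", "dog"]))
# [('I', 'like'), ('I', 'the'),
#  ('like', 'I'), ('like', 'the'), ('like', 'dog'),
#  ('the', 'I'), ('the', 'like'), ('the', 'dog'),
#  ('dog', 'like'), ('dog', 'the')]
```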
For each pair, we update the weights of the neural network, promoting the correct pairs and penalizing other words such as (I, sky), (I, sun), …
They apply the softmax function to get a probability distribution over the vocabulary, compare it with the target, and update the weights.
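A sketch of what the full-softmax probability looks like (vocabulary size, dimensions, and weights below are illustrative); it makes concrete the cost that the next section attacks:

```python
import numpy as np

vocab_size, dim = 10_000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (vocab_size, dim))   # input (center word) vectors
W_out = rng.normal(0, 0.1, (vocab_size, dim))  # output (context word) vectors

def full_softmax_probs(center_id):
    """p(context | center) over the whole vocabulary: O(vocab_size * dim)."""
    scores = W_out @ W_in[center_id]  # one dot product per vocabulary word
    scores -= scores.max()            # for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()

probs = full_softmax_probs(center_id=42)
print(probs.shape)  # (10000,): every training pair touches every output row
```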
Negative Sampling
However, if we have 1M words and 100 dimensions, we have 100M output weights in the model, and we would need to update all of them to process a single word.
They propose that for each positive pair, we can use negative sampling instead.
For example, we have a pair (I, dog)
It will become +(I, dog), -(I, cat), -(I, apple), -(I, sky), -(I, boy), -(I, sun)
We only need to update these 6 pairs (6 * 100 << 1M * 100).
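A simplified sketch of one negative-sampling update step (my reconstruction of the idea, not the paper's original C code; names, sizes, and learning rate are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(center_id, pos_id, neg_ids, W_in, W_out, lr=0.025):
    """Update only the rows for the 1 positive word and the k negative words."""
    v_c = W_in[center_id]
    grad_center = np.zeros_like(v_c)
    for word_id, label in [(pos_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        u = W_out[word_id]
        g = sigmoid(np.dot(u, v_c)) - label   # prediction error for this pair
        grad_center += g * u
        W_out[word_id] = u - lr * g * v_c     # touch only 1 + k output rows
    W_in[center_id] = v_c - lr * grad_center  # touch 1 input row

# +(I, dog) plus 5 negatives: only 6 output rows (and 1 input row) are updated.
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (1000, 100))
W_out = rng.normal(0, 0.1, (1000, 100))
neg_sampling_step(0, 1, [5, 17, 42, 99, 123], W_in, W_out)
```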
Subsampling of Frequent Words
They also found that very frequent words like "the" appear all the time, so they use a subsampling approach to discard words with high frequency.
Discard probability: P(w_i) = 1 - sqrt(t / f(w_i))
Where
t: constant threshold (around 1e-5 in the paper)
f(.): frequency of the word
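A quick numeric sketch of this rule (the word frequencies below are made up; t = 1e-5 follows the value mentioned in the paper):

```python
def discard_prob(freq, t=1e-5):
    """P(discard w) = 1 - sqrt(t / f(w)); never below 0 for rare words."""
    return max(0.0, 1.0 - (t / freq) ** 0.5)

# f(w): fraction of the corpus each word accounts for (made-up numbers).
for word, freq in [("the", 0.05), ("dog", 0.0005), ("word2vec", 0.00001)]:
    print(f"{word:10s} f={freq:.5f}  discard prob = {discard_prob(freq):.3f}")
# "the"      -> ~0.986 (almost always dropped)
# "dog"      -> ~0.859
# "word2vec" -> 0.000 (always kept)
```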
Other related blogs:
COLING’14: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts
NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence
Best paper in RecSys:
https://recsys.acm.org/best-papers/