# NIPS’13: Distributed Representations of Words and Phrases and their Compositionality

# Natural Language Processing paper challenge (3/30)

Why this paper?

It laid a foundation for modern NLP and has had a huge impact. The model it describes is also known as word2vec.

What problem do they solve?

Given a word, the model will generate the embedding.

What is the embedding?

An embedding is a vector of real numbers that represents meaning. It is usually dense and low-dimensional compared with the original (for example, 1-hot) vector.

There are 2 popular scenarios for using embeddings.

- We have a categorical feature (like city) for **modeling**. We could represent it with 1-hot encoding as a big, sparse vector, but we do not want to. To save time and space, we can encode the huge sparse vector as an embedding.
- We want to compute the **similarity** of 2 elements (let’s say cities). With 1-hot encoding, every city is independent of every other. However, if we encode each one as a low-dimensional vector (embedding), we can easily compute the similarity.
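To make the similarity point concrete, here is a minimal sketch comparing 1-hot vectors with dense vectors. The city names and the 2-dimensional embedding values are made up for illustration; real embeddings would be learned by a model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cities = ["Taipei", "Tokyo", "Paris"]

# 1-hot encoding: one dimension per city, a single 1.0 in each vector.
one_hot = {
    city: [1.0 if i == j else 0.0 for j in range(len(cities))]
    for i, city in enumerate(cities)
}

# Hypothetical dense embeddings, hand-picked for this example only.
embedding = {
    "Taipei": [0.9, 0.1],
    "Tokyo": [0.8, 0.2],
    "Paris": [0.1, 0.9],
}

# Any two different 1-hot vectors are orthogonal, so similarity is always 0.
print(cosine(one_hot["Taipei"], one_hot["Tokyo"]))

# Dense vectors can express that some cities are closer than others.
print(cosine(embedding["Taipei"], embedding["Tokyo"]))
print(cosine(embedding["Taipei"], embedding["Paris"]))
```

With 1-hot vectors the similarity carries no information, while the dense vectors can rank Tokyo as closer to Taipei than Paris is.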

What is the benefit for this model?

## word meaning

Following the paper's original idea, it encodes each word as an embedding that captures its meaning.

## Business embedding

We can also get business embeddings with the word2vec model.

How?

We can treat each session of businesses viewed by a user as a sentence, with each business playing the role of a word. The model then captures information about a single business from the other businesses viewed nearby in the same session. Of course, we could apply a more complicated model, since word2vec does not consider the **order** of items within the window.

How does word2vec work?

The paper relies on three techniques.

- Skip-gram model -> basic model architecture
- Negative Sampling -> improve efficiency
- Subsampling of Frequent Words -> improve accuracy and efficiency

**Skip-gram model**

Consider the following sentence:

**I like the dog.**

If we set the window size to 4 (2 previous words + 2 later words), we have the pairs

(I, like), (I, the)

(like, I), (like, the), (like, dog)

(the, I), (the, like), (the, dog)

(dog, like), (dog, the)
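The pair generation can be sketched in a few lines. The function name `skipgram_pairs` and the parameter `half_window` are names chosen for this example, not from the paper.

```python
def skipgram_pairs(tokens, half_window=2):
    """Return (center, context) pairs for every token in the sequence.

    half_window is the number of words taken on each side of the center,
    so half_window=2 matches a total window size of 4.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - half_window)
        end = min(len(tokens), i + half_window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["I", "like", "the", "dog"])
for pair in pairs:
    print(pair)
```

The same function applies unchanged to a session of business IDs instead of a sentence of words.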

For each pair, we update the weights of the neural network to promote the correct pair and penalize pairs with other words, such as (I, sky), (I, sun)…

The model applies a softmax function over the vocabulary to get a probability for each candidate context word, compares it with the target value, and updates the weights.
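As a rough illustration of the softmax step, the sketch below scores every vocabulary word against a center word by dot product and normalizes the scores into probabilities. The tiny vocabulary and the 2-dimensional vectors are hypothetical values picked for this example; in training they would be learned.

```python
import math

def softmax_probs(center_vec, output_vecs):
    """Probability of each vocabulary word being a context of the center word."""
    scores = {
        word: sum(a * b for a, b in zip(center_vec, vec))
        for word, vec in output_vecs.items()
    }
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical input vector for the center word "I".
center_i = [1.0, 0.5]

# Hypothetical output vectors for a 3-word vocabulary.
output_vecs = {
    "like": [1.0, 1.0],
    "the": [0.5, 0.5],
    "sky": [-1.0, 0.0],
}

probs = softmax_probs(center_i, output_vecs)
print(probs)
```

Note that the normalizing sum runs over the whole vocabulary, which is what makes the full softmax expensive for large vocabularies.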

**Negative Sampling**

However, if the vocabulary has 1M words and the embedding has 100 dimensions, the output layer alone has 100M weights, and a full softmax has to update all of them to process 1 word.

The authors propose negative sampling: for each positive pair, sample a few negative pairs instead of scoring the whole vocabulary.

For example, take the pair (I, dog).

It becomes +(I, dog), -(I, cat), -(I, apple), -(I, sky), -(I, boy), -(I, sun)

Only the output weights for these 6 words need to be updated (6 * 100 << 1M * 100).
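The per-pair objective can be sketched as follows: push the score of the positive pair up through a sigmoid and push the scores of the sampled negative pairs down. The vectors are hypothetical 2-dimensional values chosen for this example, and the negatives are passed in directly. The paper draws them from a noise distribution (the unigram distribution raised to the 3/4 power), which this sketch leaves out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center_vec, positive_vec, negative_vecs):
    """Loss for one positive pair and its sampled negatives (lower is better)."""
    # Positive pair: we want sigmoid(score) close to 1.
    loss = -math.log(sigmoid(dot(positive_vec, center_vec)))
    # Negative pairs: we want sigmoid(-score) close to 1, i.e. a low score.
    for neg_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg_vec, center_vec)))
    return loss

# Hypothetical vectors: "I" as center, "dog" as the positive context word,
# and 5 negatives standing in for cat, apple, sky, boy, sun.
center_i = [1.0, 0.0]
dog = [2.0, 0.0]
negatives = [[-2.0, 0.0]] * 5

good = negative_sampling_loss(center_i, dog, negatives)

# Swapping the roles gives a much higher loss: the model scores the
# negatives highly and the true context word poorly.
bad = negative_sampling_loss(center_i, [-2.0, 0.0], [[2.0, 0.0]] * 5)

print(good, bad)
```

Only the 6 output vectors passed in contribute to the loss, so only they receive gradient updates.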

**Subsampling of Frequent Words**

They also found that words like `the` appear all the time, so they use a subsampling approach to randomly discard occurrences of high-frequency words.

Discard probability:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

Where

t: a constant threshold (the paper suggests a value around 10^-5)

f(w_i): the frequency of the word w_i
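A minimal sketch of the subsampling rule, using the paper's discard probability of 1 - sqrt(t / f(w)). The relative frequencies below are invented for illustration, and clamping negative values to 0 (for words rarer than the threshold) is a choice made in this sketch.

```python
import math
import random

def discard_probability(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency freq."""
    # Words with freq <= t would get a negative value, so clamp to 0 (always keep).
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=None):
    """Randomly drop occurrences of frequent words from a token list."""
    rng = rng or random.Random(0)  # fixed seed so the example is repeatable
    return [w for w in tokens if rng.random() >= discard_probability(freqs[w], t)]

# Hypothetical relative frequencies in a large corpus.
freqs = {"the": 0.05, "dog": 1e-4, "aardvark": 1e-6}

for word, freq in freqs.items():
    print(word, discard_probability(freq))

kept = subsample(["the", "dog", "the", "aardvark"], freqs)
print(kept)
```

Very frequent words are dropped almost every time, while words at or below the threshold are never dropped.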
