ELMo, BERT, GPT
Today I will write up my notes for Hung-yi Lee's video:
First, he talks about several encoding techniques:
The basic idea of encoding is:
- Converting a string to a vector makes it easy to store and compute with.
- Keep the information: the distance/similarity between vectors should reflect the similarity between the original items.
1-hot encoding
It is easy to implement, but the memory cost is huge and the distance between any two words is the same.
Consider encoding cat, dog, and apple:
cat: [1, 0, 0]
dog: [0, 1, 0]
apple: [0, 0, 1]
After we convert them to vectors, the pairwise similarities are all the same. Yet the distance between dog and cat should be smaller than the distance between dog and apple.
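To see this concretely, here is a tiny sketch (using numpy; the 3-word vocabulary and the Euclidean distance are just for illustration):

```python
# 1-hot vectors are all equally far apart, so "cat vs. dog"
# looks exactly the same as "dog vs. apple".
import numpy as np

vocab = ["cat", "dog", "apple"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def euclidean(a, b):
    return np.linalg.norm(a - b)

print(euclidean(one_hot["cat"], one_hot["dog"]))    # sqrt(2) ~ 1.414
print(euclidean(one_hot["dog"], one_hot["apple"]))  # sqrt(2) ~ 1.414  (same!)
```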
Word Class encoding
You can imagine word classes as mapping words onto a 1-D categorical map (e.g., cat and dog fall into the animal class, apple into another class).
Within the same class, dog is now closer to cat, but the finer-grained differences between words in a class are lost: word classes do not capture this information.
Word Embedding encoding
It also maps a word to a vector, but in a much lower dimension than 1-hot encoding, and it captures more information than word classes.
There are many approaches to implementing word embeddings. Usually people like to use RNNs.
Word embeddings can be applied to many NLP tasks, such as semantic analysis. We can map each word to its embedding and use the embeddings as features. Hence, word embedding can be considered a kind of feature extractor.
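A toy sketch of this idea (the 2-D vectors below are made up for illustration; real embeddings are learned from data):

```python
# Word embeddings as a feature extractor: similar words get similar vectors,
# and a sentence can be represented by combining its word vectors.
import numpy as np

embedding = {
    "cat":   np.array([0.9, 0.1]),
    "dog":   np.array([0.8, 0.2]),
    "apple": np.array([0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embedding["cat"], embedding["dog"]))    # high similarity
print(cosine(embedding["dog"], embedding["apple"]))  # low similarity

# As features: represent a sentence by the average of its word vectors.
sentence = ["cat", "dog"]
features = np.mean([embedding[w] for w in sentence], axis=0)
```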
word type vs. word token
A word can have different meanings!
bank: the place for money vs. the land next to a river.
In the typical embedding approach, one word type maps to one embedding. In fact, this is not good enough.
Now we prefer that each word token (each occurrence, with its own meaning in context) has its own embedding!
There are 3 models that achieve this goal!
ELMo (Embeddings from Language Models)
An RNN-based language model -> trained to predict the next word; its hidden states become the contextual embeddings.
However, the RNN has multiple layers, so which layer's embedding should we use?
I want them all! ELMo takes a weighted sum of the embeddings from every layer, and the weights are learned together with the downstream task.
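A small sketch of the "I want them all" idea, assuming we already have the per-layer outputs (shapes and weight values below are made up):

```python
# Combine the embeddings from every layer with learned weights:
# the layer weights alpha and the scale gamma are trained with the downstream task.
import numpy as np

num_layers, seq_len, dim = 3, 5, 4
layer_outputs = np.random.randn(num_layers, seq_len, dim)  # one embedding per layer

alpha = np.array([0.2, 0.3, 0.5])   # learned layer weights (softmax-normalized)
gamma = 1.0                          # learned overall scale

elmo_embedding = gamma * np.einsum("l,lsd->sd", alpha, layer_outputs)
print(elmo_embedding.shape)          # (seq_len, dim): one contextual vector per word
```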
BERT (Bidirectional Encoder Representations from Transformers)
BERT = the encoder of the Transformer.
No labeled data is needed!
The goal of BERT: sentence -> an embedding for each word in the sentence.
How to train BERT?
1. Masked LM
Randomly replace some words with [MASK] and train the model to predict the original words.
2. Next sentence prediction
Given two sentences, predict whether the second one actually follows the first; a [CLS] token is placed at the beginning and its output is fed to the classifier.
Why is the classifier token put at the beginning?
Because BERT uses self-attention, not an RNN.
For a self-attention model, every position attends to the whole sentence, so the position of the classification token makes no difference.
In real life, it is best to use approach 1 and approach 2 together.
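A rough sketch of how the two training inputs could be built together (the 15% masking rate follows the BERT paper; the helper function and tokens here are simplified, not the real preprocessing pipeline):

```python
# Mask some tokens for Masked LM, and prepend [CLS] / insert [SEP]
# for Next Sentence Prediction.
import random

def build_bert_example(sent_a, sent_b, mask_prob=0.15):
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    masked, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)        # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)       # not predicted
    return masked, labels

tokens, mlm_labels = build_bert_example(
    ["the", "cat", "sat"], ["it", "was", "happy"])
print(tokens)   # e.g. ['[CLS]', 'the', '[MASK]', 'sat', '[SEP]', 'it', ...]
# The output at the [CLS] position is fed to a classifier that predicts
# whether sent_b really follows sent_a (Next Sentence Prediction).
```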
Case 4: the QA problem. Given a document and a question, BERT outputs two positions that mark the start and the end of the answer span in the document.
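A small sketch of how the answer span could be extracted from such start/end scores (the tokens and scores below are fake, just to show the idea):

```python
# Extraction-based QA: pick the best start position and the best end
# position (not before the start), and return the tokens in between.
import numpy as np

doc_tokens = ["gravity", "was", "described", "by", "newton"]
start_logits = np.array([0.1, 0.0, 0.2, 0.3, 2.5])
end_logits   = np.array([0.0, 0.1, 0.1, 0.2, 2.8])

start = int(np.argmax(start_logits))
end = int(np.argmax(end_logits[start:])) + start   # end must not precede start
print(" ".join(doc_tokens[start:end + 1]))          # -> "newton"
```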
What does each layer of the deep network learn?
GPT (Generative Pre-Training)
GPT = the decoder of the Transformer.
In its self-attention layers:
- q: query
- k: key
- v: value
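A minimal sketch of the scaled dot-product attention that q, k, v refer to (random values, single head, no mask):

```python
# Scaled dot-product attention: queries score against keys,
# and the scores weight a sum over the values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, dim = 4, 8
q = np.random.randn(seq_len, dim)   # queries
k = np.random.randn(seq_len, dim)   # keys
v = np.random.randn(seq_len, dim)   # values

scores = q @ k.T / np.sqrt(dim)      # how much each position attends to the others
attn = softmax(scores, axis=-1)
output = attn @ v                    # weighted sum of values
print(output.shape)                  # (seq_len, dim)
# In GPT's decoder, a causal mask would additionally block attention
# to future positions before the softmax.
```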
It is really good at reading comprehension, even in the zero-shot setting.
However, its zero-shot summarization and translation are not that good.
Here is an interesting demo for GPT:
My website: