ELMO, BERT, GPT

Arthur Lee
Apr 21, 2020


Today I will write up my notes on Hung-yi Lee's video.

First, he talks about several encoding techniques.

The main ideas behind encoding are:

  1. Convert a string to a vector, which is easy to store and compute with.
  2. Keep information: the distance/similarity between vectors should reflect the relationship between the original items.

1-hot encoding

It is easy to implement, but the memory cost is huge and the distance between every pair of words is the same.

Consider encoding cat, dog, and apple:

cat: [1, 0, 0]

dog: [0, 1, 0]

apple: [0, 0, 1]

After we convert the words to vectors, the similarities are all the same. Yet the distance between dog and cat should be smaller than the distance between dog and apple.
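To make this concrete, here is a minimal sketch in plain NumPy (the toy three-word vocabulary is mine, not from the lecture) showing that every pair of distinct 1-hot vectors is exactly the same distance apart:

```python
import numpy as np

# A minimal sketch of 1-hot encoding for a toy vocabulary.
vocab = ["cat", "dog", "apple"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def euclidean(a, b):
    return np.linalg.norm(a - b)

# Every pair of distinct words has exactly the same distance (sqrt(2)),
# so the encoding cannot tell us that cat is more similar to dog than to apple.
print(euclidean(one_hot["cat"], one_hot["dog"]))    # 1.414...
print(euclidean(one_hot["cat"], one_hot["apple"]))  # 1.414...
```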

Word Class encoding

You can think of word classes as converting words into a 1D categorical map.

Dog and cat end up in the same class, but the class label alone cannot capture how similar they are to each other, nor how one class relates to another.
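As a rough sketch, a word-class encoding is little more than a dictionary from word to class ID (the classes below are made up for illustration):

```python
# A minimal sketch of word classes: every word is mapped to a single class ID.
word_class = {
    "cat": "animal",
    "dog": "animal",
    "apple": "plant",
    "banana": "plant",
}

# dog and cat share a class, but inside a class every word looks identical,
# and there is no notion of how related two different classes are.
print(word_class["dog"] == word_class["cat"])    # True
print(word_class["dog"] == word_class["apple"])  # False
```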

Word Embedding encoding

It also maps a word to a vector, but in a much lower dimension than 1-hot encoding, and it captures more information than word classes.

There are many approaches to implementing word embeddings. People often use an RNN.

Word embeddings can be applied to many NLP fields, such as semantic analysis. We can map each word to its embedding and use it as a feature, so word embedding can be considered a kind of feature extractor.
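Here is a minimal sketch of using an embedding table as a feature extractor; the vectors are invented for illustration, since real ones would come from a trained model:

```python
import numpy as np

# A minimal sketch of a word-embedding lookup table (vectors are made up;
# in practice they come from a trained model).
embedding = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Unlike 1-hot encoding, similar words can now have similar vectors.
print(cosine(embedding["cat"], embedding["dog"]))    # close to 1
print(cosine(embedding["cat"], embedding["apple"]))  # much smaller

# Used as a feature extractor: a sentence becomes a sequence of vectors.
features = [embedding[w] for w in ["dog", "cat"]]
```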

Word type vs. word token

A word can have different meanings!

bank: a place for money vs. the land next to a river.

In the typical embedding approach, one word type maps to one embedding. In fact, that is not good enough.

Now we prefer that each word token (each occurrence, with its own meaning in context) gets its own embedding!

There are three models that achieve this goal!

ELMO (Embeddings from Language Models)

An RNN-based language model -> predict the next word.

However, the RNN has multiple layers; which layer's representation should we use?

All of them! ELMO takes a weighted sum of every layer's representation, with the weights learned together with the downstream task.
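A minimal sketch of this weighted-sum idea, with made-up shapes and PyTorch as my choice of framework:

```python
import torch

# A minimal sketch of the ELMO idea: instead of picking one layer, take a
# learned weighted sum of every layer's hidden states. Shapes and values
# here are made up for illustration.
num_layers, seq_len, hidden_dim = 3, 5, 8
layer_outputs = torch.randn(num_layers, seq_len, hidden_dim)  # from the LM

# One scalar weight per layer, learned together with the downstream task.
alpha = torch.nn.Parameter(torch.zeros(num_layers))
weights = torch.softmax(alpha, dim=0)

# Weighted sum over the layer dimension -> one embedding per token.
elmo_embedding = (weights[:, None, None] * layer_outputs).sum(dim=0)
print(elmo_embedding.shape)  # torch.Size([5, 8])
```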

BERT (Bidirectional Encoder Representations from Transformers)

BERT = the encoder of the Transformer.

No labeled data is needed!

The goal of BERT: a sentence -> an embedding for each word in it.
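To see this goal in action, and to revisit the earlier bank example, here is a sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint (my choice of tooling, not something the lecture specifies). The two occurrences of "bank" end up with different vectors because their contexts differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A sketch of token-level (contextual) embeddings with a pretrained BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited money at the bank.",
    "We sat on the bank of the river.",
]

bank_vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    bank_vectors.append(hidden[tokens.index("bank")])

# The two "bank" tokens get different vectors because their contexts differ.
similarity = torch.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(similarity)
```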

How to train BERT?

  1. Masked LM

Randomly replace a word with the [MASK] token and train the model to predict what it was!
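A sketch of Masked LM in action, assuming the Hugging Face transformers library and its fill-mask pipeline with a pretrained BERT checkpoint (again my choice of tooling):

```python
from transformers import pipeline

# One word is replaced by [MASK]; BERT predicts what was there.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat sat on the [MASK]."):
    print(prediction["token_str"], prediction["score"])
```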

2. Next sentence prediction

Why is the classification token placed at the beginning of the sentence?

Because BERT uses self-attention, not an RNN.

In a self-attention model, the position of the classification token does not matter: every position attends to every other position, regardless of distance.

In real life, it is best to use approach 1 and approach 2 together.
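For next sentence prediction, the two sentences are packed into a single input; here is a sketch of the resulting layout, again assuming the Hugging Face tokenizer for bert-base-uncased:

```python
from transformers import AutoTokenizer

# A sketch of how a sentence pair is packed for Next Sentence Prediction.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("He went to the store.", "He bought some milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'he', 'went', 'to', 'the', 'store', '.', '[SEP]',
#  'he', 'bought', 'some', 'milk', '.', '[SEP]']
# The output at the [CLS] position is fed to a binary classifier that decides
# whether the second sentence actually follows the first.
```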

Case 4: the QA problem (extraction-based question answering), where BERT predicts the start and end positions of the answer span in the document.

What do the deep learning layers learn?

GPT (Generative Pre-Training)

GPT is the decoder of the Transformer.

q: query

k: key

v: value
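A minimal sketch of how q, k, and v are used in scaled dot-product self-attention (shapes and random values are made up for illustration; PyTorch is my choice of framework):

```python
import torch

# A minimal sketch of self-attention with query (q), key (k), value (v).
# One sequence of 4 tokens with dimension 8.
seq_len, d = 4, 8
x = torch.randn(seq_len, d)

# In a Transformer, q, k, v are linear projections of the same input.
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: every token attends to every other token.
scores = q @ k.T / d ** 0.5            # (seq_len, seq_len)
attn = torch.softmax(scores, dim=-1)   # attention weights
output = attn @ v                      # (seq_len, d)
print(output.shape)
```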

It is really good at reading comprehension, even zero-shot.

However, its summarization and translation results are not as good.
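As a small illustration of zero-shot generation, here is a sketch using the publicly released GPT-2 weights via the Hugging Face transformers library (my tooling choice, not the demo from the lecture):

```python
from transformers import pipeline

# Zero-shot text generation with pretrained GPT-2: no fine-tuning, just a prompt.
generator = pipeline("text-generation", model="gpt2")

prompt = "Machine learning is"
print(generator(prompt, max_length=40)[0]["generated_text"])
```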

Here’s an interesting demo for GPT:

My website:

https://light0617.github.io/#/


Arthur Lee

A machine learning engineer in the Bay Area in the United States