
NIPS’2017: Attention Is All You Need (Transformer)

Arthur Lee
3 min read · Aug 4, 2020

Natural Language Processing paper challenge (4/30)

Paper link: https://arxiv.org/abs/1706.03762

Why this paper?

The Transformer still has a huge impact on the NLP field. Popular models such as BERT, GPT-2, and GPT-3 are all Transformer-based. The architecture provides state-of-the-art general-purpose models for Natural Language Understanding (NLU) and Natural Language Generation (NLG).


What problem do they solve?

Given an input sentence, the model generates an output sentence (for example, machine translation from English to German). The Transformer addresses the long-range dependency problem of recurrent models (RNN, LSTM) and, because it processes all positions in parallel, trains in less time.

High-level architecture

Encoder and Decoder Stacks

Encoder: a stack of N = 6 identical layers, each combining multi-head self-attention with a position-wise fully connected feed-forward network

1. Multi-Head Attention (self-attention): all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. (A code sketch of a full encoder layer follows this list.)

2. Add & Norm: LayerNorm (normalized layer (x + Sublayer(x))

3. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
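
To make the encoder sublayers concrete, here is a minimal PyTorch sketch of one encoder layer (my own simplification, not the authors' implementation); the sizes d_model = 512, n_heads = 8, and d_ff = 2048 follow the base model in the paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Multi-head self-attention: queries, keys, and values all come from x.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise FFN: FFN(x) = max(0, xW1 + b1)W2 + b2
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each sublayer is wrapped as Add & Norm: LayerNorm(x + Sublayer(x))
        attn_out, _ = self.attn(x, x, x)   # q = k = v = x: self-attention
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

x = torch.randn(2, 10, 512)                # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)             # torch.Size([2, 10, 512])
```
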

Decoder: a stack of N = 6 identical layers, each with three sublayers:

1. Masked Multi-Head Attention (masked self-attention): the same as the self-attention layers in the encoder, except that a mask prevents leftward information flow in the decoder, which preserves the auto-regressive property (see the sketch after this list).

2. Multi-Head Attention (encoder-decoder attention):

queries come from the previous decoder layer

keys and values come from the output of the encoder

3. Add & Norm: LayerNorm (normalized layer (x + Sublayer(x))

4. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
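
And here is a minimal sketch of the decoder's two attention sublayers, again in PyTorch (the tensor shapes are illustrative assumptions): the boolean causal mask implements the masked self-attention, and the second call takes its queries from the decoder while its keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len, src_len = 512, 8, 10, 12
self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

tgt    = torch.randn(2, tgt_len, d_model)   # decoder input (shifted outputs)
memory = torch.randn(2, src_len, d_model)   # output of the encoder stack

# Causal mask: True = blocked, so position i only attends to positions <= i.
causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# 1. Masked self-attention: the mask prevents leftward information flow.
h, _ = self_attn(tgt, tgt, tgt, attn_mask=causal)

# 2. Encoder-decoder attention: queries from the decoder,
#    keys and values from the encoder output.
out, _ = cross_attn(h, memory, memory)
print(out.shape)                            # torch.Size([2, 10, 512])
```
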

There is a great YouTube talk about this paper.

Other related blogs:

COLING’14: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts

NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

NIPS’13: Distributed Representations of Words and Phrases and their Compositionality

Best papers in RecSys:

https://recsys.acm.org/best-papers/

My Website:

https://light0617.github.io/#/
