
NIPS’2017: Attention Is All You Need (Transformer)

Arthur Lee
3 min read · Aug 4, 2020

Natural Language Processing paper challenge (4/30)

Paper link: https://arxiv.org/abs/1706.03762

Why this paper?

The Transformer still has a huge impact on the NLP field. Popular models such as BERT, GPT-2, and GPT-3 are all Transformer-based. The architecture provides state-of-the-art general-purpose models for Natural Language Understanding (NLU) and Natural Language Generation (NLG).


What problem do they solve?

Given an input sentence, the model generates an output sentence (for example, machine translation from English to German). The Transformer addresses the long-range dependency problem of recurrent models (RNN, LSTM) and, because it processes all positions in parallel, trains in less time.

High-level architecture

Encoder and Decoder Stacks

Encoder: a stack of N = 6 identical layers, each combining multi-head self-attention with a position-wise fully connected feed-forward network

1. Multi-Head Attention (self-attention): all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. (A code sketch of a full encoder layer follows this list.)

2. Add & Norm: LayerNorm (normalized layer (x + Sublayer(x))

3. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
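
To make the encoder sublayers concrete, here is a minimal PyTorch sketch of one encoder layer (my own simplification, not the authors' implementation); the sizes d_model = 512, n_heads = 8, and d_ff = 2048 follow the base model in the paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Multi-head self-attention: queries, keys, and values all come from x.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise FFN: FFN(x) = max(0, xW1 + b1)W2 + b2
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each sublayer is wrapped as Add & Norm: LayerNorm(x + Sublayer(x))
        attn_out, _ = self.attn(x, x, x)   # q = k = v = x: self-attention
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

x = torch.randn(2, 10, 512)                # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)             # torch.Size([2, 10, 512])
```
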

Decoder: a stack of N = 6 identical layers, each with three sublayers:

1. Masked Multi-Head Attention (masked self-attention): the same as the self-attention layers in the encoder, except that a mask prevents leftward information flow in the decoder, which preserves the auto-regressive property (see the sketch after this list).

2. Multi-Head Attention (encoder-decoder attention):

queries come from the previous decoder layer

keys and values come from the output of the encoder

3. Add & Norm: LayerNorm (normalized layer (x + Sublayer(x))

4. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
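
And here is a minimal sketch of the decoder's two attention sublayers, again in PyTorch (the tensor shapes are illustrative assumptions): the boolean causal mask implements the masked self-attention, and the second call takes its queries from the decoder while its keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len, src_len = 512, 8, 10, 12
self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

tgt    = torch.randn(2, tgt_len, d_model)   # decoder input (shifted outputs)
memory = torch.randn(2, src_len, d_model)   # output of the encoder stack

# Causal mask: True = blocked, so position i only attends to positions <= i.
causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# 1. Masked self-attention: the mask prevents leftward information flow.
h, _ = self_attn(tgt, tgt, tgt, attn_mask=causal)

# 2. Encoder-decoder attention: queries from the decoder,
#    keys and values from the encoder output.
out, _ = cross_attn(h, memory, memory)
print(out.shape)                            # torch.Size([2, 10, 512])
```
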

There is a great YouTube talk about this paper.

Other related blogs:

COLING’14: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts

NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

NIPS’13: Distributed Representations of Words and Phrases and their Compositionality

Best papers in RecSys:

https://recsys.acm.org/best-papers/

My Website:

https://light0617.github.io/#/
