
NIPS’2017: Attention Is All You Need (Transformer)

Natural Language Processing paper challenge (4/30)

paper link

Why this paper?

The Transformer still has a huge impact on the NLP field. Popular models like BERT, GPT-2, and GPT-3 are all Transformer-based. It provides state-of-the-art general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG).

more detail

What problem do they solve?

Given an input sentence, the model generates another sentence (for example, a translation). It addresses the long-range dependency problem of recurrent models (RNN, LSTM) and computes in less time because it can be parallelized.

High level architecture

Encoder and Decoder Stacks

Encoder: multi-head self-attention + position-wise fully connected feed-forward network

1. Multi-Head Attention (self-attention): all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. (A code sketch of a full encoder layer follows this list.)

2. Add & Norm: a residual connection around the sub-layer followed by layer normalization, i.e. LayerNorm(x + Sublayer(x))

3. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
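To make these sub-layers concrete, here is a minimal PyTorch sketch of scaled dot-product attention, multi-head attention, and one encoder layer, assuming the base-model sizes from the paper (d_model = 512, h = 8 heads, d_ff = 2048). Dropout, embeddings, and positional encoding are omitted, and the class and function names are my own, not from an official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k); computes softmax(QK^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block masked positions
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # project, then split d_model into n_heads heads of size d_k
        split = lambda x: x.view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask)
        # concatenate the heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_k)
        return self.w_o(out)

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        # FFN(x) = max(0, xW1 + b1)W2 + b2
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.self_attn(x, x, x, mask))  # Add & Norm after self-attention
        return self.norm2(x + self.ffn(x))                 # Add & Norm after feed-forward
```

The paper stacks N = 6 identical layers like this one to form the encoder.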

Decoder:

1. Masked Multi-Head Attention (masked self-attention): the same as the self-attention layers in the encoder, but we prevent leftward information flow in the decoder, preserving the auto-regressive property, by masking out attention to future positions. (A code sketch of a full decoder layer follows this list.)

2. Multi-Head Attention (encoder-decoder attention):

queries come from the previous decoder layer

keys and values come from the output of the encoder stack

3. Add & Norm: a residual connection around the sub-layer followed by layer normalization, i.e. LayerNorm(x + Sublayer(x))

4. Feed Forward: position-wise fully connected feed-forward network

FFN(x) = max(0, xW1 + b1)W2 + b2
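A matching sketch of the decoder side, reusing the MultiHeadAttention class from the encoder sketch above. The causal_mask helper is just an illustrative name; dropout and the final linear + softmax over the target vocabulary are again omitted.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len):
    # Lower-triangular mask: position i may only attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)   # masked self-attention
        self.cross_attn = MultiHeadAttention(d_model, n_heads)  # encoder-decoder attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        # 1. masked self-attention over the decoder's own (shifted) outputs
        x = self.norm1(x + self.self_attn(x, x, x, tgt_mask))
        # 2. queries from the decoder, keys and values from the encoder output
        x = self.norm2(x + self.cross_attn(x, enc_out, enc_out, src_mask))
        # 3. position-wise feed-forward, then Add & Norm
        return self.norm3(x + self.ffn(x))
```

For example, with source and target embedding tensors of shape (batch, length, 512), enc_out = EncoderLayer()(src) and DecoderLayer()(tgt, enc_out, causal_mask(tgt.size(1))) run one layer of each; the paper stacks six of each, then applies a linear projection and softmax to produce the output distribution.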

There is a great YouTube video that discusses this paper.

Other related blogs:

COLING’14: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts

NAACL’19: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

NIPS’13: Distributed Representations of Words and Phrases and their Compositionality

Best paper in RecSys:

https://recsys.acm.org/best-papers/

My Website:

https://light0617.github.io/#/

A machine learning engineer in the Bay Area, United States
