[NeurIPS 2017] Attention Is All You Need

Only attention, no RNNs or CNNs → the Transformer for NLP
Published

July 22, 2025


1 Proposal

1.1 NLP → Sequence transduction model.

  • Sequence → sequence transformation
  • Prior work (1): uses RNNs or CNNs, with an encoder & decoder
  • Prior work (2): the best models connect the encoder and decoder through an attention mechanism

1.2 Proposes the Transformer architecture.

  • No RNNs [LSTM, GRU] or CNNs
  • Uses only attention mechanisms [specifically, self-attention].
  • Prior work: recurrent layers in encoder-decoder architectures → Transformer: multi-headed self-attention
  • Experimental results → Transformer: more parallelization & less training time.
  • Evaluation metric: BLEU score; strong results on the WMT 2014 [Workshop on Machine Translation] tasks.
    • English to German, English to French
  • Also uses fewer GPUs!

2 3. Model Architecture

2.1 Sequence transduction models

2.1.1 encoder-decoder structure

  • Encoder: [input] \(\mathbf{x} \rightarrow \mathbf{z}\) [output]
    • \(\mathbf{x}=\left( x_1, \ldots, x_n \right)\): Symbol sequence
    • \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
  • Decoder: [input] \(\mathbf{z} \rightarrow \mathbf{y}\) [output]
    • \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
    • \(\mathbf{y}=\left( y_1, \ldots, y_m \right)\): Symbol sequence

2.1.2 Full architecture

image.png

2.2 3.1. Encoder and Decoder Stacks

2.2.1 1st.

image.png
  • Inputs == input tokens
  • tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
  • embedding vector + positional encoding: Input.

2.2.2 2nd. == Encoder

image.png
  • N = 6 identical layers: input → 1st layer → 2nd layer → … → N-th layer → output
  • each layer == two sub-layers
    1. [Multi-Head Attention] Multi-Head self-Attention mechanism
    2. [Feed Forward] position-wise fully connected Feed-Forward Network [FFN]
    • [Add & Norm] \(\text{LayerNorm}\left( \right)\) function
  • Sub-layers’ outputs
    • \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
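The residual-plus-normalization pattern above can be sketched in NumPy. This is a minimal illustration only: the paper's LayerNorm also has learned gain and bias parameters, which are omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) over the feature dimension.
    # The learned gain/bias of real LayerNorm are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization.
    return layer_norm(x + sublayer(x))

# Example with an identity sub-layer on a (seq_len=3, d_model=4) input.
x = np.random.randn(3, 4)
out = sublayer_connection(x, lambda t: t)
print(out.shape)  # (3, 4)
```

Note that the residual connection requires every sub-layer to produce outputs of the same dimension \(d_\text{model}\) as its input.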

2.2.3 3rd.

image.png
  • Outputs == output tokens
  • tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
  • embedding vector + positional encoding: Output.

2.2.4 4th. == Decoder

image.png
  • N = 6 identical layers: input → 1st layer → 2nd layer → … → N-th layer → output
  • each layer == three sub-layers
    1. [Masked Multi-Head Attention]
      • Masking: ensures that the prediction for position i depends only on the outputs at positions less than i.
      • 1, 2, 3, …, i-1 → i
    2. [Multi-Head Attention] Multi-Head Attention mechanism
      • Note: multi-head attention is applied over the encoder output [keys and values come from the encoder].
    3. [Feed Forward] position-wise fully connected Feed-Forward Network [FFN]
    • [Add & Norm] \(\text{LayerNorm}\left( \right)\) function
  • Sub-layers’ outputs
    • \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
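The masking in the decoder's first sub-layer is commonly realized by adding \(-\infty\) to the disallowed attention scores before the softmax, so their weights become exactly 0. A minimal NumPy sketch of such a causal mask:

```python
import numpy as np

def causal_mask(n):
    # Position i may attend only to positions <= i (no look-ahead).
    # Entries above the diagonal get -inf; added to the attention
    # scores before softmax, they yield zero attention weight.
    mask = np.triu(np.ones((n, n)), k=1)  # 1s strictly above the diagonal
    return np.where(mask == 1, -np.inf, 0.0)

print(causal_mask(3))
# [[  0. -inf -inf]
#  [  0.   0. -inf]
#  [  0.   0.   0.]]
```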

2.2.5 5th. == 3.4.

image.png

2.3 3.2. Attention

  • Inputs
    • query vector: \(\mathbf{q}\) → dimension \(d_q = d_k\)
    • key vector: \(\mathbf{k}\) → dimension \(d_k\)
    • value vector: \(\mathbf{v}\) → dimension \(d_v\)
  • Output vector = weighted sum of the values, where each weight is computed from a (query, key) pair

image.png

2.3.1 3.2.1. Scaled Dot-Product Attention [single attention head]

image.png
  • \(\mathbf{q}, \mathbf{k}, \mathbf{v} \rightarrow \text{matrix} \ Q, K, V\)

    \[ \text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (1) \]

    • dot product → scaling → softmax
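Equation (1) translates almost directly into NumPy. A minimal sketch with queries, keys, and values packed as row vectors into matrices Q, K, V:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot product → scaling
    weights = softmax(scores)        # → softmax (rows sum to 1)
    return weights @ V               # weighted sum of the values

# Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v) → output (n_q, d_v)
Q = np.random.randn(2, 64)
K = np.random.randn(5, 64)
V = np.random.randn(5, 32)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 32)
```

The \(1/\sqrt{d_k}\) scaling keeps the dot products from growing large with \(d_k\), which would push the softmax into regions with extremely small gradients.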

2.3.2 3.2.2. Multi-Head Attention

image.png
  • Uses \(h=8\) parallel attention heads, with \(W\): parameter matrices

    \[ \text{MultiHead}\left(Q,K,V\right)=\text{Concat}\left(\text{head}_1,\ldots,\text{head}_h\right)W^O \\ \text{where} \quad \text{head}_i=\text{Attention}\left(QW_{i}^Q,KW_{i}^K,VW_{i}^V\right) \]

    \[ W_{i}^Q \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^K \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^V \in \mathbb{R}^{d_{\text{model}}\times d_v} \]

    \[ d_{\text{model}}=512,\ h=8 \rightarrow d_k=d_v=\frac{d_{\text{model}}}{h} = 64 \]
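The multi-head formula above can be sketched as follows, using random matrices in place of the learned projections \(W_i^Q, W_i^K, W_i^V, W^O\) (the sketch reuses the scaled dot-product attention of equation (1)):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, equation (1).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Random projections stand in for learned parameters.
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_v)) for _ in range(h)]
W_O = rng.standard_normal((h * d_v, d_model))

def multi_head(Q, K, V):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); concat, then W^O.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))  # self-attention: Q = K = V = x
print(multi_head(x, x, x).shape)  # (10, 512)
```

Because each head works in a reduced dimension \(d_k = d_\text{model}/h\), the total cost is similar to single-head attention with full dimensionality.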

2.3.3 3.2.3. Applications of Attention in our Model

Transformer uses multi-head attention in three different ways

  1. Encoder-Decoder Attention layers

    image.png
    1. previous decoder layer’s output → queries
    2. encoder layer’s output → memory keys and values
  2. Encoder’s self-attention layers [Q, K, V all come from the same place.]

    image.png
    1. previous encoder layer’s output → keys, values and queries
  3. Decoder’s self-attention layers

    1. previous decoder layer’s output → keys, values and queries [masked to block leftward information flow]

2.4 3.3. Position-wise Feed-Forward Networks

  • \[ \text{FFN}\left(\mathbf{x}\right)=\text{ReLU}\left(\mathbf{x}W_1 + \mathbf{b}_1\right)W_2 + \mathbf{b}_2 \quad (2) \\ \text{where} \ \text{ReLU}\left(x\right)=\max\left(0,x\right) \]

    • Judging from \(\mathbf{x}W_1\), the inputs appear to be treated as row vectors.
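Equation (2) as a minimal NumPy sketch, with the inner dimension \(d_{ff}=2048\) from the paper and random matrices standing in for the learned \(W_1, W_2, \mathbf{b}_1, \mathbf{b}_2\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # inner dimension d_ff from the paper

W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # Equation (2): ReLU(x W1 + b1) W2 + b2, applied to each
    # position (row) independently — hence "position-wise".
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # 10 positions as row vectors
print(ffn(x).shape)  # (10, 512)
```

The same (W1, b1, W2, b2) are shared across all positions within a layer; only the layers differ from one another.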

2.5 3.4. Embeddings and Softmax

image.png
  • input/output tokens → embedding vectors
  • Learned linear transformation & softmax function: decoder output → predicted next-token probabilities.
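The final step can be sketched as a linear projection to vocabulary logits followed by a softmax. The vocabulary size below is a made-up placeholder, and a random matrix stands in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000  # vocab_size is hypothetical

W_out = rng.standard_normal((d_model, vocab_size)) * 0.02

def next_token_probs(decoder_out):
    # Learned linear transformation → logits, then softmax → probabilities.
    logits = decoder_out @ W_out
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = next_token_probs(rng.standard_normal((10, d_model)))
print(p.shape)  # (10, 1000); each row sums to 1
```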

2.6 3.5. Positional Encoding

  • No RNNs or CNNs → the model by itself cannot use the ordering of the sequence.
  • → Inject positional encodings at the bottoms of the encoder and decoder stacks.
  • Various positional encodings are possible → the paper uses the following.

    \[ \text{PE}_{\left(pos, 2i\right)}=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\ \text{PE}_{\left(pos, 2i+1\right)}=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]

    • \(pos\): position
    • \(i\): dimension
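The sinusoidal encoding above, as a minimal NumPy sketch (even feature indices 2i get the sine, odd indices 2i+1 the cosine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None]  # (1, d_model/2): values 2i
    angle = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
print(pe[0, :4])  # [0. 1. 0. 1.] — sin(0)=0, cos(0)=1
```

Each dimension corresponds to a sinusoid of a different wavelength, so positions get a unique, smoothly varying code that can be summed with the token embeddings.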

3 4. Why Self-Attention

image.png