[NeurIPS 2017] Attention Is All You Need

Only attention, no RNNs or CNNs → the Transformer for NLP
Published

July 22, 2025


1 Proposal

1.1 NLP → Sequence transduction model.

  • Sequence → sequence transformation
  • Prior work (1): uses RNNs or CNNs, with an encoder & decoder
  • Prior work (2): the best models connect the encoder and decoder through an attention mechanism

1.2 Proposes the Transformer architecture.

  • No RNNs [LSTM, GRU] or CNNs
  • Uses only attention mechanisms [specifically, self-attention].
  • Prior work: recurrent layers in encoder-decoder architectures → Transformer: multi-headed self-attention
  • Experimental results → Transformer: more parallelization & less training time.
  • Evaluation metric: BLEU score; strong results on the WMT 2014 [Workshop on Machine Translation] tasks.
    • English to German, English to French
  • Also uses fewer GPUs!

2 3. Model Architecture

2.1 Sequence transduction models

2.1.1 encoder-decoder structure

  • Encoder: [input] \(\mathbf{x} \rightarrow \mathbf{z}\) [output]
    • \(\mathbf{x}=\left( x_1, \ldots, x_n \right)\): Symbol sequence
    • \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
  • Decoder: [input] \(\mathbf{z} \rightarrow \mathbf{y}\) [output]
    • \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
    • \(\mathbf{y}=\left( y_1, \ldots, y_m \right)\): Symbol sequence

2.1.2 Full architecture

image.png

2.2 3.1. Encoder and Decoder Stacks

2.2.1 1st.

image.png
  • Inputs == input tokens
  • tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
  • embedding vector + positional encoding: Input.

2.2.2 2nd. == Encoder

image.png
  • N = 6 identical layers: input → 1st layer → 2nd layer → … → N-th layer → output
  • each layer == two sub-layers
    1. [Multi-Head Attention] Multi-Head self-Attention mechanism
    2. [Feed Forward] position-wise fully connected Feed-Forward Network [FFN]
    • [Add & Norm] \(\text{LayerNorm}\left( \right)\) function
  • Sub-layers’ outputs
    • \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
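The residual-plus-normalization pattern above can be sketched in NumPy. This is a minimal illustration only: the paper's LayerNorm also has learned gain and bias parameters, which are omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) over the feature dimension.
    # The learned gain/bias of real LayerNorm are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization.
    return layer_norm(x + sublayer(x))

# Example with an identity sub-layer on a (seq_len=3, d_model=4) input.
x = np.random.randn(3, 4)
out = sublayer_connection(x, lambda t: t)
print(out.shape)  # (3, 4)
```

Note that the residual connection requires every sub-layer to produce outputs of the same dimension \(d_\text{model}\) as its input.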

2.2.3 3rd.

image.png
  • Outputs == output tokens
  • tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
  • embedding vector + positional encoding: Output.

2.2.4 4th. == Decoder

image.png
  • N = 6 identical layers: input → 1st layer → 2nd layer → … → N-th layer → output
  • each layer == three sub-layers
    1. [Masked Multi-Head Attention]
      • Masking: ensures that the prediction for position i depends only on the outputs at positions less than i.
      • 1, 2, 3, …, i-1 → i
    2. [Multi-Head Attention] Multi-Head Attention mechanism
      • Note: multi-head attention is applied over the encoder output [keys and values come from the encoder].
    3. [Feed Forward] position-wise fully connected Feed-Forward Network [FFN]
    • [Add & Norm] \(\text{LayerNorm}\left( \right)\) function
  • Sub-layers’ outputs
    • \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
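The masking in the decoder's first sub-layer is commonly realized by adding \(-\infty\) to the disallowed attention scores before the softmax, so their weights become exactly 0. A minimal NumPy sketch of such a causal mask:

```python
import numpy as np

def causal_mask(n):
    # Position i may attend only to positions <= i (no look-ahead).
    # Entries above the diagonal get -inf; added to the attention
    # scores before softmax, they yield zero attention weight.
    mask = np.triu(np.ones((n, n)), k=1)  # 1s strictly above the diagonal
    return np.where(mask == 1, -np.inf, 0.0)

print(causal_mask(3))
# [[  0. -inf -inf]
#  [  0.   0. -inf]
#  [  0.   0.   0.]]
```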

2.2.5 5th. == 3.4.

image.png

2.3 3.2. Attention

  • Inputs
    • query vector: \(\mathbf{q}\) → dimension \(d_q = d_k\)
    • key vector: \(\mathbf{k}\) → dimension \(d_k\)
    • value vector: \(\mathbf{v}\) → dimension \(d_v\)
  • Output vector = weighted sum of the values, where each weight is computed from a (query, key) pair

image.png

2.3.1 3.2.1. Scaled Dot-Product Attention [single attention head]

image.png
  • \(\mathbf{q}, \mathbf{k}, \mathbf{v} \rightarrow \text{matrix} \ Q, K, V\)

    \[ \text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (1) \]

    • dot product → scaling → softmax
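Equation (1) translates almost directly into NumPy. A minimal sketch with queries, keys, and values packed as row vectors into matrices Q, K, V:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot product → scaling
    weights = softmax(scores)        # → softmax (rows sum to 1)
    return weights @ V               # weighted sum of the values

# Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v) → output (n_q, d_v)
Q = np.random.randn(2, 64)
K = np.random.randn(5, 64)
V = np.random.randn(5, 32)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 32)
```

The \(1/\sqrt{d_k}\) scaling keeps the dot products from growing large with \(d_k\), which would push the softmax into regions with extremely small gradients.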

2.3.2 3.2.2. Multi-Head Attention

image.png
  • Uses \(h=8\) parallel attention heads, with \(W\): parameter matrices

    \[ \text{MultiHead}\left(Q,K,V\right)=\text{Concat}\left(\text{head}_1,\ldots,\text{head}_h\right)W^O \\ \text{where} \quad \text{head}_i=\text{Attention}\left(QW_{i}^Q,KW_{i}^K,VW_{i}^V\right) \]

    \[ W_{i}^Q \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^K \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^V \in \mathbb{R}^{d_{\text{model}}\times d_v} \]

    \[ d_{\text{model}}=512,\ h=8 \rightarrow d_k=d_v=\frac{d_{\text{model}}}{h} = 64 \]
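The multi-head formula above can be sketched as follows, using random matrices in place of the learned projections \(W_i^Q, W_i^K, W_i^V, W^O\) (the sketch reuses the scaled dot-product attention of equation (1)):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, equation (1).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Random projections stand in for learned parameters.
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_v)) for _ in range(h)]
W_O = rng.standard_normal((h * d_v, d_model))

def multi_head(Q, K, V):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); concat, then W^O.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))  # self-attention: Q = K = V = x
print(multi_head(x, x, x).shape)  # (10, 512)
```

Because each head works in a reduced dimension \(d_k = d_\text{model}/h\), the total cost is similar to single-head attention with full dimensionality.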

2.3.3 3.2.3. Applications of Attention in our Model

Transformer uses multi-head attention in three different ways

  1. Encoder-Decoder Attention layers

    image.png
    1. previous decoder layer’s output → queries
    2. encoder layer’s output → memory keys and values
  2. Encoder’s self-attention layers [Q, K, V all come from the same place.]

    image.png
    1. previous encoder layer’s output → keys, values and queries
  3. Decoder’s self-attention layers

    1. previous decoder layer’s output → keys, values and queries [masked to block leftward information flow]

2.4 3.3. Position-wise Feed-Forward Networks

  • \[ \text{FFN}\left(\mathbf{x}\right)=\text{ReLU}\left(\mathbf{x}W_1 + \mathbf{b}_1\right)W_2 + \mathbf{b}_2 \quad (2) \\ \text{where} \ \text{ReLU}\left(x\right)=\max\left(0,x\right) \]

    • Judging from \(\mathbf{x}W_1\), the inputs appear to be treated as row vectors.
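Equation (2) as a minimal NumPy sketch, with the inner dimension \(d_{ff}=2048\) from the paper and random matrices standing in for the learned \(W_1, W_2, \mathbf{b}_1, \mathbf{b}_2\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # inner dimension d_ff from the paper

W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # Equation (2): ReLU(x W1 + b1) W2 + b2, applied to each
    # position (row) independently — hence "position-wise".
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # 10 positions as row vectors
print(ffn(x).shape)  # (10, 512)
```

The same (W1, b1, W2, b2) are shared across all positions within a layer; only the layers differ from one another.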

2.5 3.4. Embeddings and Softmax

image.png
  • input/output tokens → embedding vectors
  • Learned linear transformation & softmax function: decoder output → predicted next-token probabilities.
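The final step can be sketched as a linear projection to vocabulary logits followed by a softmax. The vocabulary size below is a made-up placeholder, and a random matrix stands in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000  # vocab_size is hypothetical

W_out = rng.standard_normal((d_model, vocab_size)) * 0.02

def next_token_probs(decoder_out):
    # Learned linear transformation → logits, then softmax → probabilities.
    logits = decoder_out @ W_out
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = next_token_probs(rng.standard_normal((10, d_model)))
print(p.shape)  # (10, 1000); each row sums to 1
```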

2.6 3.5. Positional Encoding

  • No RNNs or CNNs → the model by itself cannot use the ordering of the sequence.
  • → Inject positional encodings at the bottoms of the encoder and decoder stacks.
  • Various positional encodings are possible → the paper uses the following.

    \[ \text{PE}_{\left(pos, 2i\right)}=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\ \text{PE}_{\left(pos, 2i+1\right)}=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]

    • \(pos\): position
    • \(i\): dimension
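The sinusoidal encoding above, as a minimal NumPy sketch (even feature indices 2i get the sine, odd indices 2i+1 the cosine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None]  # (1, d_model/2): values 2i
    angle = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
print(pe[0, :4])  # [0. 1. 0. 1.] — sin(0)=0, cos(0)=1
```

Each dimension corresponds to a sinusoid of a different wavelength, so positions get a unique, smoothly varying code that can be summed with the token embeddings.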

3 4. Why Self-Attention

image.png