[NeurIPS 2017] Attention Is All You Need
- References
- Paper: Attention Is All You Need
- Conference: NeurIPS 2017
- Presentation Slides
1 Proposal
1.1 NLP → Sequence transduction models.
- Sequence → Sequence conversion
- Prior work (1): RNNs or CNNs, using an encoder & decoder
- Prior work (2): the best models connect the encoder and decoder through an attention mechanism
1.2 Proposes the Transformer architecture.
- No RNNs [LSTM, GRU] or CNNs
- Uses only attention mechanisms [specifically, self-attention].
- Before: recurrent layers in encoder-decoder architectures → Transformer: multi-headed self-attention
- Experimental results → Transformer: more parallelization & less training time.
- Metric: BLEU score; strong results on the WMT 2014 [Workshop on Machine Translation] tasks.
- English to German, English to French
- Also uses fewer GPUs!
2 3. Model Architecture
2.1 Sequence transduction models
2.1.1 encoder-decoder structure
- Encoder: [input] \(\mathbf{x} \rightarrow \mathbf{z}\) [output]
- \(\mathbf{x}=\left( x_1, \ldots, x_n \right)\): Symbol sequence
- \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
- Decoder: [input] \(\mathbf{z} \rightarrow \mathbf{y}\) [output]
- \(\mathbf{z}=\left(z_1,\ldots,z_n\right)\): Continuous sequence
- \(\mathbf{y}=\left( y_1, \ldots, y_m \right)\): Symbol sequence
2.1.2 Full architecture

2.2 3.1. Encoder and Decoder Stacks
2.2.1 1st.

- Inputs == input tokens
- tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
- embedding vector + positional encoding: Input.
2.2.2 2nd. == Encoder

- N = 6 identical layers; input → 1st layer → 2nd layer → … → N-th layer → output
- each layer == two sub layers
- [Multi-Head Attention] Multi-Head self-Attention mechanism
- Feed Forward position-wise fully connected Feed-Forward Network [FFN]
- [Add & Norm] \(\text{LayerNorm}\left(\cdot\right)\) function
- Sub-layers’ outputs
- \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
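The Add & Norm step above can be sketched in NumPy. This is a minimal illustration: it omits the learnable gain and bias that full layer normalization carries, and the dummy sub-layer is only a stand-in.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection followed by layer norm.
    return layer_norm(x + sublayer(x))

# Example: a dummy sub-layer on a (seq_len, d_model) = (10, 512) input.
x = np.random.randn(10, 512)
out = add_and_norm(x, lambda t: t * 0.5)
```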
2.2.3 3rd.

- Outputs == output tokens
- tokens → embedding vector \(\in \mathbb{R}^{d_\text{model}}\)
- embedding vector + positional encoding: Output.
2.2.4 4th. == Decoder

- N = 6 identical layers; input → 1st layer → 2nd layer → … → N-th layer → output
- each layer == three sub layers
- [Masked Multi-Head Attention]
- Masking: ensures that predictions for position i depend only on the outputs at positions less than i.
- 1, 2, 3, …, i-1 → i
- [Multi-Head Attention] Multi-Head Attention mechanism
- Note: multi-head attention is applied over the encoder output (encoder-decoder attention, not self-attention).
- Feed Forward position-wise fully connected Feed-Forward Network [FFN]
- [Add & Norm] \(\text{LayerNorm}\left(\cdot\right)\) function
- [Masked Multi-Head Attention]
- Sub-layers’ outputs
- \(\text{LayerNorm}\left( \mathbf{x}+\text{Sublayer}\left(\mathbf{x}\right)\right)\), where \(\mathbf{x}\): input of each sub-layer
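The masking in the decoder's first sub-layer can be illustrated with a causal mask added to the raw attention scores before the softmax. This is a hedged sketch, not the paper's code; the idea is that positions j > i get score -inf, so their weights become 0 after the softmax.

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] = 0 where j <= i (position i may attend), -inf where j > i.
    allowed = np.tril(np.ones((n, n), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

# Adding the mask to raw scores zeroes out "future" attention weights
# after the softmax, so position i only sees positions up to i.
scores = np.random.randn(4, 4) + causal_mask(4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```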
2.2.5 5th. == 3.4.

2.3 3.2. Attention
- Inputs
- query vector: \(\mathbf{q}\) → dimension \(d_q = d_k\)
- key vector: \(\mathbf{k}\) → dimension \(d_k\)
- value vector: \(\mathbf{v}\) → dimension \(d_v\)
- Output vector = weighted sum of the values, where each weight ← (query, key) pair

2.3.1 3.2.1. Scaled Dot-Product Attention [single attention head]

\(\mathbf{q}, \mathbf{k}, \mathbf{v}\) are packed together into matrices \(Q, K, V\)
\[ \text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (1) \]
- dot product → scaling → softmax
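Eq. (1) maps directly to NumPy. A minimal single-head sketch; the concrete shapes (5 queries over 7 key/value pairs) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V: dot product -> scaling -> softmax -> weighted sum.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

# Example: 5 queries attending over 7 key/value pairs, d_k = d_v = 64.
Q = np.random.randn(5, 64)
K = np.random.randn(7, 64)
V = np.random.randn(7, 64)
out = attention(Q, K, V)  # one output row per query, each a weighted sum of V's rows
```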
2.3.2 3.2.2. Multi-Head Attention

Uses \(h=8\) parallel attention heads, with \(W\): parameter matrices
\[ \text{MultiHead}\left(Q,K,V\right)=\text{Concat}\left(\text{head}_1,\ldots,\text{head}_h\right)W^O \\ \text{where} \quad \text{head}_i=\text{Attention}\left(QW_{i}^Q,KW_{i}^K,VW_{i}^V\right) \]
\[ W_{i}^Q \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^K \in \mathbb{R}^{d_{\text{model}}\times d_k},\ W_{i}^V \in \mathbb{R}^{d_{\text{model}}\times d_v} \]
\[ d_{\text{model}}=512,\ h=8 \rightarrow d_k=d_v=\frac{d_{\text{model}}}{h}=64 \]
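The MultiHead equations above can be sketched as follows. The projection matrices here are randomly initialized stand-ins for the learned \(W_i^Q, W_i^K, W_i^V, W^O\); only the shapes and the concat-then-project structure follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64

# Randomly initialized projections (illustrative stand-ins for learned weights),
# scaled down to keep the scores in a reasonable range.
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((h, d_model, d_v)) / np.sqrt(d_model)
W_O = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); concat along features, project with W^O.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.standard_normal((10, d_model))
out = multi_head(x, x, x)  # self-attention: Q = K = V = x
```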
2.3.3 3.2.3. Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
Encoder-Decoder Attention layers

- previous decoder layer’s output → queries
- encoder’s output → memory keys and values
Encoder’s self-attention layers [Q, K, V all come from the same place.]

- previous encoder layer’s output → keys, values and queries
Decoder’s self-attention layers
2.4 3.3. Position-wise Feed-Forward Networks
\[ \text{FFN}\left(\mathbf{x}\right)=\text{ReLU}\left(\mathbf{x}W_1 + \mathbf{b}_1\right)W_2 + \mathbf{b}_2 \quad (2) \\ \text{where} \ \text{ReLU}\left(x\right)=\max\left(0,x\right) \]
- Judging from \(\mathbf{x}W_1\), the paper appears to treat \(\mathbf{x}\) as a row vector.
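Eq. (2) in NumPy, with one row of \(\mathbf{x}\) per position so the whole sequence is processed at once. The weights are random stand-ins; the inner dimension \(d_{ff}=2048\) is the value used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # d_ff = 2048 is the inner dimension from the paper

# Randomly initialized weights (stand-ins for the learned parameters).
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # Eq. (2): ReLU(x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # (seq_len, d_model)
out = ffn(x)
```

Because the same \(W_1, b_1, W_2, b_2\) are applied to every row, the network is "position-wise": each position is transformed identically and independently.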
2.5 3.4. Embeddings and Softmax

- input/output tokens → embedding vectors
- Learned linear transformation & softmax function: decoder output → predicted next-token probabilities.
2.6 3.5. Positional Encoding
- No RNNs or CNNs → the model cannot use the ordering of the sequence on its own.
- → Positional encodings are injected at the bottoms of the encoder and decoder stacks.
Various positional encodings are possible → the paper uses the following.
\[ \text{PE}_{\left(pos, 2i\right)}=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\ \text{PE}_{\left(pos, 2i+1\right)}=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]
- \(pos\): position
- \(i\): dimension
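The sinusoidal encoding above, vectorized over all positions and dimensions; `max_len = 50` is an arbitrary example length.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # pos: position index, shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # i: dimension-pair index, shape (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i  -> sin
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i+1 -> cos
    return pe

pe = positional_encoding(50, 512)  # one d_model-dim encoding per position
```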
3 4. Why Self-Attention
