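Transformer: Close Reading of the Paper + Code Implementation

These notes revisit the paper that introduced the classic Transformer architecture, combining a reading of the paper with a code implementation and commentary.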

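Prerequisites

The evolution, structure, and representative models of recurrent and convolutional neural networks;
Conventional attention mechanisms have become an integral part of sequence modeling and transduction models in many tasks, because they can model dependencies no matter how far apart two words are in the input or output sequence. In all but a few cases, however, such attention mechanisms are used together with a recurrent network;
Self-attention relates different positions of a single sequence in order to compute a hidden representation of that sequence. It had already been applied successfully to tasks such as reading comprehension and abstractive summarization;
Other work showed that end-to-end memory networks, which are based on a recurrent attention mechanism rather than the sequence-aligned recurrence of conventional RNNs, still perform well on simple-language question answering and language modeling;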

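Recurrent attention: a technique that combines attention with recurrent computation. Commonly cited examples include the Recurrent Attention Model (RAM) and the Recurrent Attention Convolutional Neural Network (RA-CNN);

Sequence-aligned recurrence: the way recurrent neural networks (RNNs) organize their computation. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning positions with steps in computation time, they generate a new hidden state $h_t$ from the previous hidden state $h_{t-1}$ and the input at position $t$. This inherently sequential nature precludes parallelization within a training example, which becomes critical for long sequences because memory constraints limit batching across examples.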

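Background

RNN, LSTM and GRU models had reached SOTA results on sequence modeling and transduction problems such as language modeling and machine translation;
Because recurrent models compute via sequence-aligned recurrence, parallelization is severely limited, especially for long sequences;
- Existing work improved efficiency with factorization tricks and conditional computation (the latter also improved model performance), but the fundamental constraint of sequential computation in recurrent networks remains;
CNN-based models such as ByteNet and ConvS2S use convolutions as the basic building block and compute the hidden representations of all input and output positions in parallel. However, the number of operations needed to relate two arbitrary input or output positions grows with the distance between them (linearly for ConvS2S, logarithmically for ByteNet).

This makes it harder to learn dependencies between distant positions.

Against this background, the paper proposes the Transformer, an architecture that avoids sequence-aligned recurrence altogether and relies entirely on attention to draw global dependencies between input and output.

According to the authors, the Transformer is the first transduction model that computes representations of its input and output relying entirely on self-attention, without sequence-aligned recurrence or convolution.

One major benefit of this architecture is that it allows far more parallelization. The Transformer also removes the high cost that CNN models pay for long-range dependencies: relating any two positions takes a constant number of operations.

The price is reduced effective resolution, caused by averaging attention-weighted positions. The paper counteracts this effect with multi-head attention.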

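Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure, and the Transformer is no exception:

The encoder (left box in the architecture figure) maps an input sequence of symbol representations $(x_1,x_2,\ldots,x_n)$ to a sequence of continuous representations $z=(z_1,\ldots,z_n)$;
Given $z$, the decoder (right box in the architecture figure) generates the output sequence $(y_1,\ldots,y_m)$ one element at a time. At each step the model is auto-regressive: previously generated symbols are consumed as additional input when generating the next one.

The Transformer follows this overall design, building the encoder and the decoder from stacked self-attention and point-wise fully connected layers, as shown in the architecture figure.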

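Encoder and Decoder Design

The encoder is a stack of $N=6$ identical layers (see the architecture figure). Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.

A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is $\text{LayerNorm}(x+\text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself.

To make the residual connections straightforward, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_\text{model}=512$;

The decoder is also a stack of $N=6$ identical layers, but each layer has three sub-layers. In addition to the two found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. The decoder's self-attention sub-layer is also modified with a mask; it processes the previously generated outputs that are fed back in (auto-regression, after all). Each sub-layer is again wrapped in a residual connection followed by layer normalization.

Why does the multi-head attention layer that processes the output embeddings need a mask?

The main reason is to keep the model from "cheating" (peeking ahead) during training, that is, from using information at positions after the one being predicted (future information).

The problem is analogous to training directly on the validation set, which would badly distort training.

The masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction of $y_t$ can depend only on the known outputs before position $t$ and never "sees" $y_t$ or anything after it.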

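Attention Design and Innovations

An attention function can be described as a mapping from a query and a set of key-value pairs to an output: (query, {key: value}) -> output.

The query, keys, values, and output are all vectors. These notes call the output the "attention score"; note that other texts often use that term for the softmax weights instead.

The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The algorithm described in the paper is the QKV formulation, an efficient way to implement attention.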

[!IMPORTANT]

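Readers may wonder why weighting the values by how well the query matches each key yields a useful attention score. In other words, why do the QKV formulation, and attention in general, work?

The key is that it mimics a core process of human cognition: selective focus. It lets the model dynamically concentrate its limited "cognitive resources" on the most relevant and important parts of the input, while ignoring or down-weighting the irrelevant parts.

The core idea of attention is therefore "dynamic, content-dependent information selection": giving the model this ability to focus dynamically.

When processing a particular element (the query), the model can deliberately "look at" other elements (the keys) and decide, based on how relevant they are to the current element (the query-key match), how much information to extract from each of them (the weighted values).

This also removes the limitation of fixed encoding. For comparison:

Conventional layers (fully connected, CNN, RNN) encode an element (a word, a pixel) mainly through a fixed context window or predefined positional relations (a CNN's kernel, an RNN's temporal dependency).

Such fixed schemes are inefficient or ineffective for long-range dependencies, complex relations, or tasks that need global context.

In contrast, attention lets the model, at any position in the sequence, directly access and evaluate every other position, and decide how much to depend on each according to content relevance rather than fixed position or distance.

The experimental results in the paper support this reasoning.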

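Given the definition of attention above, all we need to design is a compatibility function. Denote it by $f_c$; then

$\text{attention score}=f_c(q, k)\cdot v$

Here $f_c$ produces a relation matrix $R=(r_{ij})$, where $r_{ij}$ measures how well $q_i$ matches $k_j$. The final matrix-vector product computes the weighted sum, which gives the attention score.

Because we want to use the degree of match as weights on $v$, we would like the weights to:

Sum to 1 (normalization and a probabilistic interpretation);
Amplify the gap between the highest score and the rest, for a "sharp focus" effect (emphasizing salient items and suppressing irrelevant ones);
Be convenient for gradient computation in the rest of the network;

softmax meets all of these requirements (normalized output, easy to interpret, exponential amplification, simple derivative), so the formula is more accurately written as:

$\text{attention score}=\text{softmax}(f_c(q, k))\cdot v$

We call $f_c(q,k)$ the alignment score: each of its entries is the unnormalized degree of match between a query and the corresponding key.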

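Innovation 1: Scaled Dot-Product Attention

As a concrete compatibility function, the paper introduces the "scaled dot product": $f_c(q,k)=\dfrac{q^Tk}{\sqrt{d_k}}$, where $q$ and $k$ are both $d_k$-dimensional vectors.

In practice we compute in batches, so the input vectors are stacked into matrices, which is slightly more involved than the single-vector case. The input consists of queries and keys of dimension $d_k$ (denoted $q_i$ and $k_j$) and values of dimension $d_v$ (denoted $v_j$). To obtain the alignment scores, take the dot product of each $q_i$ with every $k_j$ and divide each result by $\sqrt{d_k}$.

For example, the scaled dot-product alignment score of $q_i$ (the $i$-th query vector) and $k_j$ (the $j$-th key vector) is:

$e_{ij}=\dfrac{q_i^Tk_j}{\sqrt{d_k}}$

The complete attention computation is equivalent to the following matrix form:

$\text{Attention}(Q,K,V)=\text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

We call attention that uses the scaled dot product as its compatibility function "scaled dot-product attention".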
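To make the matrix formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention on nested lists. The function names `softmax` and `scaled_dot_product_attention` and the toy matrices are illustrative assumptions, not code from the paper; a real implementation would use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of lists).
    Returns (output, weights): output is n x d_v, weights is n x m."""
    d_k = len(Q[0])
    d_v = len(V[0])
    weights = []
    for q in Q:
        # alignment scores e_ij = q_i . k_j / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # each output row is the weighted sum of the value rows
    output = [[sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
              for w in weights]
    return output, weights

# Toy example: 2 queries, 3 key-value pairs, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = scaled_dot_product_attention(Q, K, V)
print(w)    # each row sums to 1
print(out)  # 2 x 2 matrix of weighted value sums
```

The first query matches keys 0 and 2 equally well (dot product 1) and key 1 not at all (dot product 0), so its weights on keys 0 and 2 are equal and larger than its weight on key 1.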

[!NOTE]

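The paper also discusses the two most commonly used compatibility functions, additive and dot-product (giving additive attention and dot-product attention), before settling on the scaled dot product.

The additive function computes compatibility with a feed-forward network that has a single hidden layer ($\tanh$ is the activation function; here $v$ is a learned weight vector, not a value vector):
$f_c(q,k)=v^T\tanh(W_qq+W_kk)$
The two are similar in theoretical complexity, but dot-product attention is faster and more space-efficient in practice, because it can be implemented with highly optimized matrix multiplication code.

The paper notes, however, that for large $d_k$ plain dot-product attention performs worse than additive attention. The authors suspect that the dot products grow large in magnitude and push the softmax into regions with extremely small gradients, so they scale the dot products by $\dfrac{1}{\sqrt{d_k}}$. That is the origin of the scaled dot product.

Original text: We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients [4]. To counteract this effect, we scale the dot products by $\dfrac{1}{\sqrt{d_k}}$.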
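The saturation effect can be illustrated with a small deterministic sketch. The scores below are a hypothetical worst case (an all-ones query that matches one all-ones key, against three zero keys), chosen only to show the trend; `top_weight_gradient` is an illustrative helper that evaluates the softmax derivative $\partial p_0/\partial s_0=p_0(1-p_0)$.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def top_weight_gradient(scores):
    """d p_0 / d s_0 = p_0 * (1 - p_0), written as p_0 * sum(other weights)
    to avoid cancellation when p_0 is very close to 1."""
    p = softmax(scores)
    return p[0] * sum(p[1:])

d_k = 64
# Hypothetical scores: q . k_0 = d_k (all-ones vectors), other keys are zero vectors
raw = [float(d_k), 0.0, 0.0, 0.0]
scaled = [s / math.sqrt(d_k) for s in raw]

grad_raw = top_weight_gradient(raw)
grad_scaled = top_weight_gradient(scaled)
print(grad_raw, grad_scaled)
```

Without scaling the gradient is vanishingly small, while dividing by $\sqrt{d_k}$ keeps it many orders of magnitude larger. This matches the paper's footnote: if the components of $q$ and $k$ are independent with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$.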

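Innovation 2: Multi-Head Attention

The paper observes that, instead of performing a single attention function on $d_{\text{model}}$-dimensional queries, keys and values, it is beneficial to linearly project them $h$ times with different learned projections to $d_k$, $d_k$ and $d_v$ dimensions respectively, and then to perform attention on each projected version in parallel. Each projection yields a $d_v$-dimensional output (attention score).

The authors point out that this strengthens the model's ability to capture information from different subspaces.

Original text: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

The full multi-head computation is more involved, so we state the result directly (in matrix form, with all positions computed at once):

$\begin{aligned} \text{MultiHead}(Q,K,V)&=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O\\ \text{where head}_i&=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V) \end{aligned}$

where $W_i^Q\in\mathbf{R}^{d_{\text{model}}\times d_k}$, $W_i^K\in\mathbf{R}^{d_{\text{model}}\times d_k}$, $W_i^V\in\mathbf{R}^{d_{\text{model}}\times d_v}$ and $W^O\in\mathbf{R}^{hd_v\times d_\text{model}}$ are learnable projection matrices (parameters, not hyperparameters).

The paper uses $h=8$ and $d_k=d_v=\dfrac{d_{\text{model}}}{h}=64$.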
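As a quick arithmetic check of these sizes, the sketch below counts the projection weights of one multi-head attention block (biases ignored). The variable names are illustrative. It shows why the paper can say that, with the reduced per-head dimension, the total cost is similar to single-head attention at full dimensionality.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h   # 64, as in the paper

# per head: W_i^Q and W_i^K are d_model x d_k, W_i^V is d_model x d_v
per_head = 2 * d_model * d_k + d_model * d_v
# all heads plus the output projection W^O of shape (h * d_v) x d_model
multi_head_params = h * per_head + (h * d_v) * d_model
# a single head at full dimensionality: four d_model x d_model matrices
single_head_params = 4 * d_model * d_model

print(d_k, multi_head_params, single_head_params)
```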

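The Transformer uses attention in three different ways:

Every self-attention layer in the encoder;
- Input: here Q = K = V = the output of the previous layer;
  
  The queries, keys and values all come from the same place, which is why it is called "self-attention";
- Purpose: let every word (position) of the input sequence attend to all positions of that same sequence, capturing dependencies between words regardless of distance;
- Characteristics:
  - Global context: each word's representation incorporates information from the whole input sequence;
  - Parallel computation: because Q, K and V all come from the previous layer's output for the same sequence, and the computation does not depend on order, the whole layer parallelizes well;
  - Multi-head attention;
The masked self-attention layer in the decoder;
- Input: again Q = K = V = the output of the previous layer;
- Purpose: let position i of the output sequence attend to all positions up to and including i (1 to i), but not to later positions (i+1 to m). In short, it preserves the auto-regressive property and prevents information leakage (cheating);
- Characteristics:
  - Masking: after the dot products of Q (position i) with all K (position j) and before the softmax, the entries with j > i are set to a very large negative number (such as -10^9 or -inf), so their weights are practically 0 after the softmax;
  - Otherwise the same as the encoder's self-attention layer;
Encoder-decoder attention layers:
- Input: this differs from the two cases above:
  - Queries (Q): come from the previous decoder sub-layer (the output of the masked self-attention sub-layer, which represents position i and everything before it).
  - Keys (K) and values (V): come from the final output of the encoder (the output of the last encoder layer, which encodes the whole input sequence);
  The query (from the decoder) deliberately looks up keys and values from a different sequence (from the encoder),
  
  so, in contrast to self-attention, this is also called cross attention;
- Purpose: let position i of the output sequence attend to every position of the input sequence (1 to n). This is the classic "source-target" attention: while generating a target word, the decoder can dynamically focus on the most relevant parts of the input (as in conventional Seq2Seq + Attention models);
- Characteristics:
  - Source-target alignment: its core role is to find the most relevant source-side context for the current target state, which is essential for tasks such as translation and summarization;
  - Information bridge: it is the main channel connecting the encoder and the decoder;
  - Multi-head attention;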
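The masking step described for the decoder's self-attention can be sketched in a few lines of pure Python. `causal_mask` and `masked_softmax` are illustrative names, the uniform scores are a made-up input, and `-1e9` stands in for `-inf` as in the text.

```python
import math

NEG_INF = -1e9  # large negative stand-in for -inf

def causal_mask(n):
    """mask[i][j] == 1 if query position i may attend to key position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax after replacing masked-out scores with NEG_INF."""
    out = []
    for row, mask_row in zip(scores, mask):
        filled = [s if m else NEG_INF for s, m in zip(row, mask_row)]
        top = max(filled)
        exps = [math.exp(v - top) for v in filled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 3 for _ in range(3)]  # uniform alignment scores, for illustration
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

Row i spreads its weight evenly over positions 0..i and puts zero weight on every later position, which is exactly the "no peeking ahead" property.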

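Innovation 3: Position-wise Feed-Forward Networks

As noted in the architecture overview, one of the sub-layers in every encoder and decoder layer is a position-wise fully connected feed-forward network: two linear transformations with a ReLU activation in between:

$\text{SubLayer}_\text{FFN}(x)=\max(0,\space xW_1+b_1)W_2+b_2$

The input and output dimension is $d_\text{model}=512$, and the inner layer between the two linear transformations has dimension $d_{ff}=2048$;

The paper also notes that this can equivalently be described as two convolutions with kernel size 1.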
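"Position-wise" means the same two-layer network is applied to every position separately and identically. The sketch below shows this with toy sizes ($d_\text{model}=2$, $d_{ff}=3$); the weights and the helper names `linear`, `ffn` and `position_wise_ffn` are made up for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    """v: length d_in, W: d_in x d_out, b: length d_out. Returns v W + b."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """max(0, x W1 + b1) W2 + b2 for a single position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def position_wise_ffn(seq, W1, b1, W2, b2):
    """Apply the same FFN (shared weights) to every position independently."""
    return [ffn(x, W1, b1, W2, b2) for x in seq]

# Toy weights: d_model = 2, d_ff = 3
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

seq = [[1.0, 2.0], [3.0, -1.0]]  # 2 positions
out = position_wise_ffn(seq, W1, b1, W2, b2)
out_reversed = position_wise_ffn(seq[::-1], W1, b1, W2, b2)
print(out)
```

Because no information flows between positions, reversing the input sequence simply reverses the output rows.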

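Innovation 4: Use of Embeddings and Softmax

Like other sequence transduction models, the Transformer uses learned embeddings to convert the input and output tokens to vectors of dimension $d_\text{model}$, and uses a linear transformation followed by a softmax to convert the decoder output to predicted next-token probabilities.

In this model, the two embedding layers (the input and output embedding layers, drawn as pink blocks in the architecture figure) and the pre-softmax linear transformation (the Linear block after the decoder in the architecture figure) share the same weight matrix. In the embedding layers, those weights are additionally multiplied by $\sqrt{d_\text{model}}$;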

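Innovation 5: Positional Encoding

Because the Transformer contains neither sequence-aligned recurrence nor convolution, the model needs some other way to make use of the order of the sequence. The authors therefore inject information about the relative or absolute position of the tokens into the input and output embeddings before they enter the encoder and decoder. This is called "positional encoding".

The positional encodings have the same dimension $d_\text{model}$ as the embedding vectors, so the two can simply be summed.

Positional encodings can be learned or fixed in advance. The paper uses sine and cosine functions of different frequencies (which is also why the architecture figure draws the positional encoding as an oscilloscope-like symbol):

$\begin{aligned} PE_{(pos,2i)}&=\sin\left(\dfrac{pos}{10000^{2i/d_\text{model}}}\right)\\ PE_{(pos,2i+1)}&=\cos\left(\dfrac{pos}{10000^{2i/d_\text{model}}}\right) \end{aligned}$

where $pos$ is the position of the token in the sequence and $i$ indexes pairs of dimensions of an embedding vector ($d_\text{model}$ dimensions in total). Each position $pos$ thus corresponds to a $d_\text{model}$-dimensional vector, with sine in the even dimensions and cosine in the odd dimensions.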

[!IMPORTANT]

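Why does the model need to "perceive the order of the sequence" (the motivation), and why is the encoding designed this way (the rationale)?

The motivation boils down to one thing: compensating for the fact that self-attention ignores position.

Mathematically, self-attention on its own is permutation equivariant: if the input sequence is shuffled, the outputs are the same vectors shuffled in the same way, because each output depends only on the content similarity between tokens and not on where they sit.

Yet the model must perceive order in order to capture differences in meaning: "the cat chases the dog" is not "the dog chases the cat".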
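The order-blindness of plain self-attention can be checked numerically. The sketch below uses a stripped-down self-attention with $Q=K=V=X$ and no learned projections (an assumption made to keep the example short); `self_attention` and the toy input are illustrative.

```python
import math

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]   # 3 tokens, no positional information
perm = [2, 0, 1]                           # shuffle the token order
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)
# Row r of Y_perm matches row perm[r] of Y, up to floating-point rounding
print(Y)
print(Y_perm)
```

Each token gets the same output vector regardless of where it appears, so without positional encodings the model cannot tell two orderings of the same words apart.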

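The rationale is again a mathematical property: for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$. Proof (by the angle-addition formulas):
$\begin{bmatrix} \sin(\omega_i(pos+k))\\\cos(\omega_i(pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_ik)&\sin(\omega_ik)\\-\sin(\omega_ik)&\cos(\omega_ik) \end{bmatrix} \begin{bmatrix} \sin(\omega_i\,pos)\\\cos(\omega_i\,pos) \end{bmatrix}$
where $\omega_i=\dfrac{1}{10000^{2i/d_\text{model}}}$. The larger $i$ is, the lower the frequency (the longer the wavelength), so high-index dimensions change slowly and encode coarse, long-range position, while low-index dimensions change quickly and distinguish neighboring positions.

This property makes it easy for the model to learn to attend by relative position.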
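The linear-transformation property can be verified numerically. In this sketch `positional_encoding` builds one encoding vector from the formula above, and `shift` applies the per-frequency rotation matrix from the proof; both names and the toy sizes are illustrative assumptions.

```python
import math

def omega(i, d_model):
    return 1.0 / (10000 ** (2 * i / d_model))

def positional_encoding(pos, d_model):
    """PE vector for one position: sin in even dimensions, cos in odd dimensions."""
    pe = []
    for i in range(d_model // 2):
        pe.extend([math.sin(omega(i, d_model) * pos), math.cos(omega(i, d_model) * pos)])
    return pe

def shift(pe, k, d_model):
    """Apply the 2x2 rotation for offset k to every (sin, cos) pair of pe."""
    out = []
    for i in range(d_model // 2):
        s, c = pe[2 * i], pe[2 * i + 1]
        wk = omega(i, d_model) * k
        out.extend([math.cos(wk) * s + math.sin(wk) * c,
                    -math.sin(wk) * s + math.cos(wk) * c])
    return out

d_model = 8
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)
max_error = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_error)  # tiny: only floating-point rounding remains
```

Note that the rotation matrix depends only on the offset $k$ and not on $pos$, which is what lets a single linear map express "move $k$ positions ahead" everywhere in the sequence.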

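Beyond this mathematical property, there are two further considerations:

Extrapolation: the periodicity of sine and cosine may allow the model to generalize to sequences longer than those seen in training (for example, longer sentences at test time), whereas learned positional embeddings have no values for positions never seen in training;

Original text: We also experimented with using learned positional embeddings [8] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

Bounded range: every component lies in $[-1,1]$, which keeps the encoding on a scale that is compatible with the word embeddings it is added to.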

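Training

The paper trains on the WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs. Sentences are encoded with byte-pair encoding, using a shared source-target vocabulary of about 37000 tokens.

The optimizer is Adam ($\beta_1=0.9$, $\beta_2=0.98$, $\varepsilon=10^{-9}$), combined with a custom learning-rate schedule: the learning rate increases linearly for the first $\text{warmup\_steps}=4000$ training steps and then decreases proportionally to the inverse square root of the step number: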

$l=d^{-0.5}_\text{model}\cdot\min(\text{step\_num}^{-0.5},\text{step\_num}\cdot\text{warmup\_steps}^{-1.5})$
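The schedule is a one-liner in code. The function name `transformer_lrate` is an illustrative choice; the defaults follow the values quoted above, and `step_num` is assumed to start at 1.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num >= 1)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 2000, 4000, 8000, 100000):
    print(step, transformer_lrate(step))
```

The two arguments of `min` are equal exactly at `step_num == warmup_steps`, so the rate peaks there at $(d_\text{model}\cdot\text{warmup\_steps})^{-0.5}$; halfway through warm-up it is half of that peak.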

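The paper uses three kinds of regularization during training:

Ordinary dropout: as in any reasonably deep network, dropout layers are added as regularization to prevent overfitting. Nothing new here, so we skip the details.
Residual dropout: dropout is applied to the output of each sub-layer, before it is added to the sub-layer input (the residual connection) and normalized. Dropout is also applied to the sums of the embeddings and the positional encodings. The base model uses a dropout rate of 0.1;
Label smoothing: training uses a smoothing rate of $\varepsilon=0.1$;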

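Background: what is label smoothing?

In a conventional classification task (such as predicting the next word in machine translation), labels are usually one-hot encoded (probability 1 for the correct word, 0 for all others). The problem is that this makes the model overconfident: it is forced to push the probability of the correct word toward 1 and all others toward 0, which tends to cause overfitting and hurts generalization.

Label smoothing sets the target probability of the correct word to $1-\varepsilon$ and spreads the remaining $\varepsilon$ uniformly over all other words (with $V$ words in the vocabulary):
$P=\left\{\begin{aligned} 1-\varepsilon&,\space\text{if correct}\\ \dfrac{\varepsilon}{V-1}&,\space\text{otherwise} \end{aligned}\right.$
Why, then, does the paper say that label smoothing hurts perplexity? Recall the definition of perplexity (the exponential of the cross-entropy); a lower value means the model predicts more accurately and is less uncertain about which word to choose:
$\text{perplexity}=\exp\left(-\dfrac{1}{N}\sum\limits_{i=1}^N\log P(w_i|\text{context})\right)$
Label smoothing makes the target distribution smoother, so the model learns to be less certain, which raises its perplexity;

But it improves generalization in other respects (the paper reports better accuracy and BLEU):

The model becomes less sensitive to noise or idiosyncratic patterns in the training data;

It is also less likely to be overconfident, which makes it more robust at test time near fuzzy boundaries (such as near-synonyms);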
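The smoothed target distribution and its effect on the loss can be sketched as follows. `smoothed_targets` and `cross_entropy` are illustrative helpers, and the 4-word vocabulary and the "overconfident" prediction are made-up values.

```python
import math

def smoothed_targets(correct_index, vocab_size, eps=0.1):
    """1 - eps on the correct word, eps / (V - 1) on every other word."""
    off_value = eps / (vocab_size - 1)
    targets = [off_value] * vocab_size
    targets[correct_index] = 1.0 - eps
    return targets

def cross_entropy(target, predicted):
    """-sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

target = smoothed_targets(correct_index=2, vocab_size=4, eps=0.1)

overconfident = [0.001, 0.001, 0.997, 0.001]   # hypothetical near one-hot prediction
matched = list(target)                          # prediction equal to the smoothed target

loss_overconfident = cross_entropy(target, overconfident)
loss_matched = cross_entropy(target, matched)
print(target)
print(loss_overconfident, loss_matched)
```

Against smoothed targets, the near one-hot prediction is penalized more than a prediction that keeps some probability mass on the other words, which is how label smoothing discourages overconfidence.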

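Results in Brief

The paper first compares self-attention with the alternatives on theoretical grounds ("why self-attention"):

Comparing the total computational complexity per layer, the amount of computation that can be parallelized (measured by the minimum number of sequential operations), and the maximum path length between long-range dependencies, self-attention comes out as the efficient choice. As a side benefit, the paper notes that self-attention could yield more interpretable models: Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

It then reports results that compare favorably with other Seq2Seq models: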

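Vocabulary in Context

transduction model: a model that takes a sequence as input and produces a sequence as output;

eschew: to avoid or deliberately do without;

counteract: to neutralize or offset (an undesirable effect).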

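Code Implementation and Walkthrough

Prerequisites for this chapter: basic PyTorch usage. You should at least know how to build a model such as a linear classifier or a CNN with PyTorch.

Every important point in the code is marked with a Note comment.

Input Encoder Layer

The input embedding layer:

Following the paper, $d_\text{model}=hd_k=512$, so each token embedding has 512 dimensions;
Also following the paper, the embedding layers multiply the weights by $\sqrt{d_\text{model}}$ (see Innovation 4);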

import torch
from torch import nn
import math

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        """
        Initialize a Transformer input embedding layer
        :param d_model: the dimension of an embedding vector
        :param vocab_size: the full size of the vocabulary
        """
        super().__init__()
        self.d_model = d_model
        self.vocab_sz = vocab_size
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

As the code shows, PyTorch's built-in nn.Embedding is all we need for the token embedding.

Positional Encoding

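Following the paper, position information is injected by adding a $PE$ of the same dimension to the input/output embeddings;
The formula is given in the paper;
Note the dropout layer added here (it is there to improve generalization; do not confuse it with Layer Normalization). The dropout rate is 0.1;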

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, p_dropout: float = 0.1):
        """
        Initialize a Transformer position encoding layer
        :param d_model: the dimension of an embedding vector
        :param seq_len: the max length of the tokens for the input sequence
        :param p_dropout: the dropout rate for the current layer
        """
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(p_dropout)

        # PE_{(pos,2i)}=\sin(\dfrac{pos}{10000^{2i/d_\text{model}}})
        # PE_{(pos,2i+1)}=\cos(\dfrac{pos}{10000^{2i/d_\text{model}}})
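        pe = torch.zeros((seq_len, d_model))
        # construct Tensor(seq_len, 1) from (seq_len,)
        pos = torch.arange(0, seq_len, dtype=torch.float64).unsqueeze(1)
        # 1 / 10000^{2i/d_model} for every even dimension index 2i: shape (d_model / 2,)
        # Note: the range must run over d_model (feature dimension), not seq_len
        divisor_part = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float64) / self.d_model * (-math.log(10000.0)))
        # for every position (row): even dimensions use sin, odd dimensions use cos
        # broadcast: (seq_len, 1) * (d_model / 2,) -> (seq_len, d_model / 2)
        pe[:, 0::2] = torch.sin(pos * divisor_part)
        pe[:, 1::2] = torch.cos(pos * divisor_part)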

        # Note: here the size of the position encoding vector is (seq_len, d_model),
        # which cannot deal with batch input embeddings (multiple sequences).
        # We should generate (1, seq_len, d_model) for **batch processing**
        pe = pe.unsqueeze(0)

        # Note: If you want to save a variable in a nn.Module which is not a learnable parameter,
        # then you need to register it as a buffer so that PyTorch will save it for you.
        # And next time (e.g., training/inference stage) you can load it from the model checkpoint file!
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Apply position encoding to input sequence
        :param x: a tensor with shape(batch, seq_len, d_model) including all the embeddings in batched sequences
        :return: PE(x)
        """
        # use pe buffer here
        # Note: truncate buffer self.pe to fit sequence length
        # Note: position encoding is fixed and is not learnable.
        # So we should tell PyTorch using 'requires_grad_(False)'.
        # Note: shape is a tuple-like attribute, so index it with [1] (not a call)
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

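Normalization Layer (the "`Norm`" in `Add & Norm`)

We start with the normalization layer, the $\text{LayerNorm}$ (Layer Normalization) of the paper. It normalizes a single sample (token) over all of its feature dimensions (along $d_\text{model}$) at a given layer.

Why it is needed: the feature dimensions of an embedding representation can differ by several orders of magnitude. To take an extreme example, a token's embedding vector might look like [0.001, 10000.103, -9999999.113], which is bad for gradient computation and convergence. Layer Normalization is designed to stabilize the training of deep networks (mitigating gradient problems) and to speed up convergence (reducing internal covariate shift).

Layer Normalization does more than plain normalization: it also applies an affine transformation.

Why is the affine transformation needed?

It lets the model learn whether, and how, to restore the original scale or offset of certain feature dimensions after normalization. Without it, normalization could destroy some of the representational power the network has already learned.

The steps are therefore:

Plain normalization: $\hat{h}=\dfrac{h-\mu}{\sqrt{\sigma^2+\varepsilon}}$, where $\varepsilon$ is a tiny constant for numerical stability, $\mu=\dfrac{1}{d_\text{model}}\sum\limits_{i=1}^{d_\text{model}}h_i$, and $\sigma^2=\dfrac{1}{d_\text{model}}\sum\limits_{i=1}^{d_\text{model}}(h_i-\mu)^2$;
A learnable affine transformation (the parameters $\gamma$ and $\beta$ are both learned: the former scales the normalized values and the latter shifts them):
$y=\gamma\cdot\hat{h}+\beta$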
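The two steps can be checked on the extreme vector from the text with a small pure-Python sketch. `layer_norm` is an illustrative helper with scalar `gamma` and `beta`; the default `eps=1e-5` is an assumption of this sketch, not a value taken from the paper.

```python
import math

def layer_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one token vector over its feature dimension, then scale and shift."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)   # biased variance, as in the formula
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in h]

h = [0.001, 10000.103, -9999999.113]   # the extreme example from the text
y = layer_norm(h)

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(y)
print(mean_y, var_y)   # close to 0 and 1
```

After normalization the values sit on a common scale (mean close to 0, variance close to 1) no matter how widely the raw features were spread.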

[!NOTE]

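It is worth contrasting this with ordinary Batch Normalization.

Batch Norm: computes the mean and variance of the same feature dimension across all samples in a batch. It depends on the batch size and the sequence length (sequences must be padded to align), is sensitive to batch size, and is awkward to apply to sequence models such as RNNs and Transformers.

Layer Norm: computes the mean and variance over all feature dimensions of a single sample. It is independent of the batch size, and because statistics are computed separately for every sample and every position (each token is normalized on its own), it is a natural fit for variable-length sequence data such as sentences. It avoids the limitations of Batch Norm on sequence data (padding for alignment, reliance on batch statistics).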

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-9):
        super().__init__()
        self.eps = eps
        # Note: use Parameter for learnable parameters in nn.Module
        self.gamma = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        """
        Apply layer normalization to current input (LayerNorm)
        :param x: the tensor output from every sub-layer
        :return: LayerNorm(x)
        """
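        # 'mean'/'std' along the dimension in each token (indexing elements of an embedding)
        # Note: 'mean'/'std' always cancels the dimension it applies.
        # Here we need to keep it for further calculation, otherwise we will need to un-squeeze
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True)
        # Note: this simplified version divides by (std + eps) using torch's default
        # unbiased std and scalar gamma/beta, so it differs slightly from the formula
        # above (biased variance under the square root, per-feature gamma/beta).
        return self.gamma * (x - mu) / (sigma + self.eps) + self.beta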

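The residual connection block (with its residual dropout) is covered later.

Position-wise Fully Connected Feed-Forward Network

The formula and the network structure are given in the paper; see "Innovation 3".

When implementing it, we suggest adding a dropout layer after the activation function as regularization.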

class FFNBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, ff_dropout: float):
        """
        Initialize a 2-layer fully connected feed-forward network with activation function ReLU.
        :param d_model: input and output size
        :param d_ff: hidden layer size
        :param ff_dropout: dropout rate for this network
        """
        super().__init__()
        self.d_model = d_model
        self.d_ff = d_ff
        self.linear_1 = nn.Linear(in_features=d_model, out_features=d_ff)
        # Note: we add a dropout layer here for regularization
        self.dropout = nn.Dropout(ff_dropout)
        self.linear_2 = nn.Linear(in_features=d_ff, out_features=d_model)

    def forward(self, x):
        """
        Apply this FFN to the input tensor.
        (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        :param x: the input tensor
        :return: SubLayer_{FFN}(x)
        """
        return self.linear_2(
            self.dropout(
                torch.relu(
                    self.linear_1(x))))

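Multi-Head Self-Attention, Masked Multi-Head Self-Attention, and Encoder-Decoder Attention Blocks

The formulas in the paper look fairly complicated here, especially once head splitting and batching are involved, and it is easy to mix up tensor shapes. We suggest annotating the architecture figure with the shape at every step.

For the encoder's multi-head self-attention block (or the decoder's masked multi-head self-attention block), $Q=K=V=\text{Input}$, while for encoder-decoder attention $Q=\text{Decoder Input},\space K,V=\text{Encoder Input}$. The blocks differ only in the arguments passed in. We come back to the mask in a moment.

To simplify the analysis we drop the batch dimension for now. For a single sequence (a matrix whose rows are the token embedding vectors), the multi-head attention layer computes as shown in the figure below:

We now go through it step by step. The input shape is $(\text{seq-len},d_\text{model})$; in this implementation that holds for every kind of attention;

First compute the heads $\text{head}_i$:
1. Define the learnable parameter matrices $W^Q,W^K,W^V$. They must not change the shape of $Q,K,V$, so each is of size $d_\text{model}\times d_\text{model}$;
2. Compute $QW^Q,KW^K,VW^V$ directly, then split each result evenly into $h$ pieces along the feature dimension (the $d_\text{model}$ axis); call the pieces $QW_i^Q,KW_i^K,VW_i^V$;
  
  Since there are $h$ pieces, each has size $\text{seq-len}\times \dfrac{d_\text{model}}{h}=\text{seq-len}\times d_k$ (by the definition $d_k=d_v=\dfrac{d_\text{model}}{h}$);
  
  This step amounts to projecting $Q,K,V$ into $h$ different subspaces;
3. Compute scaled dot-product attention once for each piece: $\text{head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)$;
The $\text{head}_i$ now hold the attention information of the representation in $h$ different subspaces. We simply concatenate them and apply a final linear transformation ($\times W^O$). The concatenated heads have size $\text{seq-len}\times (h\cdot d_v)$, and since the output must have the same size as the input, $W^O$ must have size $(h\cdot d_v)\times d_\text{model}$.
Finally, as with the positional encoding and the FFN, we add a dropout layer as regularization.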
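The split-and-concatenate bookkeeping can be sketched on plain nested lists before looking at the tensor version. `split_heads` and `concat_heads` are illustrative names, and the toy matrix stands in for a projected $QW^Q$ with seq-len 3 and $d_\text{model}=8$.

```python
def split_heads(X, h):
    """X: seq_len x d_model -> list of h matrices, each seq_len x d_k."""
    d_model = len(X[0])
    assert d_model % h == 0, "d_model is not divisible by h"
    d_k = d_model // h
    return [[row[i * d_k:(i + 1) * d_k] for row in X] for i in range(h)]

def concat_heads(heads):
    """Inverse of split_heads: list of h (seq_len x d_k) -> seq_len x (h * d_k)."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Toy projected matrix: entry (t, c) is 10 * t + c, so each value shows its origin
X = [[float(10 * t + c) for c in range(8)] for t in range(3)]
heads = split_heads(X, h=4)

print(len(heads), len(heads[0]), len(heads[0][0]))  # 4 heads, each 3 x 2
print(heads[1])                                      # columns 2..3 of every row
print(concat_heads(heads) == X)
```

Each head sees every position but only a contiguous slice of the feature dimension, and concatenation puts the slices back in their original order.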

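For reusability, the class should define one attention function that optionally applies a mask, so the same code also serves the masked multi-head attention block. If a mask is needed, it is applied to the alignment scores while computing $\text{head}_i$ (after $QW_i^Q(KW_i^K)^T$ is computed and before the softmax).

[!TIP]

A neat trick: the parameter matrices above ($W^Q,W^K,W^V,W^O$) can be represented directly by bias-free nn.Linear layers (linear layers without an activation function), because that is exactly what such a layer computes. Make sure the input and output feature sizes match the rows and columns of the matrix (the batch and sequence dimensions are handled by broadcasting).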

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, p_dropout: float):
        """
        Build a multi-head attention block with source mask (optional)
        :param d_model: the dimension of an embedding vector
        :param h: the number of the heads
        :param p_dropout: dropout rate for this network
        """
        super().__init__()
        self.h = h
        self.d_model = d_model
        self.d_k = d_model // h     # Use floor div
        assert d_model % h == 0, "d_model is not divisible by h"

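        # Note: bias=False so that each layer is exactly the matrix product x @ W
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

        # Note: h * d_v == d_model. so (h*d_v, d_model) == (d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model, bias=False)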
        self.dropout = nn.Dropout(p_dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Helper for calculating MHA scores
        :param query: Query tensor with batch and heads stacked (batch, h, seq_len, d_k)
        :param key: Key tensor with the same shape of Query
        :param value: Value tensor with shape (batch, h, seq_len, d_v), where `d_v == d_k`
        :param mask: (optional) mask broadcastable to (batch, h, seq_len, seq_len), e.g. (1, 1, seq_len, seq_len).
                     Entry (i, j) == 0 means query position i must not attend to key position j
        :param dropout: (optional) dropout module applied to the attention weights
        :return: (per-head attention outputs, softmax-normalized alignment scores)
        """
        d_k = query.shape[-1]

        # Note: we need to switch dimension seq_len and d_k to do multiplication
        # [IMPORTANT] the shape of aligned_scores:
        # (batch, h, seq_len, d_k) x (batch, h, d_k, seq_len) -> (batch, h, seq_len, seq_len)
        aligned_scores: torch.Tensor = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            aligned_scores.masked_fill_(mask == 0, -1e9)
        aligned_scores = aligned_scores.softmax(dim=-1)
        if dropout is not None:
            aligned_scores = dropout(aligned_scores)

        # (batch, h, seq_len, seq_len) x (batch, h, seq_len, d_k) -> (batch, h, seq_len, d_k)
        return aligned_scores @ value, aligned_scores

    def forward(self, q, k, v, mask):
        """
        Calculate MHA scores for input Q/K/V
        :param q: Query tensor with shape (batch, seq_len, d_model)
        :param k: Key tensor with the same shape of Query
        :param v: Value tensor with the same shape of Key
        :param mask: (optional) mask for the alignment scores, broadcastable to (batch, h, seq_len, seq_len)
        :return: Output tensor with shape (batch, seq_len, d_model)
        """
        q_prime: torch.Tensor = self.w_q(q)
        k_prime: torch.Tensor = self.w_k(k)
        v_prime: torch.Tensor = self.w_v(v)

        # split q_prime, k_prime and v_prime into h pieces.
        # And we use different dimension to indicate partition!
        # Reshape (without creating new memory area):
        # (batch, seq_len, d_model) -> (batch, seq_len, h, d_k)
        # [IMPORTANT] Note: seq_len and d_k must end up as the last two (adjacent) dimensions
        # for the attention matmul, so we swap the h and seq_len dimensions
        # (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
        q_split = q_prime.view((q_prime.shape[0], q_prime.shape[1], self.h, self.d_k)).transpose(1, 2)
        k_split = k_prime.view((k_prime.shape[0], k_prime.shape[1], self.h, self.d_k)).transpose(1, 2)
        v_split = v_prime.view((v_prime.shape[0], v_prime.shape[1], self.h, self.d_k)).transpose(1, 2)

        # shape of split_mha_scores: (batch, h, seq_len, d_v)
        # shape of alignment_scores: (batch, h, seq_len, seq_len)
        split_mha_scores, alignment_scores = MultiHeadAttentionBlock.attention(
            q_split, k_split, v_split, mask, self.dropout)

        # concat the split MHA scores and multiply by w_o (d_k == d_v)
        # concat: (batch, h, seq_len, d_v) -> (batch, seq_len, h, d_v) -> (batch, seq_len, h*d_v)
        # [IMPORTANT] do contiguous() here to declare memory copy explicitly
        mha_scores = (split_mha_scores
                      .transpose(1, 2).contiguous()
                      .view((split_mha_scores.shape[0], -1, self.h * self.d_k)))

        # multiply: (batch, seq_len, h*d_v) x (h*d_v, d_model) -(broadcast)-> (batch, seq_len, d_model)
        return self.w_o(mha_scores)

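The code is somewhat involved, so the 3 most important and tricky places are marked with [IMPORTANT]. We look at each in turn.

First, q_split = q_prime.view((q_prime.shape[0], q_prime.shape[1], self.h, self.d_k)).transpose(1, 2). view creates a new tensor object with a different shape and stride that shares its data memory with q_prime (a reshape by reference). That part is easy to understand.

But why do we need transpose to swap the $h$ and seq_len dimensions?

Because computing the attention score requires seq_len and $d_k$ to be the last two, adjacent dimensions, so that the matrix multiplication acts on them for every batch and every head.

Next, aligned_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k) in the attention function corresponds to the scaled dot-product compatibility function $\dfrac{QK^T}{\sqrt{d_k}}$.

The shapes are the hard part here: why does (batch, h, seq_len, d_k) x (batch, h, d_k, seq_len) give (batch, h, seq_len, seq_len)?

This is how multiplication of higher-dimensional tensors works: @ multiplies the last two dimensions as matrices and treats (broadcasts) all leading dimensions as batch dimensions. If we wrote query.transpose(-2,-1) @ key instead, the shape would be $(\text{batch}, h, d_k, d_k)$.

You can build intuition with a simple case: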

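import torch

# a has shape (3, 1, 2, 3), b has shape (1, 1, 2, 3)
a = torch.tensor([[[[1,3,3], [2,4,4]]], [[[3,5,5], [4,6,6]]], [[[5,7,7], [6,9,9]]]])
b = torch.tensor([[[[3, 4, 3], [5, 6, 7]]]])
c = a @ b.transpose(-2, -1)   # (2, 3) @ (3, 2) on the last two dims -> shape (3, 1, 2, 2)
d = a.transpose(-2, -1) @ b   # (3, 2) @ (2, 3) on the last two dims -> shape (3, 1, 3, 3)
# e = a @ b would fail: (2, 3) @ (2, 3) has mismatched inner dimensions
print(a.shape, b.shape, c.shape, d.shape)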

Finally, the code that merges the MHA heads back together:

mha_scores = (split_mha_scores
              .transpose(1, 2).contiguous()
              .view((split_mha_scores.shape[0], -1, self.h * self.d_k)))

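.transpose(1, 2) swaps the $h$ and seq_len dimensions back (they were swapped earlier for the attention computation), in preparation for merging.

.contiguous() returns a tensor whose data is laid out contiguously in memory for its current shape, copying the data if needed (as it is after a transpose). view requires this, and other operations benefit from it as well.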

[!TIP]

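Property	`transpose()`	`transpose_()`
Modifies in place	❌ returns a new tensor	✅ modifies the original tensor
Memory sharing	✅ shares memory with the original tensor	✅ same tensor, memory address unchanged
Contiguity	❌ result is non-contiguous	❌ result is non-contiguous
When memory is copied	Only when a contiguous tensor is requested (e.g. `contiguous()`)	Same as left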

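The final .view() performs the actual merge into the required shape.

Residual Connection Block (the "`Add`" in `Add & Norm`)

Having defined the normalization layer, we now turn to the residual connection. In the paper's expression $\text{LayerNorm}(x+\text{SubLayer}(x))$, the residual connection is $x+\text{SubLayer}(x)$: the arrow in the architecture figure that carries the sub-layer's input around it.

For convenience, the residual connection in our code is written together with the normalization layer (one calls the other), since the two always appear together.

Interestingly, many Transformer implementations actually compute $x+\text{SubLayer}(\text{LayerNorm}(x))$ (often called pre-norm): they normalize first, apply dropout to the output of $\text{SubLayer}$, and only then add the result to $x$. Perhaps this works better in practice? Note: the code below also uses this variant, which differs from the paper.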

class ResidualConnection(nn.Module):
    def __init__(self, p_dropout: float):
        """
        Build residual connection layer (Add & Norm)
        :param p_dropout: the normal dropout rate for this layer
        """
        super().__init__()
        self.dropout = nn.Dropout(p_dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sub_layer):
        return x + self.dropout(sub_layer(self.norm(x)))
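The difference between the paper's ordering and the variant used in the code above can be seen in a small pure-Python sketch. `post_norm`, `pre_norm` and the simplified `layer_norm` (no learned parameters) are illustrative, and dropout is left out to keep the comparison deterministic.

```python
import math

def layer_norm(h, eps=1e-5):
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return [(x - mu) / math.sqrt(var + eps) for x in h]

def post_norm(x, sublayer):
    """Paper ordering: LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Variant used in the code above: x + SubLayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(v):
    """A sub-layer that contributes nothing, to isolate the residual path."""
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
print(pre_norm(x, zero_sublayer))    # x passes through unchanged
print(post_norm(x, zero_sublayer))   # x is still normalized
```

With the pre-norm ordering the residual path is a pure identity, so the input reaches later layers untouched; with the paper's ordering every layer renormalizes the sum.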

Encoder & Decoder Block

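Consider the encoder first. We now assemble the modules defined so far into a Transformer encoder (see the architecture figure). We define an EncoderBlock (one MHA, one FFN, two Add & Norm) and an Encoder ($N\times$ EncoderBlock, with $N=6$ in the paper).

Note that the multi-head attention block takes a mask parameter for reusability, so the encoder's attention has to pass one along as well (here it is the source mask).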

class EncoderBlock(nn.Module):
    def __init__(self, mha: MultiHeadAttentionBlock, ffn: FFNBlock, p_dropout: float):
        """
        Build a single encoder block with 1x MHA, 2x Residual Connections, 1x FFN.
        :param mha: Multi-head attention block instance
        :param ffn: Position-wise feed-forward network instance
        :param p_dropout: the dropout rate for each of the residual connection block
        """
        super().__init__()
        self.mha_layer = mha
        self.ffn_layer = ffn
        self.residual_connections = nn.ModuleList([ResidualConnection(p_dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        # Note: sub_layer in ResidualConnection only has one argument,
        # but `forward` in MultiHeadAttentionBlock has 4 parameters (self excluded)
        # So we need to use lambda expr to construct function with one parameter
        x = self.residual_connections[0](x, lambda i: self.mha_layer(i, i, i, src_mask))
        x = self.residual_connections[1](x, self.ffn_layer)
        return x


class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        """
        Construct a Transformer Encoder
        :param layers: the layers of multiple EncoderBlocks
        """
        super().__init__()
        self.layers = layers
        # Question: is this necessary?
        self.norm = LayerNormalization()

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

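One more thing: composing networks differs slightly from defining a single one. The parameters of the parts must be registered with the enclosing module so that the optimizer can find them. The Tip below points out a common pitfall.

[!TIP]

When PyTorch modules are composed from several parts, the parts must not be kept in a plain Python list. Use nn.ModuleList instead (or, if you do not mind the repetition, write each one out as its own attribute). It supports indexing and for-iteration and can be initialized from a List[nn.Module].

Readers may ask why a Python list of nn.Module objects is not enough and nn.ModuleList is required.

Many beginners make this mistake (the author included). Without nn.ModuleList several things can go wrong:

Parameter registration: nn.ModuleList automatically registers all submodules in the list with the parent module. Their parameters (nn.Parameter) are then returned by the parent's parameters() method, so the optimizer can find and update them;

With a plain list, the optimizer never sees these parameters, and the model cannot be saved or loaded correctly (state_dict will not contain them);

Device placement: when model.to(device) is called, all submodules held in an nn.ModuleList and their parameters are moved to the target device (such as a GPU) automatically;

With a plain list, the submodules are not moved, leaving part of the model on the CPU and part on the GPU, which causes runtime errors;

Mode switching can also go wrong. model.train() and model.eval() change the behavior of layers such as dropout, and the call only propagates to registered submodules, so modules kept in a plain list stay in the wrong mode.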

For example:
class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Wrong: layers in a plain Python list are never registered with the module
        self.layers = [
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 2)
        ]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Correct (forward is the same as in BadModel above)
        self.layers = nn.ModuleList([
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 2)
        ])

With that in mind, let us look at the residual connection code again:

class ResidualConnection(nn.Module):
    def __init__(self, p_dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(p_dropout)
        self.norm = LayerNormalization()
        
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

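Two questions to answer:

Why do the LayerNormalization and Dropout modules not need nn.ModuleList to register their parameters?

A: nn.Module automatically registers any module assigned directly as an instance attribute. This does not extend to modules stored inside native Python containers such as list or dict, because the attribute then holds the container rather than a module.

The modules above are stored as self.dropout and self.norm, so they are already managed;
Why does the sublayer passed in not need nn.ModuleList to register its parameters?

A: the sublayer passed in from outside is not a component of ResidualConnection. By convention, the parent module that created it is responsible for it.

After the Tip and these two questions, the role of nn.ModuleList and nn.Sequential should be clear: they manage submodules in cases where writing each one as a separate instance attribute is impractical: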

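Scenario	Recommended approach	Example
A fixed number of submodules	Define them directly as instance attributes	`self.conv = nn.Conv2d()`
A dynamic number of modules / repeated modules	`nn.ModuleList`	A list of layers created in a loop
A collection of modules accessed by name	`nn.ModuleDict`	Modules looked up by key
A fixed sequence of modules executed in order (serial computation)	`nn.Sequential`	`self.seq = nn.Sequential(...)`

Finally, the decoder is very similar to the encoder, so we do not repeat the explanation. Just keep in mind that the mask passed to the encoder MHA (the source mask) and the mask passed to the decoder's masked MHA (the target mask) are different.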

class DecoderBlock(nn.Module):
    def __init__(self, masked_mha: MultiHeadAttentionBlock, mha: MultiHeadAttentionBlock, ffn: FFNBlock, p_dropout: float):
        """
        Build a single decoder block with 1x Masked MHA, 1x Cross MHA, 1x FFN, 3x Residual Connections.
        :param masked_mha: Masked MHA instance for decoder
        :param mha: Multi-head cross attention block instance for decoder (encoder-decoder attention)
        :param ffn: Position-wise feed-forward network instance
        :param p_dropout: the dropout rate for each of the residual connection block
        """
        super().__init__()
        self.masked_mha_layer = masked_mha
        self.mha_layer = mha
        self.ffn_layer = ffn
        self.residual_connections = nn.ModuleList([ResidualConnection(p_dropout) for _ in range(3)])

    def forward(self, encoder_input, decoder_input, src_mask, target_mask):
        """
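        Apply one DecoderBlock to the decoder input, attending to the encoder output (with masks)
        :param encoder_input: output of the Transformer Encoder (keys/values of the cross attention)
        :param decoder_input: decoder-side input (target embeddings or the previous DecoderBlock's output)
        :param src_mask: source mask, applied in the cross attention over the encoder output
        :param target_mask: target mask, applied in the masked self-attention of the decoder
        :return: tensor processed by one single DecoderBlock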
        """
        decoder_input = self.residual_connections[0](
            decoder_input, lambda i: self.masked_mha_layer(i, i, i, target_mask))
        output = self.residual_connections[1](
            decoder_input, lambda d: self.mha_layer(d, encoder_input, encoder_input, src_mask))
        output = self.residual_connections[2](output, self.ffn_layer)
        return output


class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        """
        Construct a Transformer Decoder
        :param layers: the layers of multiple DecoderBlocks
        """
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x_from_encoder, x_from_decoder, src_mask, target_mask):
        for layer in self.layers:
            x_from_decoder = layer(x_from_encoder, x_from_decoder, src_mask, target_mask)
        return self.norm(x_from_decoder)
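
The difference between the source mask and the target mask can be sketched as follows. This is an illustration under stated assumptions, not the note's own data pipeline: the helper names `make_src_mask` and `make_target_mask`, the padding id `0`, and the convention that `True` means "may attend" are all hypothetical. The attention code has to agree on the same convention, for example by filling scores with a large negative value wherever the mask is `False`.

```python
import torch


def make_src_mask(src_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Padding mask only: (batch, src_len) -> (batch, 1, 1, src_len)."""
    return (src_ids != pad_id).unsqueeze(1).unsqueeze(2)


def make_target_mask(tgt_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Padding mask AND causal mask: (batch, tgt_len) -> (batch, 1, tgt_len, tgt_len)."""
    tgt_len = tgt_ids.size(1)
    padding = (tgt_ids != pad_id).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, tgt_len)
    # Lower-triangular matrix: position i may only look at positions <= i
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    return padding & causal  # broadcasts to (batch, 1, tgt_len, tgt_len)


PAD = 0  # assumed padding token id
src = torch.tensor([[5, 6, 7, PAD]])
tgt = torch.tensor([[1, 8, 9, PAD]])

src_mask = make_src_mask(src, PAD)
target_mask = make_target_mask(tgt, PAD)
print(src_mask.shape)     # torch.Size([1, 1, 1, 4])
print(target_mask.shape)  # torch.Size([1, 1, 4, 4])
print(target_mask[0, 0].int())
```

The extra singleton dimensions let one mask broadcast over all attention heads. The source mask hides padding only, while the target mask additionally hides future positions, so each decoder position sees only itself and earlier tokens.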

Projection Layer (`Linear`)

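At the very end of the Transformer architecture sits a Linear block, the pre-softmax linear transformation (see "Innovation 4"). In the paper its weights are shared with the Embedding layers (though without the $\sqrt{d_\text{model}}$ scaling), and its job is to map the generated feature tensors back onto the vocabulary.

For simplicity, we give this layer its own independent (untied) weights here: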

class ProjectionLayer(nn.Module):
    """
    Build a projection layer for Transformer to convert features back into vocabulary
    """
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # project: (batch, seq_len, d_model) -> (batch, seq_len, vocab_size)
        # log-softmax: convert the vocab_size dim to log-probabilities
        return torch.log_softmax(self.proj(x), dim=-1)
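
For comparison, the weight sharing described in the paper could look roughly like the sketch below. `TiedEmbeddingAndProjection` is a hypothetical helper that is not part of this note's implementation. It ties only one embedding table to the projection and drops the projection bias. It relies on `nn.Embedding(vocab_size, d_model)` and `nn.Linear(d_model, vocab_size)` both storing their weight with shape `(vocab_size, d_model)`.

```python
import math

import torch
import torch.nn as nn


class TiedEmbeddingAndProjection(nn.Module):
    """Hypothetical helper: one matrix serves as embedding table and output projection."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the weights: both attributes now point at the same nn.Parameter
        self.proj.weight = self.embedding.weight

    def embed(self, token_ids):
        # The sqrt(d_model) scaling is applied on the embedding side only
        return self.embedding(token_ids) * math.sqrt(self.d_model)

    def project(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size) log-probabilities
        return torch.log_softmax(self.proj(x), dim=-1)


tied = TiedEmbeddingAndProjection(d_model=16, vocab_size=100)
tokens = torch.tensor([[1, 2, 3]])
log_probs = tied.project(tied.embed(tokens))

shared = tied.proj.weight is tied.embedding.weight
n_params = sum(p.numel() for p in tied.parameters())  # the shared matrix is counted once
```

Tying removes one `vocab_size x d_model` matrix from the parameter count. The price is that the embedding and the projection can no longer be learned independently.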

Put It Together

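Finally, we define a Transformer class that combines the modules above (using a text-to-text mapping task as the example), and set initial values for the model's parameters: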

class Transformer(nn.Module):
    def __init__(
            self, encoder: Encoder, decoder: Decoder,
            src_embedding_layer: InputEmbeddings,
            target_embedding_layer: InputEmbeddings,
            src_pos_encoding_layer: PositionalEncoding,
            target_pos_encoding_layer: PositionalEncoding,
            proj_layer: ProjectionLayer):
        super().__init__()
        self.decoder = decoder
        self.encoder = encoder
        self.src_embed = src_embedding_layer
        self.target_embed = target_embedding_layer
        self.src_pos_encod = src_pos_encoding_layer
        self.target_pos_encod = target_pos_encoding_layer
        self.proj = proj_layer

    def encode(self, src, src_mask):
        src = self.src_embed(src)
        src = self.src_pos_encod(src)
        return self.encoder(src, src_mask)

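    def decode(self, encoder_output, src_mask, target, target_mask):
        # Decoder.forward expects (x_from_encoder, x_from_decoder, src_mask, target_mask)
        target = self.target_embed(target)
        target = self.target_pos_encod(target)
        return self.decoder(encoder_output, target, src_mask, target_mask)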

    def project(self, x):
        return self.proj(x)


def build_transformer(
        src_vocab_size: int, target_vocab_size: int,
        input_seq_len: int, output_seq_len: int,
        d_model: int = 512,
        n: int = 6,
        h: int = 8,
        dropout: float = 0.1,
        d_ff: int = 2048):
    """
    Construct a Transformer model from scratch.
    :param src_vocab_size: size of vocabulary for the source language
    :param target_vocab_size: size of vocabulary for the target language
    :param input_seq_len: the maximum length of the input token sequence (sizes the positional encoding)
    :param output_seq_len: the maximum length of the output token sequence (sizes the positional encoding)
    :param d_model: the dimension of the embedding vectors (feature dimension)
    :param n: the number of EncoderBlock/DecoderBlock in Encoder/Decoder
    :param h: the number of the heads in MHA
    :param dropout: dropout rate for all the dropout networks in the model
    :param d_ff: the size for the hidden layer in all the FFNs
    :return: the assembled Transformer model, with its weight matrices Xavier-initialized
    """
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    target_embed = InputEmbeddings(d_model, target_vocab_size)

    src_pos_enc = PositionalEncoding(d_model, input_seq_len, dropout)
    target_pos_enc = PositionalEncoding(d_model, output_seq_len, dropout)

    # Note: every EncoderBlock/DecoderBlock needs its own freshly constructed
    # sub-modules, so that no two blocks share parameters
    encoder_blocks = []
    decoder_blocks = []
    for _ in range(n):
        encoder_mha = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_masked_mha = MultiHeadAttentionBlock(d_model, h, dropout)
        cross_mha = MultiHeadAttentionBlock(d_model, h, dropout)

        encoder_ffn = FFNBlock(d_model, d_ff, dropout)
        decoder_ffn = FFNBlock(d_model, d_ff, dropout)

        encoder_blocks.append(EncoderBlock(encoder_mha, encoder_ffn, dropout))
        decoder_blocks.append(DecoderBlock(decoder_masked_mha, cross_mha, decoder_ffn, dropout))

    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))

    proj_layer = ProjectionLayer(d_model, target_vocab_size)

    model = Transformer(encoder, decoder, src_embed, target_embed, src_pos_enc, target_pos_enc, proj_layer)

    for p in model.parameters():
        if p.dim() > 1:
            # Xavier (Glorot) uniform init for weight matrices; 1-D parameters (e.g. biases) keep their defaults
            nn.init.xavier_uniform_(p)
    return model
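
To see what the initialization loop at the end of `build_transformer` does, the sketch below applies the same loop to a single `nn.Linear` layer. The layer sizes are chosen only for illustration. `nn.init.xavier_uniform_` samples from $U(-a, a)$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$ (gain 1), and the `p.dim() > 1` check skips 1-D parameters such as biases.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(512, 2048)
original_bias = layer.bias.detach().clone()

# Same loop as in build_transformer, applied to a single layer
for p in layer.parameters():
    if p.dim() > 1:  # weight matrix: (out_features, in_features)
        nn.init.xavier_uniform_(p)

fan_out, fan_in = layer.weight.shape
bound = math.sqrt(6.0 / (fan_in + fan_out))  # theoretical limit of the uniform range

max_abs_weight = layer.weight.abs().max().item()
bias_unchanged = torch.equal(layer.bias, original_bias)

print(f"bound = {bound:.4f}, max |w| = {max_abs_weight:.4f}")
print(f"bias left at its default init: {bias_unchanged}")
```

Because the bound depends on both the fan-in and the fan-out, wide layers such as the FFN hidden layer start with smaller weights than narrow ones.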