Softmax Regression · Deep Learning 01 | Hailiang ZHAO @ ZJU-CS

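Softmax regression is one of the simplest neural network models: the input is passed through a linear combination, and a softmax operation turns the result directly into a probability distribution, which is the output. Softmax regression generalizes logistic regression from two classes to multiple classes (the logistic function is replaced by the softmax operation).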

Model definition

In this section, vectors are column vectors.

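Let the label be $c \in \{1, ..., C \}$. For a sample $(\vec{x}, y)$, softmax regression predicts the probability that the label is $c$ as $$ p \left( y=c | \vec{x} \right) = \textrm{softmax} \left(W^\textrm{T} \vec{x} \right)_c = \frac{\exp \left(\vec{w}_c^\textrm{T} \vec{x} \right)}{\sum_{c'=1}^C \exp \left(\vec{w}_{c'}^\textrm{T} \vec{x} \right)} \in [0, 1], $$ where $\vec{w}_c$ is the weight vector of class $c$ and $W = \left( \vec{w}_1, ..., \vec{w}_C \right)$ collects these vectors as columns.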
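
The formula above can be written out by hand in a few lines. The following is a minimal sketch, not the author's implementation: the function name `softmax` and the example scores are hypothetical, and subtracting the maximum before exponentiating is an added numerical-stability step that does not appear in the text (it leaves the result unchanged).

```python
import torch

def softmax(z: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dimension of z."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    z = z - z.max(dim=-1, keepdim=True).values
    e = torch.exp(z)
    return e / e.sum(dim=-1, keepdim=True)

# Hypothetical scores w_c^T x for C = 3 classes.
scores = torch.tensor([2.0, 1.0, 0.1])
p = softmax(scores)

# Very large scores would overflow a naive exp; the shifted version stays finite.
p_large = softmax(torch.tensor([1000.0, 1000.0]))
```

The result should agree with the built-in `torch.softmax(scores, dim=-1)`.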

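Therefore, the prediction of softmax regression is $$ \hat{y} =\operatorname{argmax}_{c=1}^C p \left(y=c|\vec{x} \right) = \operatorname{argmax}_{c=1}^C \vec{w}_c^\textrm{T} \vec{x}. $$ The second equality holds because softmax preserves the ordering of its inputs. In essence, softmax regression is a single-layer neural network whose output layer is a softmax layer, so its output is a probability distribution.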
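
The equality of the two argmax expressions can be checked numerically. This is a small sketch with made-up sizes and random weights (all names here are hypothetical); note that PyTorch indices start at 0, whereas the labels in the text run from 1 to $C$.

```python
import torch

torch.manual_seed(0)

d, C = 3, 4                       # hypothetical feature and class counts
W = torch.randn(d, C)             # columns play the role of w_1, ..., w_C
x = torch.randn(d)                # one sample

logits = W.T @ x                  # the C scores w_c^T x
probs = torch.softmax(logits, dim=0)

# Softmax is monotone, so both argmax computations pick the same class.
pred_from_probs = int(torch.argmax(probs))
pred_from_logits = int(torch.argmax(logits))
```

In practice this means the softmax can be skipped entirely at prediction time when only the predicted label is needed.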

Vectorized expression for a single sample

To make it easier to define torch tensors, vectors in this section are row vectors.

Let $d$ be the number of features, with $\vec{x} \in \mathbb{R}^{1 \times d}$, let $W \in \mathbb{R}^{d \times C}$ be the weights to be learned, and let $\vec{b} \in \mathbb{R}^{1 \times C}$ be the bias. Then for a sample $\left( \vec{x}^{(i)}, y^{(i)} \right)$, the vectorized expression of softmax regression is $$\hat{\vec{y}}^{(i)} = \textrm{softmax} \left(\vec{x}^{(i)} W + \vec{b} \right),$$ where the elements of $ \hat{\vec{y}}^{(i)} \in \mathbb{R}^{1 \times C} $ are the probabilities that softmax regression assigns to each label.

Vectorized expression for multiple samples

To make it easier to define torch tensors, $\vec{b}$ is assumed to be a row vector.

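Let $X \in \mathbb{R}^{n \times d}$ be the feature matrix of $n$ samples, one sample per row. Then $$ \hat{Y} = \textrm{softmax} \left( XW + \vec{b} \right), $$ where softmax is applied to each row and $\hat{Y} \in \mathbb{R}^{n \times C}$. PyTorch broadcasts $\vec{b}$ automatically (it treats $\vec{b}$ as $\bigg( \underbrace{\vec{b};...;\vec{b}}_{n \text{ rows}} \bigg) \in \mathbb{R}^{n \times C}$ ).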
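
A short sketch of the batched forward pass may make the broadcasting concrete. The sizes and random tensors are hypothetical; the explicit `expand` call is only there to show that it produces the same result as relying on broadcasting.

```python
import torch

torch.manual_seed(0)

n, d, C = 5, 3, 4                 # hypothetical batch size, features, classes
X = torch.randn(n, d)             # one sample per row
W = torch.randn(d, C)
b = torch.randn(1, C)             # row-vector bias

# Broadcasting adds b to every row of XW; softmax is taken over each row.
Y_hat = torch.softmax(X @ W + b, dim=1)

# Same computation with the bias explicitly repeated n times along dim 0.
B = b.expand(n, C)
Y_hat_explicit = torch.softmax(X @ W + B, dim=1)
```

Each row of `Y_hat` is a probability distribution over the $C$ classes, so the rows sum to 1.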

Parameter learning

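The cross-entropy loss is used. Since each label vector is one-hot, only the predicted probability of the correct class contributes to the loss: $$ l ( W, \vec{b} ) = - \frac{1}{N} \sum_{n=1}^N \sum_{c=1}^C y_c^{(n)} \log \hat{y}_c^{(n)}, $$ where $N$ is the number of samples.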

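import torch

# Cross-entropy in PyTorch (the mean over the batch is not taken here)
def cross_entropy(y_hat, y):
    # y_hat: (n, C) predicted probabilities; y: (n,) integer class indices
    return -torch.log(y_hat.gather(1, y.view(-1, 1)))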
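
As a usage sketch, the function above can be compared with the double sum in the loss formula and with PyTorch's built-in `torch.nn.functional.cross_entropy`, which takes raw scores rather than probabilities and averages over the batch by default. The logits and labels below are made-up values.

```python
import torch
import torch.nn.functional as F

def cross_entropy(y_hat, y):
    # Picks the predicted probability of the true class for each row.
    return -torch.log(y_hat.gather(1, y.view(-1, 1)))

# Hypothetical scores for 2 samples and 3 classes, with 0-based labels.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
y = torch.tensor([0, 2])
y_hat = torch.softmax(logits, dim=1)

per_sample = cross_entropy(y_hat, y)          # shape (2, 1), no averaging
mean_loss = per_sample.mean()                 # the 1/N factor in the formula

# The same quantity written as the double sum over one-hot labels.
Y = F.one_hot(y, num_classes=3).float()
double_sum = -(Y * torch.log(y_hat)).sum(dim=1)

# Built-in reference: expects logits and returns the batch mean.
reference = F.cross_entropy(logits, y)
```

The `gather` call is simply a way to read off $\hat{y}_c^{(n)}$ for the true class without building the one-hot matrix.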

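Given this loss, one step of mini-batch gradient descent updates $W$ and $\vec{b}$ as $$ W_{t+1} \leftarrow W_t + \alpha \left( \frac{1}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \vec{x}^{(n)} \left(\vec{y}^{(n)} - \hat{\vec{y}}^{(n)}_{t}\right)^\textrm{T} \right) $$

$$ \vec{b}_{t+1} \leftarrow \vec{b}_t + \alpha \left( \frac{1}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \left(\vec{y}^{(n)} - \hat{\vec{y}}^{(n)}_{t}\right) \right), $$ where $\mathcal{B}$ is the mini-batch, $\alpha$ is the learning rate, $\hat{\vec{y}}^{(n)}_{t}$ is the prediction computed with $W_t$ and $\vec{b}_t$, and vectors are column vectors again. $\vec{y}^{(n)}$ is a one-hot vector: only the element at the position of the true label is 1.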
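
The update rule can be sanity-checked against autograd. The sketch below uses hypothetical sizes, random data, and the row-vector (PyTorch) layout, in which the sum of outer products over the batch becomes `X.T @ (Y - Y_hat)`; it is an illustration under these assumptions rather than the author's training code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n, d, C, alpha = 8, 5, 3, 0.1     # hypothetical batch size, dims, learning rate
X = torch.randn(n, d)             # one sample per row
y = torch.randint(0, C, (n,))     # 0-based integer labels
Y = F.one_hot(y, num_classes=C).float()

W = torch.randn(d, C, requires_grad=True)
b = torch.randn(C, requires_grad=True)

# Closed-form update from the formulas above.
with torch.no_grad():
    Y_hat = torch.softmax(X @ W + b, dim=1)
    W_new = W + alpha * (X.T @ (Y - Y_hat)) / n
    b_new = b + alpha * (Y - Y_hat).mean(dim=0)

# Same step via autograd on the mean cross-entropy loss.
loss = F.cross_entropy(X @ W + b, y)
loss.backward()
W_auto = W.detach() - alpha * W.grad
b_auto = b.detach() - alpha * b.grad
```

Both routes should give the same parameters up to floating-point error, which is a convenient check when implementing the update by hand.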

Derivation

The analysis below is for a single sample; the multi-sample formulas follow by direct analogy and are left as an exercise. Let $\vec{z} = W^\textrm{T} \vec{x} + \vec{b} \in \mathbb{R}^{C}$, so that $\hat{\vec{y}} = \textrm{softmax} (\vec{z})$ and $$ \frac{\partial \hat{\vec{y}}}{\partial \vec{z}} = \textrm{diag} \left( \textrm{softmax} (\vec{z}) \right) - \textrm{softmax} (\vec{z}) \cdot \left( \textrm{softmax} (\vec{z}) \right)^\textrm{T}. $$ Next, because $\vec{z} = W^\textrm{T} \vec{x} + \vec{b} = \left( \vec{w}_1^\textrm{T} \vec{x}, ..., \vec{w}_C^\textrm{T} \vec{x} \right)^\textrm{T} + \vec{b} \in \mathbb{R}^C$ with $\vec{w}_c \in \mathbb{R}^d$, for every $c = 1, ..., C$ we have $$ \frac{\partial \vec{z}}{\partial \vec{w}_c} = \left(\frac{\partial \vec{w}_1^\textrm{T} \vec{x}}{\partial \vec{w}_c}, ..., \frac{\partial \vec{w}_C^\textrm{T} \vec{x}}{\partial \vec{w}_c} \right) = \left(\vec{0}, ..., \underbrace{\vec{x}}_{\textrm{the }c\textrm{-th col}}, ..., \vec{0} \right) \triangleq M_c(\vec{x}) \in \mathbb{R}^{d \times C}. $$
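
The softmax Jacobian stated above can be verified numerically. This is a small sketch using `torch.autograd.functional.jacobian` on an arbitrary, made-up input vector.

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical input z with C = 4 entries.
z = torch.tensor([1.0, -0.5, 2.0, 0.3])
s = torch.softmax(z, dim=0)

# Closed form: diag(softmax(z)) - softmax(z) softmax(z)^T.
J_closed = torch.diag(s) - torch.outer(s, s)

# Jacobian computed by automatic differentiation, shape (4, 4).
J_auto = jacobian(lambda v: torch.softmax(v, dim=0), z)
```

Because the softmax outputs always sum to 1, every row and column of this Jacobian sums to 0.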

Because $l ( W, \vec{b} ) = - \vec{y}^\textrm{T} \log \hat{\vec{y}} \in \mathbb{R}$, where $\vec{y} \in \mathbb{R}^C$ is a one-hot vector and $\hat{\vec{y}} \in \mathbb{R}^C$ is the output of softmax regression, the chain rule gives ¹

$$ \begin{align} \frac{\partial l ( W, \vec{b} ) }{\partial \vec{w}_c} &= - \frac{\partial \vec{z}}{\partial \vec{w}_c} \cdot \frac{\partial \hat{\vec{y}}}{\partial \vec{z}} \cdot \frac{\partial \log \hat{\vec{y}}}{\partial \hat{\vec{y}}} \cdot \vec{y} \\ &= - M_c(\vec{x}) \left( \textrm{diag} \hat{\vec{y}} - \hat{\vec{y}} \cdot \hat{\vec{y}}^\textrm{T} \right) \left( \textrm{diag} \hat{\vec{y}} \right)^{-1} \cdot \vec{y} \\ &= - M_c(\vec{x}) \left( I - \hat{\vec{y}} \vec{1}^\textrm{T} \right) \vec{y} && \hat{\vec{y}}^\textrm{T} \left( \textrm{diag} \hat{\vec{y}} \right)^{-1} = \vec{1}^\textrm{T} \\ &= -M_c(\vec{x}) \left( \vec{y} - \hat{\vec{y}} \vec{1}^\textrm{T} \vec{y} \right) \\ &= -M_c(\vec{x}) \left( \vec{y} - \hat{\vec{y}} \right) && \vec{1}^\textrm{T} \vec{y} = 1 \textrm{ since } \vec{y} \textrm{ is one-hot } \\ &= -\vec{x} \left( \vec{y} - \hat{\vec{y}} \right)_c \\ &= -\vec{x} \bigg( \underbrace{\mathbf{1} (\vec{y}_c = 1) - \hat{\vec{y}}_c}_{\textrm{scalar}} \bigg). \end{align} $$

Collecting these columns for $c = 1, ..., C$ gives $$\frac{\partial l(W, \vec{b})}{\partial W} = -\vec{x} \big( \vec{y} - \hat{\vec{y}} \big)^\textrm{T}.$$

Similarly, since $\vec{b} \in \mathbb{R}^C$ is a column vector here, $$\frac{\partial l(W, \vec{b})}{\partial \vec{b}} = -\big( \vec{y} - \hat{\vec{y}} \big).$$
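
The two gradients can be checked against autograd for a single sample. The sketch below assumes the column-vector convention of the derivation ($\vec{z} = W^\textrm{T} \vec{x} + \vec{b}$) with hypothetical sizes and random values; since `b` is a 1-D tensor in PyTorch, the row versus column distinction does not matter for it.

```python
import torch

torch.manual_seed(0)

d, C = 4, 3                                   # hypothetical sizes
x = torch.randn(d)
W = torch.randn(d, C, requires_grad=True)
b = torch.randn(C, requires_grad=True)
y = torch.zeros(C)
y[1] = 1.0                                    # one-hot label, true class at index 1

# Forward pass and single-sample cross-entropy loss l = -y^T log(y_hat).
y_hat = torch.softmax(W.T @ x + b, dim=0)
loss = -(y * torch.log(y_hat)).sum()
loss.backward()

# Closed-form gradients from the derivation.
grad_W = -torch.outer(x, y - y_hat.detach())  # shape (d, C)
grad_b = -(y - y_hat.detach())                # shape (C,)
```

`W.grad` and `b.grad` should match `grad_W` and `grad_b` up to floating-point error.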

Finally

A PyTorch implementation of softmax regression is available at this link.

Reprint permission

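This work is licensed under a Creative Commons Attribution 4.0 International License. When reposting, please cite the link to the original article. You must give appropriate credit and indicate whether changes were made.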

¹ For more details on this derivation, see "Part II, 15 (14)" of the post "Algorithm Theory Fundamentals You Need to Master".