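Matrix calculus uses matrices and vectors to collect the partial derivatives of every component of the dependent variable with respect to every component of the independent variable. The key technique is to track the dimensions of each partial derivative. The convention used throughout (often called the denominator layout) indexes rows by the components of $x$ and columns by the components of $y$.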
Rules
1. Vector $\to$ scalar: $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}$,
$$ \frac{\partial y}{\partial x} = \bigg[ \frac{\partial y}{\partial x_1}, \cdots, \frac{\partial y}{\partial x_p} \bigg]^\textrm{T} \in \mathbb{R}^p $$
2. Scalar $\to$ vector: $\forall x \in \mathbb{R}, \forall y = f(x) \in \mathbb{R}^q$,
$$ \frac{\partial y}{\partial x} = \bigg[ \frac{\partial y_1}{\partial x}, \cdots, \frac{\partial y_q}{\partial x} \bigg] \in \mathbb{R}^{1 \times q} $$
3. Vector $\to$ vector: $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}^q$,
$$ \frac{\partial y}{\partial x} = \left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_q}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_p} & \cdots & \frac{\partial y_q}{\partial x_p} \end{matrix} \right] \in \mathbb{R}^{p \times q} $$
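The $p \times q$ layout can be checked numerically. The following is a minimal sketch using central finite differences in plain Python; the helper name `jacobian_denominator_layout` and the test function `f` are assumptions chosen only for illustration.

```python
def jacobian_denominator_layout(f, x, eps=1e-6):
    """Approximate dy/dx as a p x q nested list: entry [i][j] = d y_j / d x_i."""
    rows = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        y_plus, y_minus = f(x_plus), f(x_minus)
        # Central difference of every output component with respect to x_i.
        rows.append([(a - b) / (2 * eps) for a, b in zip(y_plus, y_minus)])
    return rows


# Hypothetical example: x in R^2 (p = 2), y in R^3 (q = 3).
def f(x):
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


J = jacobian_denominator_layout(f, [2.0, 3.0])
# J has p = 2 rows and q = 3 columns. Analytically the rows are
# [x_2, 1, 2 * x_1] and [x_1, 1, 0], i.e. about [3, 1, 4] and [2, 1, 0].
```

Row $i$ holds the derivatives of all outputs with respect to $x_i$, which is the transpose of the Jacobian used in the numerator layout.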
4. Sum rule: $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}^q, \forall z = g(x) \in \mathbb{R}^q$,
$$ \frac{\partial (y + z)}{\partial x} = \frac{\partial y}{\partial x} + \frac{\partial z}{\partial x} \in \mathbb{R}^{p \times q} $$
5. Product rule:
- $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}^q, \forall z = g(x) \in \mathbb{R}^q$,
  $$ \frac{\partial (y^\textrm{T} z)}{\partial x} = \frac{\partial y}{\partial x} z + \frac{\partial z}{\partial x} y \in \mathbb{R}^{p \times 1} $$
  The dimensions of the gradient are $p \times 1 = (p \times q) \times (q \times 1)$.
- $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}^s, \forall z = g(x) \in \mathbb{R}^t, A \in \mathbb{R}^{s \times t}$,
  $$ \frac{\partial (y^\textrm{T} A z)}{\partial x} = \frac{\partial y}{\partial x} A z + \frac{\partial z}{\partial x} A^\textrm{T} y \in \mathbb{R}^{p \times 1} $$
- $\forall x \in \mathbb{R}^p, \forall y = f(x) \in \mathbb{R}, \forall z = g(x) \in \mathbb{R}^q$,
  $$ \frac{\partial (yz)}{\partial x} = \frac{\partial y}{\partial x} z^\textrm{T} + y \frac{\partial z}{\partial x} \in \mathbb{R}^{p \times q} $$
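As a sanity check of the first product rule, the sketch below compares the finite-difference gradient of $y^\textrm{T} z$ with $\frac{\partial y}{\partial x} z + \frac{\partial z}{\partial x} y$. The functions `y` and `z`, the point `x0`, and the helpers `num_jacobian` and `matvec` are hypothetical and exist only for this illustration.

```python
def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian with entry [i][j] = d f_j / d x_i (p x q)."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    return rows


def matvec(M, v):
    """Matrix-vector product for a nested-list matrix."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def y(x):  # hypothetical y: R^2 -> R^3
    return [x[0] ** 2, x[0] * x[1], x[1]]


def z(x):  # hypothetical z: R^2 -> R^3
    return [x[1], x[0] + x[1], x[0] * x[1]]


def inner(x):  # y^T z, wrapped in a list so num_jacobian can handle it
    return [sum(a * b for a, b in zip(y(x), z(x)))]


x0 = [1.5, -0.5]
lhs = [row[0] for row in num_jacobian(inner, x0)]          # gradient of y^T z, length p
rhs = [a + b for a, b in zip(matvec(num_jacobian(y, x0), z(x0)),
                             matvec(num_jacobian(z, x0), y(x0)))]
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))        # should be close to zero
```

Both sides are vectors of length $p$, matching the dimension count $(p \times q) \times (q \times 1)$.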
6. Chain rule:
- $\forall x \in \mathbb{R}^p, u = u(x) \in \mathbb{R}^s, g = g(u) \in \mathbb{R}^t$,
  $$ \frac{\partial g}{\partial x} = \frac{\partial u}{\partial x} \cdot \frac{\partial g}{\partial u} \in \mathbb{R}^{p \times t} $$
- $\forall X \in \mathbb{R}^{p \times q}, u = u(X) \in \mathbb{R}^s, g = g(u) \in \mathbb{R}$,
  $$ \frac{\partial g}{\partial X_{ij}} = \frac{\partial u}{\partial X_{ij}} \cdot \frac{\partial g}{\partial u} \in \mathbb{R} $$
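In this layout the chain rule multiplies from left to right, $(p \times s) \times (s \times t)$, which is the reverse of the order used in the numerator layout. A minimal numerical sketch follows, with hypothetical functions `u` and `g` and hypothetical helpers `num_jacobian` and `matmul`.

```python
def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian with entry [i][j] = d f_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    return rows


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def u(x):  # hypothetical u: R^2 -> R^3
    return [x[0] + x[1], x[0] * x[1], x[1] ** 2]


def g(v):  # hypothetical g: R^3 -> R^2
    return [v[0] * v[1], v[1] + v[2]]


x0 = [1.0, 2.0]
du_dx = num_jacobian(u, x0)         # p x s = 2 x 3
dg_du = num_jacobian(g, u(x0))      # s x t = 3 x 2
chain = matmul(du_dx, dg_du)        # p x t = 2 x 2, via the chain rule
direct = num_jacobian(lambda x: g(u(x)), x0)  # differentiate the composition directly
```

For this choice of `u` and `g`, both `chain` and `direct` should be close to `[[8, 2], [5, 5]]`.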
7. Common results:
- $\forall x \in \mathbb{R}^p, \frac{\partial x}{\partial x} = I \in \mathbb{R}^{p \times p}$.
- $\forall x \in \mathbb{R}^p, \forall A \in \mathbb{R}^{q \times p}, \frac{\partial Ax}{\partial x} = A^\textrm{T} \in \mathbb{R}^{p \times q}$.
- $\forall x \in \mathbb{R}^p, \forall A \in \mathbb{R}^{p \times q}, \frac{\partial x^\textrm{T} A}{\partial x} = A \in \mathbb{R}^{p \times q}$.
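These results combine with the product rule. For instance, taking $y = z = x$ in the second product rule and using $\frac{\partial x}{\partial x} = I$ gives $\frac{\partial (x^\textrm{T} B x)}{\partial x} = (B + B^\textrm{T}) x$ for a square $B$. The sketch below checks this and $\frac{\partial Ax}{\partial x} = A^\textrm{T}$ by finite differences; the matrices `A` and `B`, the point `x0`, and the helper `num_jacobian` are hypothetical values picked for illustration.

```python
def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian with entry [i][j] = d f_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    return rows


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # q x p with q = 3, p = 2


def Ax(x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


x0 = [0.3, -1.2]
J = num_jacobian(Ax, x0)                     # should approximate A^T (2 x 3)
A_T = [list(col) for col in zip(*A)]

B = [[2.0, -1.0], [0.5, 3.0]]                # square and deliberately non-symmetric


def quad(x):  # x^T B x, wrapped in a list for num_jacobian
    return [sum(x[i] * B[i][j] * x[j] for i in range(2) for j in range(2))]


grad_num = [row[0] for row in num_jacobian(quad, x0)]
# Closed form (B + B^T) x, written out component by component.
grad_formula = [sum((B[i][j] + B[j][i]) * x0[j] for j in range(2)) for i in range(2)]
```

When $B$ is symmetric the closed form reduces to the familiar $2Bx$.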
For more results of this kind, see The Matrix Cookbook.
Examples
Gradient of the logistic function
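$\forall x \in \mathbb{R}$, let $\sigma (x) = \frac{1}{1 + e^{-x}}$. Differentiating directly gives $\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x) \big(1 - \sigma (x) \big)$. When the input is a vector $x \in \mathbb{R}^n$ and $\sigma$ is applied elementwise, each output depends only on its own input, so the derivative is diagonal:
$$ \sigma' (x) = \operatorname{diag} \Big( \sigma (x) \odot \big( 1- \sigma (x) \big) \Big) \in \mathbb{R}^{n \times n} $$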
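The diagonal structure can be confirmed numerically. This is a minimal sketch in plain Python; the function names `sigmoid`, `sigmoid_jacobian`, and `num_jacobian` and the test point `x0` are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Elementwise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def sigmoid_jacobian(x):
    """Analytic derivative: diag(sigma(x) * (1 - sigma(x)))."""
    s = sigmoid(x)
    n = len(x)
    return [[s[i] * (1.0 - s[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian with entry [i][j] = d f_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    return rows


x0 = [-1.0, 0.0, 2.5]
analytic = sigmoid_jacobian(x0)
numeric = num_jacobian(sigmoid, x0)
max_gap = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(3) for j in range(3))  # should be close to zero
```

The off-diagonal entries of `numeric` vanish because perturbing $x_i$ leaves $\sigma(x_j)$ unchanged for $j \neq i$.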
Gradient of the softmax function
$\forall x = [x_1, \cdots, x_n]^\textrm{T} \in \mathbb{R}^n$, the output $z = [z_1, \cdots, z_n]^\textrm{T}$ of the softmax function is defined as:
$$ z_i = \operatorname{softmax} (x)_i = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}} $$
The derivative of the softmax function is derived below.
First,
$$ z = \operatorname{softmax} (x) = \frac{1}{\sum_i e^{x_i}} [e^{x_1}, \cdots, e^{x_n}]^\textrm{T} = \frac{\exp (x)}{\sum_i e^{x_i}} = \frac{\exp (x)}{I_n^\textrm{T} \exp (x)} \in \mathbb{R}^n $$
where $I_n = [1, \cdots, 1]^\textrm{T} \in \mathbb{R}^n$ is the all-ones vector (not the identity matrix). Therefore
$$ \frac{\partial z}{\partial x} = \frac{\partial \Big( \frac{\exp (x)}{I_n^\textrm{T} \exp (x)} \Big)}{\partial x} = \frac{\partial \Big( \exp (x) \cdot \frac{1}{I_n^\textrm{T} \exp (x)} \Big)}{\partial x} $$
By the third product rule (a scalar times a vector), with the scalar $\frac{1}{I_n^\textrm{T} \exp (x)}$ and the vector $\exp (x)$, and then the chain rule for the reciprocal, we further have
$$ \begin{aligned} \frac{\partial \Big( \exp (x) \cdot \frac{1}{I_n^\textrm{T} \exp (x)} \Big)}{\partial x} &= \frac{\partial \exp (x)}{\partial x} \cdot \frac{1}{I_n^\textrm{T} \exp (x)} + \frac{\partial \Big( \frac{1}{I_n^\textrm{T} \exp (x)} \Big)}{\partial x} \exp^\textrm{T} (x) \\ &= \frac{\operatorname{diag} \Big( \exp (x) \Big)}{I_n^\textrm{T} \exp (x)} - \Bigg( \frac{1}{I_n^\textrm{T} \exp (x)} \Bigg)^2 \cdot \frac{\partial \Big( I_n^\textrm{T} \exp (x) \Big)}{\partial x} \exp^\textrm{T} (x) \end{aligned} $$
Next, by the first product rule with $y = \exp (x)$ and $z = I_n$, and since $I_n$ is constant so that $\frac{\partial I_n}{\partial x} = 0$, we have
$$ \frac{\partial \Big( I_n^\textrm{T} \exp (x) \Big)}{\partial x} = \frac{\partial \exp (x)}{\partial x} \cdot I_n + \frac{\partial I_n}{\partial x} \exp (x) = \operatorname{diag} \Big( \exp(x) \Big) I_n + 0 = \exp (x) $$
Therefore
$$ \begin{aligned} \frac{\partial z}{\partial x} &= \operatorname{diag} \Bigg( \frac{\exp (x) }{I_n^\textrm{T} \exp (x)} \Bigg) - \frac{\exp (x)}{I_n^\textrm{T} \exp (x)} \cdot \frac{\exp^\textrm{T} (x)}{I_n^\textrm{T} \exp (x)} \\ &= \operatorname{diag} \Big( \operatorname{softmax} (x) \Big) - \operatorname{softmax} (x) \cdot \operatorname{softmax}^\textrm{T} (x) \end{aligned} $$
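The closed form $\operatorname{diag}(z) - z z^\textrm{T}$ can be compared against finite differences. The sketch below is one possible check in plain Python; the names `softmax`, `softmax_jacobian`, and `num_jacobian` and the point `x0` are hypothetical, and the max-shift inside `softmax` is a standard numerical-stability device that is not part of the derivation above.

```python
import math


def softmax(x):
    m = max(x)  # shifting by the max leaves the result unchanged but avoids overflow
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def softmax_jacobian(x):
    """Closed form diag(z) - z z^T, where z = softmax(x)."""
    z = softmax(x)
    n = len(x)
    return [[(z[i] if i == j else 0.0) - z[i] * z[j] for j in range(n)]
            for i in range(n)]


def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian with entry [i][j] = d f_j / d x_i."""
    rows = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        rows.append([(a - b) / (2 * eps) for a, b in zip(f(xp), f(xm))])
    return rows


x0 = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(x0)
numeric = num_jacobian(softmax, x0)
max_gap = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(3) for j in range(3))  # should be close to zero
row_sums = [sum(row) for row in analytic]           # each close to zero
```

The matrix is symmetric, so the result is the same in either layout, and every row sums to zero because the softmax outputs always sum to one.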
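Conclusion
Matrix calculus is rarely taught to undergraduates outside mathematics-related majors. Yet it underpins machine learning and related theory, and it has to be mastered. This article is only a starting point; in actual research, complex problems should be analyzed case by case with the help of The Matrix Cookbook.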
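Reprinting
This work is licensed under the Creative Commons Attribution 4.0 International License. When reprinting, please cite the link to the original article. You must give appropriate credit and indicate whether changes were made.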