Is the Derivative of a Dot Product Always the Other Vector?
Introduction
In vector calculus, one of the most commonly encountered expressions is the dot product of two vectors. A particularly elegant and useful identity is the derivative of the dot product with respect to one of its vector components: \[ \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^\top \mathbf{h}) = \mathbf{h}^\top \] This article explores when this identity is valid, why it works, and under what conditions it may fail.
The Setup
Let \( \mathbf{u}, \mathbf{h} \in \mathbb{R}^n \) be two vectors of the same dimension. The dot product between them is: \[ \mathbf{u}^\top \mathbf{h} = \sum_{i=1}^n u_i h_i \] This expression is a scalar function, and we wish to compute its gradient with respect to \( \mathbf{u} \).
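As a quick sanity check, the sum form and the matrix form of the dot product agree numerically. A minimal sketch in NumPy, with arbitrarily chosen example vectors:

```python
import numpy as np

# Example vectors in R^3 (values chosen arbitrarily for illustration)
u = np.array([1.0, 2.0, 3.0])
h = np.array([4.0, -1.0, 0.5])

# The dot product as an explicit sum of elementwise products...
dot_sum = sum(u_i * h_i for u_i, h_i in zip(u, h))

# ...matches NumPy's built-in inner product u^T h
assert np.isclose(dot_sum, u @ h)
print(dot_sum)  # 3.5
```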
Taking the Derivative
Since each term \( u_i h_i \) is linear in \( u_i \), and \( h_i \) is treated as a constant, we get: \[ \frac{\partial}{\partial u_i}(u_i h_i) = h_i \] Therefore, the full gradient vector is: \[ \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^\top \mathbf{h}) = \begin{bmatrix} h_1 & h_2 & \cdots & h_n \end{bmatrix} = \mathbf{h}^\top \] This result is both intuitive and algebraically clean. (Here we use the numerator-layout convention, in which the derivative of a scalar with respect to a column vector is a row vector; in the denominator-layout convention, the same gradient would be written as the column vector \( \mathbf{h} \).)
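We can confirm the identity numerically: a central-finite-difference approximation of the gradient of \( \mathbf{u}^\top \mathbf{h} \) should recover \( \mathbf{h} \) componentwise. A minimal sketch, assuming the same kind of arbitrary example vectors as before:

```python
import numpy as np

def f(u, h):
    """Scalar function f(u) = u^T h, with h held constant."""
    return u @ h

u = np.array([1.0, 2.0, 3.0])
h = np.array([4.0, -1.0, 0.5])

# Central finite differences approximate each partial derivative df/du_i
eps = 1e-6
grad = np.array([
    (f(u + eps * e, h) - f(u - eps * e, h)) / (2 * eps)
    for e in np.eye(len(u))
])

# The numerical gradient matches h, as the identity predicts
assert np.allclose(grad, h)
```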
When Is This Identity Valid?
This derivative identity is valid under the following assumptions:
- \( \mathbf{u} \) and \( \mathbf{h} \) are vectors of the same length
- Only \( \mathbf{u} \) is treated as a variable; \( \mathbf{h} \) is held constant
- The function \( \mathbf{u}^\top \mathbf{h} \) is scalar-valued
In such cases, the dot product is symmetric, so the following also holds: \[ \frac{\partial}{\partial \mathbf{u}}(\mathbf{h}^\top \mathbf{u}) = \mathbf{h}^\top \]
When Does It Not Hold?
The identity does not hold if:
- \( \mathbf{h} \) is a function of \( \mathbf{u} \); in that case, apply the product rule
- The function is not scalar-valued (e.g., outer products)
- The transpose is misused in row-vs-column vector contexts
For example, if \( \mathbf{h} = f(\mathbf{u}) \), then: \[ \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^\top \mathbf{h}) = \mathbf{h}^\top + \mathbf{u}^\top \frac{\partial \mathbf{h}}{\partial \mathbf{u}} \] where \( \frac{\partial \mathbf{h}}{\partial \mathbf{u}} \) is the \( n \times n \) Jacobian of \( \mathbf{h} \). The second term comes from the product rule, because \( \mathbf{h} \) now varies with \( \mathbf{u} \).
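The dependent case can also be checked numerically. A minimal sketch using the hypothetical choice \( \mathbf{h}(\mathbf{u}) = A\mathbf{u} \), whose Jacobian is simply \( A \), so the product rule predicts a gradient of \( \mathbf{h}^\top + \mathbf{u}^\top A \):

```python
import numpy as np

# h(u) = A u is a hypothetical example; its Jacobian dh/du equals A
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
u = rng.standard_normal(n)

def g(u):
    """g(u) = u^T h(u) with h(u) = A u."""
    return u @ (A @ u)

# Numerical gradient of g via central finite differences
eps = 1e-6
num_grad = np.array([
    (g(u + eps * e) - g(u - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

# Analytic gradient from the product rule: h^T + u^T (dh/du)
# (as 1-D arrays, h stands in for h^T and u @ A for u^T A)
h = A @ u
analytic = h + u @ A

assert np.allclose(num_grad, analytic, atol=1e-5)
```

Dropping the second term here would give only \( \mathbf{h}^\top \), which the finite-difference check would reject.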
Conclusion
The identity \( \frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^\top \mathbf{h}) = \mathbf{h}^\top \) is a powerful shortcut in linear algebra and machine learning, simplifying many derivations in backpropagation and optimization. It should be used with care, however: the assumptions must hold, particularly that \( \mathbf{h} \), the vector you are not differentiating with respect to, is constant. When \( \mathbf{h} \) depends on \( \mathbf{u} \), apply the product rule instead.
Further Reading
- Magnus & Neudecker: Matrix Differential Calculus with Applications in Statistics and Econometrics
- Goodfellow et al.: Deep Learning — Appendix on Matrix Calculus
- CS231n: Gradient-Based Optimization