In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system
$$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$$
where $\boldsymbol{A}$ is symmetric positive-definite. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method [1] for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.
The intent of this article is to document the important steps in these derivations.
Derivation from the conjugate direction method
The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function
$$f(\boldsymbol{x}) = \boldsymbol{x}^{\mathrm{T}} \boldsymbol{A} \boldsymbol{x} - 2 \boldsymbol{b}^{\mathrm{T}} \boldsymbol{x}.$$
Since the gradient is $\nabla f(\boldsymbol{x}) = 2(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{b})$, the unique minimizer of $f$ is exactly the solution of $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$.

The conjugate direction method

In the conjugate direction method for minimizing $f(\boldsymbol{x})$, one starts with an initial guess $\boldsymbol{x}_0$ and the corresponding residual $\boldsymbol{r}_0 = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_0$, and computes the iterate and residual by the formulae
$$\begin{aligned}
\alpha_i &= \frac{\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{r}_i}{\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i},\\
\boldsymbol{x}_{i+1} &= \boldsymbol{x}_i + \alpha_i \boldsymbol{p}_i,\\
\boldsymbol{r}_{i+1} &= \boldsymbol{r}_i - \alpha_i \boldsymbol{A} \boldsymbol{p}_i,
\end{aligned}$$
where $\boldsymbol{p}_0, \boldsymbol{p}_1, \boldsymbol{p}_2, \ldots$ are a series of mutually conjugate directions, i.e.,
$$\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_j = 0$$
for any $i \neq j$.
The conjugate direction method is underspecified, in that it prescribes no formulae for selecting the directions $\boldsymbol{p}_0, \boldsymbol{p}_1, \boldsymbol{p}_2, \ldots$. Specific choices lead to various methods, including the conjugate gradient method and Gaussian elimination.
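As an illustration, here is a minimal NumPy sketch of the conjugate direction iteration. It assumes a set of mutually conjugate directions is supplied; the function name and interface are hypothetical.

```python
import numpy as np

def conjugate_direction_solve(A, b, x0, directions):
    """Minimize f(x) = x^T A x - 2 b^T x along supplied mutually
    A-conjugate directions p_0, p_1, ... (hypothetical interface)."""
    x = x0.astype(float)
    r = b - A @ x                    # r_0 = b - A x_0
    for p in directions:
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)   # alpha_i = p_i^T r_i / p_i^T A p_i
        x = x + alpha * p            # x_{i+1} = x_i + alpha_i p_i
        r = r - alpha * Ap           # r_{i+1} = r_i - alpha_i A p_i
    return x

# Example: the eigenvectors of a symmetric A are mutually A-conjugate,
# so n of them recover the exact solution in n steps (exact arithmetic).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
_, eigvecs = np.linalg.eigh(A)
x = conjugate_direction_solve(A, b, np.zeros(2), eigvecs.T)
```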
Derivation from the Arnoldi/Lanczos iteration

The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.
The general Arnoldi method

In the Arnoldi iteration, one starts with a vector $\boldsymbol{r}_0$ and gradually builds an orthonormal basis $\{\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3, \ldots\}$ of the Krylov subspace
$$\mathcal{K}(\boldsymbol{A}, \boldsymbol{r}_0) = \mathrm{span}\{\boldsymbol{r}_0, \boldsymbol{A}\boldsymbol{r}_0, \boldsymbol{A}^2\boldsymbol{r}_0, \ldots\}$$
by defining $\boldsymbol{v}_i = \boldsymbol{w}_i / \lVert \boldsymbol{w}_i \rVert_2$ where
$$\boldsymbol{w}_i = \begin{cases} \boldsymbol{r}_0 & \text{if } i = 1,\\ \boldsymbol{A}\boldsymbol{v}_{i-1} - \sum_{j=1}^{i-1} (\boldsymbol{v}_j^{\mathrm{T}} \boldsymbol{A} \boldsymbol{v}_{i-1}) \boldsymbol{v}_j & \text{if } i > 1. \end{cases}$$
In other words, for $i > 1$, $\boldsymbol{v}_i$ is found by Gram–Schmidt orthogonalization of $\boldsymbol{A}\boldsymbol{v}_{i-1}$ against $\{\boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_{i-1}\}$ followed by normalization.
Put in matrix form, the iteration is captured by the equation
$$\boldsymbol{A}\boldsymbol{V}_i = \boldsymbol{V}_{i+1} \boldsymbol{\tilde{H}}_i$$
where
$$\begin{aligned}
\boldsymbol{V}_i &= \begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \cdots & \boldsymbol{v}_i \end{bmatrix},\\
\boldsymbol{\tilde{H}}_i &= \begin{bmatrix}
h_{11} & h_{12} & h_{13} & \cdots & h_{1,i}\\
h_{21} & h_{22} & h_{23} & \cdots & h_{2,i}\\
 & h_{32} & h_{33} & \cdots & h_{3,i}\\
 & & \ddots & \ddots & \vdots\\
 & & & h_{i,i-1} & h_{i,i}\\
 & & & & h_{i+1,i}
\end{bmatrix} = \begin{bmatrix} \boldsymbol{H}_i\\ h_{i+1,i} \boldsymbol{e}_i^{\mathrm{T}} \end{bmatrix}
\end{aligned}$$
with
$$h_{ji} = \begin{cases} \boldsymbol{v}_j^{\mathrm{T}} \boldsymbol{A} \boldsymbol{v}_i & \text{if } j \leq i,\\ \lVert \boldsymbol{w}_{i+1} \rVert_2 & \text{if } j = i + 1,\\ 0 & \text{if } j > i + 1. \end{cases}$$
When applying the Arnoldi iteration to solving linear systems, one starts with $\boldsymbol{r}_0 = \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_0$, the residual corresponding to an initial guess $\boldsymbol{x}_0$. After each step of iteration, one computes $\boldsymbol{y}_i = \boldsymbol{H}_i^{-1}(\lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{e}_1)$ and the new iterate $\boldsymbol{x}_i = \boldsymbol{x}_0 + \boldsymbol{V}_i \boldsymbol{y}_i$.
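A minimal NumPy sketch of this procedure follows (this use of Arnoldi for linear systems is known as the full orthogonalization method). The function name and the fixed iteration count are illustrative assumptions, and breakdown ($h_{i+1,i} = 0$) is not handled.

```python
import numpy as np

def arnoldi_solve(A, b, x0, m):
    """Solve A x = b with m steps of the Arnoldi iteration (sketch)."""
    n = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))             # orthonormal basis v_1..v_{m+1}
    H = np.zeros((m + 1, m))             # extended Hessenberg matrix
    V[:, 0] = r0 / beta                  # v_1 = r_0 / ||r_0||_2
    for i in range(m):
        w = A @ V[:, i]                  # orthogonalize A v_i ...
        for j in range(i + 1):           # ... against all previous v_j
            H[j, i] = V[:, j] @ w
            w -= H[j, i] * V[:, j]
        H[i + 1, i] = np.linalg.norm(w)
        V[:, i + 1] = w / H[i + 1, i]    # normalize
    e1 = np.zeros(m)
    e1[0] = beta
    y = np.linalg.solve(H[:m, :m], e1)   # y_m = H_m^{-1} (||r_0|| e_1)
    return x0 + V[:, :m] @ y             # x_m = x_0 + V_m y_m
```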
The direct Lanczos method

For the rest of the discussion, we assume that $\boldsymbol{A}$ is symmetric positive-definite. By symmetry of $\boldsymbol{A}$, the upper Hessenberg matrix $\boldsymbol{H}_i = \boldsymbol{V}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{V}_i$ becomes symmetric and thus tridiagonal. It can then be more clearly denoted by
$$\boldsymbol{H}_i = \begin{bmatrix}
a_1 & b_2\\
b_2 & a_2 & b_3\\
 & \ddots & \ddots & \ddots\\
 & & b_{i-1} & a_{i-1} & b_i\\
 & & & b_i & a_i
\end{bmatrix}.$$
This enables a short three-term recurrence for $\boldsymbol{v}_i$ in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.
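For symmetric $\boldsymbol{A}$, the Gram–Schmidt loop in the Arnoldi sketch above collapses to this three-term recurrence. A minimal NumPy sketch, assuming no reorthogonalization and no breakdown handling (names are illustrative):

```python
import numpy as np

def lanczos_basis(A, r0, m):
    """Three-term Lanczos recurrence for symmetric A: returns the basis
    v_1..v_m and the tridiagonal entries a_1..a_m, b_2..b_m (sketch)."""
    n = len(r0)
    V = np.zeros((n, m))
    a = np.zeros(m)                  # a[i] is a_{i+1} in 1-based notation
    b = np.zeros(m)                  # b[i] is b_{i+1}; b[0] is unused
    V[:, 0] = r0 / np.linalg.norm(r0)
    for i in range(m):
        w = A @ V[:, i]
        a[i] = V[:, i] @ w
        w -= a[i] * V[:, i]          # subtract the diagonal term
        if i > 0:
            w -= b[i] * V[:, i - 1]  # subtract the one off-diagonal term
        if i + 1 < m:
            b[i + 1] = np.linalg.norm(w)
            V[:, i + 1] = w / b[i + 1]
    return V, a, b
```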
Since $\boldsymbol{A}$ is symmetric positive-definite, so is $\boldsymbol{H}_i$. Hence, $\boldsymbol{H}_i$ can be LU factorized without partial pivoting into
$$\boldsymbol{H}_i = \boldsymbol{L}_i \boldsymbol{U}_i = \begin{bmatrix}
1\\
c_2 & 1\\
 & \ddots & \ddots\\
 & & c_{i-1} & 1\\
 & & & c_i & 1
\end{bmatrix} \begin{bmatrix}
d_1 & b_2\\
 & d_2 & b_3\\
 & & \ddots & \ddots\\
 & & & d_{i-1} & b_i\\
 & & & & d_i
\end{bmatrix}$$
with convenient recurrences for $c_i$ and $d_i$:
$$\begin{aligned}
c_i &= b_i / d_{i-1},\\
d_i &= \begin{cases} a_1 & \text{if } i = 1,\\ a_i - c_i b_i & \text{if } i > 1. \end{cases}
\end{aligned}$$
Rewrite $\boldsymbol{x}_i = \boldsymbol{x}_0 + \boldsymbol{V}_i \boldsymbol{y}_i$ as
$$\begin{aligned}
\boldsymbol{x}_i &= \boldsymbol{x}_0 + \boldsymbol{V}_i \boldsymbol{H}_i^{-1} (\lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{e}_1)\\
&= \boldsymbol{x}_0 + \boldsymbol{V}_i \boldsymbol{U}_i^{-1} \boldsymbol{L}_i^{-1} (\lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{e}_1)\\
&= \boldsymbol{x}_0 + \boldsymbol{P}_i \boldsymbol{z}_i
\end{aligned}$$
with
$$\begin{aligned}
\boldsymbol{P}_i &= \boldsymbol{V}_i \boldsymbol{U}_i^{-1},\\
\boldsymbol{z}_i &= \boldsymbol{L}_i^{-1} (\lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{e}_1).
\end{aligned}$$
It is now important to observe that
$$\begin{aligned}
\boldsymbol{P}_i &= \begin{bmatrix} \boldsymbol{P}_{i-1} & \boldsymbol{p}_i \end{bmatrix},\\
\boldsymbol{z}_i &= \begin{bmatrix} \boldsymbol{z}_{i-1}\\ \zeta_i \end{bmatrix}.
\end{aligned}$$
In fact, there are short recurrences for $\boldsymbol{p}_i$ and $\zeta_i$ as well:
$$\begin{aligned}
\boldsymbol{p}_i &= \frac{1}{d_i} (\boldsymbol{v}_i - b_i \boldsymbol{p}_{i-1}),\\
\zeta_i &= -c_i \zeta_{i-1}.
\end{aligned}$$
With this formulation, we arrive at a simple recurrence for $\boldsymbol{x}_i$:
$$\begin{aligned}
\boldsymbol{x}_i &= \boldsymbol{x}_0 + \boldsymbol{P}_i \boldsymbol{z}_i\\
&= \boldsymbol{x}_0 + \boldsymbol{P}_{i-1} \boldsymbol{z}_{i-1} + \zeta_i \boldsymbol{p}_i\\
&= \boldsymbol{x}_{i-1} + \zeta_i \boldsymbol{p}_i.
\end{aligned}$$
The relations above straightforwardly lead to the direct Lanczos method, which turns out to be slightly more complex than the conjugate gradient method derived below.
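Combining the Lanczos three-term recurrence with the updates for $c_i$, $d_i$, $\boldsymbol{p}_i$, $\zeta_i$ and $\boldsymbol{x}_i$ gives the direct Lanczos method. A minimal NumPy sketch under the same assumptions (no breakdown handling or convergence test; names are illustrative):

```python
import numpy as np

def direct_lanczos_solve(A, rhs, x0, m):
    """Direct Lanczos sketch: interleave the three-term recurrence with
    the updates for c_i, d_i, p_i, zeta_i and x_i derived above."""
    r0 = rhs - A @ x0
    zeta = np.linalg.norm(r0)          # zeta_1 = ||r_0||_2
    v = r0 / zeta                      # v_1
    v_prev = np.zeros_like(v)
    p_prev = np.zeros_like(v)
    b_i = 0.0                          # placeholder; becomes b_2, b_3, ...
    d_prev = 1.0                       # placeholder; becomes d_1, d_2, ...
    x = x0.astype(float)
    for i in range(1, m + 1):
        w = A @ v
        a_i = v @ w                    # a_i = v_i^T A v_i
        if i == 1:
            d_i = a_i                  # d_1 = a_1
        else:
            c_i = b_i / d_prev         # c_i = b_i / d_{i-1}
            d_i = a_i - c_i * b_i      # d_i = a_i - c_i b_i
            zeta = -c_i * zeta         # zeta_i = -c_i zeta_{i-1}
        p = (v - b_i * p_prev) / d_i   # p_i = (v_i - b_i p_{i-1}) / d_i
        x = x + zeta * p               # x_i = x_{i-1} + zeta_i p_i
        if i == m:
            break
        w = w - a_i * v - b_i * v_prev # advance the Lanczos recurrence
        b_i = np.linalg.norm(w)        # b_{i+1}
        v_prev, v = v, w / b_i
        d_prev, p_prev = d_i, p
    return x
```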
The conjugate gradient method from imposing orthogonality and conjugacy

If we allow $\boldsymbol{p}_i$ to be scaled and compensate for the scaling in the constant factors, we potentially can have simpler recurrences of the form
$$\begin{aligned}
\boldsymbol{x}_i &= \boldsymbol{x}_{i-1} + \alpha_{i-1} \boldsymbol{p}_{i-1},\\
\boldsymbol{r}_i &= \boldsymbol{r}_{i-1} - \alpha_{i-1} \boldsymbol{A} \boldsymbol{p}_{i-1},\\
\boldsymbol{p}_i &= \boldsymbol{r}_i + \beta_{i-1} \boldsymbol{p}_{i-1}.
\end{aligned}$$
As premises for the simplification, we now derive the orthogonality of the $\boldsymbol{r}_i$ and the conjugacy of the $\boldsymbol{p}_i$, i.e., for $i \neq j$,
$$\begin{aligned}
\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{r}_j &= 0,\\
\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_j &= 0.
\end{aligned}$$
The residuals are mutually orthogonal because $\boldsymbol{r}_i$ is essentially a multiple of $\boldsymbol{v}_{i+1}$: for $i = 0$, $\boldsymbol{r}_0 = \lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{v}_1$; for $i > 0$,
$$\begin{aligned}
\boldsymbol{r}_i &= \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_i\\
&= \boldsymbol{b} - \boldsymbol{A}(\boldsymbol{x}_0 + \boldsymbol{V}_i \boldsymbol{y}_i)\\
&= \boldsymbol{r}_0 - \boldsymbol{A}\boldsymbol{V}_i \boldsymbol{y}_i\\
&= \boldsymbol{r}_0 - \boldsymbol{V}_{i+1} \boldsymbol{\tilde{H}}_i \boldsymbol{y}_i\\
&= \boldsymbol{r}_0 - \boldsymbol{V}_i \boldsymbol{H}_i \boldsymbol{y}_i - h_{i+1,i} (\boldsymbol{e}_i^{\mathrm{T}} \boldsymbol{y}_i) \boldsymbol{v}_{i+1}\\
&= \lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{v}_1 - \boldsymbol{V}_i (\lVert \boldsymbol{r}_0 \rVert_2 \boldsymbol{e}_1) - h_{i+1,i} (\boldsymbol{e}_i^{\mathrm{T}} \boldsymbol{y}_i) \boldsymbol{v}_{i+1}\\
&= -h_{i+1,i} (\boldsymbol{e}_i^{\mathrm{T}} \boldsymbol{y}_i) \boldsymbol{v}_{i+1}.
\end{aligned}$$
To see the conjugacy of the $\boldsymbol{p}_i$, it suffices to show that $\boldsymbol{P}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{P}_i$ is diagonal:
$$\begin{aligned}
\boldsymbol{P}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{P}_i &= \boldsymbol{U}_i^{-\mathrm{T}} \boldsymbol{V}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{V}_i \boldsymbol{U}_i^{-1}\\
&= \boldsymbol{U}_i^{-\mathrm{T}} \boldsymbol{H}_i \boldsymbol{U}_i^{-1}\\
&= \boldsymbol{U}_i^{-\mathrm{T}} \boldsymbol{L}_i \boldsymbol{U}_i \boldsymbol{U}_i^{-1}\\
&= \boldsymbol{U}_i^{-\mathrm{T}} \boldsymbol{L}_i,
\end{aligned}$$
which is simultaneously symmetric and lower triangular and thus must be diagonal.
Now we can derive the constant factors $\alpha_i$ and $\beta_i$ with respect to the scaled $\boldsymbol{p}_i$ by solely imposing the orthogonality of the $\boldsymbol{r}_i$ and the conjugacy of the $\boldsymbol{p}_i$.
Due to the orthogonality of the $\boldsymbol{r}_i$, it is necessary that $\boldsymbol{r}_{i+1}^{\mathrm{T}} \boldsymbol{r}_i = (\boldsymbol{r}_i - \alpha_i \boldsymbol{A}\boldsymbol{p}_i)^{\mathrm{T}} \boldsymbol{r}_i = 0$. As a result,
$$\begin{aligned}
\alpha_i &= \frac{\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{r}_i}{\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}\\
&= \frac{\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{r}_i}{(\boldsymbol{p}_i - \beta_{i-1} \boldsymbol{p}_{i-1})^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}\\
&= \frac{\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{r}_i}{\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}.
\end{aligned}$$
Similarly, due to the conjugacy of the $\boldsymbol{p}_i$, it is necessary that $\boldsymbol{p}_{i+1}^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i = (\boldsymbol{r}_{i+1} + \beta_i \boldsymbol{p}_i)^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i = 0$. As a result,
$$\begin{aligned}
\beta_i &= -\frac{\boldsymbol{r}_{i+1}^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}{\boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}\\
&= -\frac{\boldsymbol{r}_{i+1}^{\mathrm{T}} (\boldsymbol{r}_i - \boldsymbol{r}_{i+1})}{\alpha_i \boldsymbol{p}_i^{\mathrm{T}} \boldsymbol{A} \boldsymbol{p}_i}\\
&= \frac{\boldsymbol{r}_{i+1}^{\mathrm{T}} \boldsymbol{r}_{i+1}}{\boldsymbol{r}_i^{\mathrm{T}} \boldsymbol{r}_i}.
\end{aligned}$$
This completes the derivation.
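The recurrences for $\alpha_i$ and $\beta_i$, together with the updates above, constitute the standard conjugate gradient method. A minimal NumPy sketch (the tolerance and iteration cap are illustrative choices):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient method for symmetric positive-definite A."""
    x = x0.astype(float)
    r = b - A @ x                  # r_0 = b - A x_0
    p = r.copy()                   # p_0 = r_0
    rr = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rr / (p @ Ap)      # alpha_i = r_i^T r_i / p_i^T A p_i
        x += alpha * p             # x_{i+1} = x_i + alpha_i p_i
        r -= alpha * Ap            # r_{i+1} = r_i - alpha_i A p_i
        rr_next = r @ r
        if np.sqrt(rr_next) < tol: # stop once ||r_{i+1}||_2 < tol
            break
        beta = rr_next / rr        # beta_i = r_{i+1}^T r_{i+1} / r_i^T r_i
        p = r + beta * p           # p_{i+1} = r_{i+1} + beta_i p_i
        rr = rr_next
    return x
```

In exact arithmetic the residuals produced this way are mutually orthogonal and the directions are conjugate, so the method terminates in at most $n$ steps; in floating point it is run as an iterative method with a stopping tolerance, as here.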
References

[1] Hestenes, M. R.; Stiefel, E. (December 1952). "Methods of conjugate gradients for solving linear systems". Journal of Research of the National Bureau of Standards. 49 (6): 409. doi:10.6028/jres.049.044.
[2] Saad, Y. (2003). "Chapter 6: Krylov Subspace Methods, Part I". Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM. ISBN 978-0-89871-534-7.