3.3 Cholesky decomposition#

We now consider a special form of LU decomposition, when the matrix satisfies certain properties.

Definition 10

A real matrix \(A\) is positive definite if \(\boldsymbol{v}^TA\boldsymbol{v}>0\) for any nonzero vector \(\boldsymbol{v}\).

For a symmetric positive definite (SPD) matrix \(A\), the LU decomposition has the particular form of

\[\begin{equation*} A =R^TR \end{equation*}\]

where \(R\) is an upper triangular matrix. Such a decomposition is called Cholesky decomposition.

Theorem 16

An SPD matrix has the following properties:

  1. Eigenvalues of an SPD matrix are positive

  2. The determinant of an SPD matrix is positive

  3. Any principal submatrix of an SPD matrix is SPD. (A principal submatrix is a submatrix obtained by deleting the same set of rows and columns, so its diagonal elements are also diagonal elements of the original matrix.)
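These properties are easy to check numerically. As a quick illustration (a throwaway check with NumPy, separate from the decomposition algorithms developed below), consider the SPD matrix that reappears in Example 27:

import numpy as np

A = np.array([[4., -2], [-2, 10]])       # an SPD matrix (used again in Example 27)
print(np.linalg.eigvalsh(A))             # both eigenvalues are positive
print(np.linalg.det(A))                  # 36.0 > 0
print(np.linalg.eigvalsh(A[1:, 1:]))     # a principal submatrix is again SPD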

We use a \(2 \times 2\) example to motivate the algorithm for a general \(n \times n\) SPD matrix.

For a general \(2 \times 2\) SPD matrix,

\[\begin{equation*} A = \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \end{equation*}\]

the factor \(R\) in the Cholesky decomposition \(A=R^TR\) has the following general form:

\[\begin{equation*} R = \begin{bmatrix} \alpha & \beta \\ 0 & \gamma \end{bmatrix}. \end{equation*}\]

Then we have

\[\begin{equation*} R^TR = \begin{bmatrix} \alpha & 0 \\ \beta & \gamma \end{bmatrix} \begin{bmatrix} \alpha & \beta \\ 0 & \gamma \end{bmatrix} = \begin{bmatrix} \alpha^2 & \alpha\beta \\ \alpha\beta & \beta^2+\gamma^2 \end{bmatrix} = \begin{bmatrix} a & b \\ b & c \end{bmatrix} = A. \end{equation*}\]

An elementwise comparison of \(R^TR\) and \(A\) gives the following system of equations:

\[\begin{align*} \alpha^2 &= a \\ \alpha\beta &= b \\ \beta^2 + \gamma^2 &= c, \end{align*}\]

which leads to

(21)#\[\begin{equation} \alpha = \sqrt{a} \end{equation}\]
\[\begin{equation*} \beta = b/\alpha = b/\sqrt{a} \end{equation*}\]
(22)#\[\begin{equation} \gamma = \sqrt{c-\beta^2} = \sqrt{(ac-b^2)/a}. \end{equation}\]

Note that \(R\) is not unique, since a negative sign can be chosen for each of the two square roots in Eqns. (21) and (22).

Example 27

Find the Cholesky decomposition for the \(2 \times 2\) SPD matrix

\[\begin{equation*} \begin{bmatrix} 4 & -2 \\ -2 & 10 \end{bmatrix}. \end{equation*}\]

Solution.

\[\begin{align*} \alpha &= \sqrt{a} = \sqrt{4} = 2 \\ \beta &= b/\sqrt{a} = -2/2 = -1 \\ \gamma &= \sqrt{(ac-b^2)/a}= \sqrt{(4(10)-(-2)(-2))/4} = 3. \end{align*}\]

So the Cholesky decomposition is

\[\begin{equation*} \begin{bmatrix} 4 & -2 \\ -2 & 10 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 2 & -1 \\ 0 & 3 \end{bmatrix}. \end{equation*}\]
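As a quick check, NumPy's built-in np.linalg.cholesky reproduces this factorization; note that it returns the lower-triangular factor \(L=R^T\), so its transpose is the \(R\) obtained above.

import numpy as np

A = np.array([[4., -2], [-2, 10]])
L = np.linalg.cholesky(A)    # lower-triangular factor, L = R^T
print(L.T)                   # [[ 2. -1.]
                             #  [ 0.  3.]]
print(L @ L.T)               # recovers A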

The Cholesky decomposition algorithm for a \(2\times 2\) SPD matrix can be extended to a general \(n \times n\) SPD matrix by iteratively partitioning the matrix into blocks. The algorithm is summarized below:

  1. Partition the matrix

    \[\begin{equation*} A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \vdots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} = \begin{bmatrix} a_{11} & \boldsymbol{b}^T \\ \boldsymbol{b} & C \end{bmatrix} \end{equation*}\]

    where

    \[\begin{equation*} \boldsymbol{b} = \begin{bmatrix} a_{21} \\ \vdots \\ a_{n1} \end{bmatrix} = [a_{12}, \dots, a_{1n}]^T \end{equation*}\]

    due to the symmetry of \(A\), and

    \[\begin{equation*} C = \begin{bmatrix} a_{22} & a_{23} & \dots & a_{2n} \\ a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \vdots \\ a_{n2} & a_{n3} & \dots & a_{nn} \end{bmatrix} \end{equation*}\]

    is the symmetric lower-right \((n-1) \times (n-1)\) submatrix of \(A\).

  2. Set the \((1,1)\) entry of \(R\) to \(R_{11} = \alpha = \sqrt{a_{11}}\), and set the rest of the first row to \([R_{12},\cdots,R_{1n}]=\boldsymbol{\beta}^T = \boldsymbol{b}^T/ \sqrt{a_{11}}\).

  3. Repeat Steps 1 and 2 on the \((n-1) \times (n-1)\) matrix \(A^{(1)}=C-\boldsymbol{\beta}\boldsymbol{\beta}^T\) to update \(R_{22}\) and \([R_{23},\dots, R_{2n}]\). Note that \(\boldsymbol{\beta}\boldsymbol{\beta}^T\) is an outer product that results in an \((n-1) \times (n-1)\) matrix, and \(C-\boldsymbol{\beta}\boldsymbol{\beta}^T\) resembles \(c-\beta^2\) in Eqn. (22) in the \(2\times 2\) case.
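The following is a minimal recursive sketch of these three steps (the function chol_recursive and its interface are ours, written only to mirror the partitioning above; it assumes the input is SPD and performs no error checking). An iterative implementation, chol, is given later in this section.

import numpy as np

def chol_recursive(A):
    """A sketch of the blocked Cholesky steps above.
    Assumes A is symmetric positive definite; returns upper-triangular R."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    R = np.zeros((n, n))
    alpha = np.sqrt(A[0, 0])                  # R_11 = sqrt(a_11)
    R[0, 0] = alpha
    if n == 1:
        return R
    beta = A[0, 1:] / alpha                   # rest of the first row of R
    R[0, 1:] = beta
    A1 = A[1:, 1:] - np.outer(beta, beta)     # A^(1) = C - beta beta^T
    R[1:, 1:] = chol_recursive(A1)            # repeat on the (n-1) x (n-1) block
    return R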

Theorem 17

The algorithm above for Cholesky decomposition has a cost of \(\frac{1}{3}n^3+\frac{1}{2}n^2+\frac{1}{6}n\) floating point operations (flops).

Proof. The first step involves computing the square root of \(a_{11}\) (1 flop), dividing the vector \([a_{12}, \dots, a_{1n}]\) by this square root (\(n-1\) flops), and computing \(A^{(1)}\). Calculating \(\boldsymbol{\beta}\boldsymbol{\beta}^T\) costs \(\frac{(n-1)n}{2}\) flops, since only the upper or lower triangular part, including the diagonal, needs to be computed due to symmetry, and calculating \(C-\boldsymbol{\beta}\boldsymbol{\beta}^T\) costs the same. So the total cost of performing the first step is

\[\begin{equation*} 1+(n-1)+\frac{(n-1)n}{2} + \frac{(n-1)n}{2} \end{equation*}\]

flops. In general, Step \(k\) costs

\[\begin{equation*} 1+(n-k)+(n-k+1)(n-k). \end{equation*}\]

Summing the costs of all steps leads to:

\[\begin{align*} \sum_{k=1}^n \left[1+(n-k)+(n-k+1)(n-k)\right] &= \sum_{k=1}^n (n^2+2n+1) - (2n+2)\sum_{k=1}^n k + \sum_{k=1}^n k^2 \\ &= (n^3+2n^2+n) - (2n+2)\frac{(n+1)n}{2} + \frac{n(n+1)(2n+1)}{6} \\ &= \frac{1}{3}n^3+\frac{1}{2}n^2+\frac{1}{6}n, \end{align*}\]

where we used the formula

\[\begin{equation*} \sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6}. \end{equation*}\]
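As a quick sanity check of this closed form (a throwaway snippet, separate from the proof), the per-step costs can be summed numerically; for \(n=10\) both expressions give 385:

n = 10
print(sum(1 + (n - k) + (n - k + 1)*(n - k) for k in range(1, n + 1)))  # 385
print(n**3/3 + n**2/2 + n/6)                                            # 385.0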

Remark 8

Cholesky decomposition is not unique unless the restriction that the diagonal elements of \(R\) be positive is imposed.

Example 28

Find the Cholesky decomposition for the \(3 \times 3\) SPD matrix

\[\begin{equation*} \begin{bmatrix} 1 & -3 & -2 \\ -3 & 10 & 9\\ -2 & 9 & 29 \end{bmatrix}. \end{equation*}\]

Solution.

Step 1. Update \(R_{11}\) and \([R_{12},\dots,R_{1n}]=\boldsymbol{\beta}^T\) using \(a_{11}=1\) and \(\boldsymbol{b}^T=[-3,-2]\):

\[\begin{equation*} R = \begin{bmatrix} 1 & \begin{matrix} -3 & -2 \end{matrix} \\ \begin{matrix} 0 \\ 0 \end{matrix} & \begin{matrix} R^{(1)} \end{matrix} \end{bmatrix} \end{equation*}\]

where \(R^{(1)}\) is the lower \((n-1) \times (n-1)\) submatrix of \(R\).

Step 2. Calculate

\[\begin{equation*} A^{(1)}=C-\boldsymbol{\beta}\boldsymbol{\beta}^T = \begin{bmatrix} 10 & 9 \\ 9 & 29 \end{bmatrix} - \begin{bmatrix} -3 \\ -2 \end{bmatrix} \begin{bmatrix} -3 & -2 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 3 & 25 \end{bmatrix}. \end{equation*}\]

Step 3. Repeat Step 1 on \(A^{(1)}\) to update \(R^{(1)}\), i.e., update \(R^{(1)}_{11}\) and \(R^{(1)}_{12}\) with \(\sqrt{1}=1\) and \(3/1=3\):

\[\begin{equation*} R^{(1)} = \begin{bmatrix} 1 & 3 \\ 0 & R^{(2)} \end{bmatrix} \end{equation*}\]

where \(R^{(2)}\) is a scalar that needs to be updated next. Now \(R\) becomes

\[\begin{equation*} R = \begin{bmatrix} 1 & -3 & -2 \\ 0 & 1 & 3 \\ 0 & 0 & R^{(2)} \end{bmatrix}. \end{equation*}\]

Step 4. Calculate \(A^{(2)}=25-3(3)=16\), based on which \(R^{(2)}\) can be computed by \(R^{(2)}=\sqrt{16}=4\). So the final Cholesky factor \(R\) for \(A\) is:

\[\begin{equation*} R = \begin{bmatrix} 1 & -3 & -2 \\ 0 & 1 & 3 \\ 0 & 0 & 4 \end{bmatrix}. \end{equation*}\]
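A quick check with NumPy confirms that \(R^TR\) indeed reproduces \(A\):

import numpy as np

R = np.array([[1., -3, -2], [0, 1, 3], [0, 0, 4]])
print(R.T @ R)    # [[ 1. -3. -2.]
                  #  [-3. 10.  9.]
                  #  [-2.  9. 29.]]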

The following Python function chol implements Cholesky factorization for symmetric positive-definite matrices.

import numpy as np
def chol(A):
    """
    Perform Cholesky algorithm for a symmetric positive-definite 
    matrix A
    A: symmetric positive-definite matrix. A is changed in place in this code
    Return: R, an upper-triangular matrix such that A=R^TR
    """
    if not np.array_equal(A.T,A):
        print('Error: the matrix A is not symmetric!')
        return
        
    n = A.shape[0]
    R = np.zeros((n,n))
    for k in range(n):
        # A positive definite matrix must have positive pivots
        if A[k,k] <= 0:
            print('Error: the matrix is not positive-definite!')
            return
        R[k,k] = np.sqrt(A[k,k])
        u = A[k,k+1:n]/R[k,k]              # the rest of row k of R
        R[k,k+1:n] = u
        # Update the trailing submatrix: C - beta*beta^T
        A[k+1:n, k+1:n] -= np.outer(u,u)
        
    return R

Apply chol to the following matrix:

\[\begin{align*} \begin{bmatrix} 4 & -2 \\ -2 & 10 \end{bmatrix}. \end{align*}\]
A = np.array([[4., -2], [-2, 10]])
R = chol(A)
print('R = ', R)
print('RT x R = ', R.T.dot(R))
R =  [[ 2. -1.]
 [ 0.  3.]]
RT x R =  [[ 4. -2.]
 [-2. 10.]]

For the following matrix

\[\begin{align*} \begin{bmatrix} 1 & 1 & 2 \\ 1 & 5 & 0 \\ 2 & 0 & 9 \end{bmatrix}, \end{align*}\]

the Python function gives

A = np.array([[1., 1, 2], [1, 5, 0], [2, 0, 9]])
R = chol(A)
print('R = ', R)
print('RT x R = ', R.T.dot(R))
R =  [[ 1.  1.  2.]
 [ 0.  2. -1.]
 [ 0.  0.  2.]]
RT x R =  [[1. 1. 2.]
 [1. 5. 0.]
 [2. 0. 9.]]

Exercise 3.3-1

Find the Cholesky decomposition for the following matrix

\[\begin{equation*} \begin{bmatrix} 36 & -48 \\ -48 & 73 \\ \end{bmatrix}. \end{equation*}\]

Exercise 3.3-2

Find the Cholesky decomposition for the following matrix

\[\begin{equation*} \begin{bmatrix} 4 & -6 & 2 \\ -6 & 34 & 12 \\ 2 & 12 & 14 \end{bmatrix}. \end{equation*}\]

Exercise 3.3-3

Find the Cholesky decomposition for the following matrix. Does the algorithm work? If not, what is the problem?

\[\begin{equation*} \begin{bmatrix} 4 & -2 & 2 \\ -2 & 5 & 1 \\ 2 & 1 & -2 \end{bmatrix}. \end{equation*}\]

Using Cholesky decomposition to solve linear systems#

Cholesky decomposition can be used to solve linear systems in the same fashion as LU decomposition. Let the linear system be \(A\boldsymbol{x}=\boldsymbol{b}\), where \(A\) is symmetric positive-definite, with a Cholesky decomposition \(A=R^TR\). Then the system can be rewritten as \(R^TR\boldsymbol{x}=\boldsymbol{b}\), which can be split into two separate linear systems:

\[\begin{align*} R^T\boldsymbol{c} &= \boldsymbol{b} \\ R\boldsymbol{x} &= \boldsymbol{c}. \end{align*}\]

Solving these two systems in turn, by forward and then back substitution, yields the solution of the original linear system.
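As a sketch of this two-stage procedure (the helper solve_chol below is ours and is written for clarity rather than efficiency), the forward and back substitutions can be coded directly from the Cholesky factor \(R\):

import numpy as np

def solve_chol(R, b):
    """Solve Ax = b given the Cholesky factor R with A = R^T R:
    forward substitution on R^T c = b, then back substitution on R x = c."""
    n = R.shape[0]
    c = np.zeros(n)
    for i in range(n):                       # R^T is lower triangular
        c[i] = (b[i] - np.dot(R[:i, i], c[:i])) / R[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):           # R is upper triangular
        x[i] = (c[i] - np.dot(R[i, i+1:], x[i+1:])) / R[i, i]
    return x

print(solve_chol(np.array([[1., -3, -2], [0, 1, 3], [0, 0, 4]]),
                 np.array([0., -5, -47])))   # [-1.  1. -2.] (cf. Example 29 below)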

Example 29

Solve the system of linear equations

\[\begin{equation*} \begin{bmatrix} 1 & -3 & -2 \\ -3 & 10 & 9\\ -2 & 9 & 29 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ -5 \\ -47 \end{bmatrix}. \end{equation*}\]

Solution.

From Example 28, we obtained

\[\begin{equation*} R = \begin{bmatrix} 1 & -3 & -2 \\ 0 & 1 & 3 \\ 0 & 0 & 4 \end{bmatrix}. \end{equation*}\]

So the two systems of linear equations to be solved are:

\[\begin{align*} \begin{bmatrix} 1 & 0 & 0 \\ -3 & 1 & 0 \\ -2 & 3 & 4 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix} &= \begin{bmatrix} 0 \\ -5 \\ -47 \end{bmatrix} \\ \begin{bmatrix} 1 & -3 & -2 \\ 0 & 1 & 3 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} &= \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}. \end{align*}\]

The first linear system gives

\[\begin{equation*} \boldsymbol{c} = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix} = \begin{bmatrix} 0 \\ -5 \\ -8 \end{bmatrix}. \end{equation*}\]

The second leads to

\[\begin{equation*} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -1 \\1 \\ -2 \end{bmatrix}. \end{equation*}\]

In addition to solving systems of linear equations, Cholesky decomposition has a wide range of applications, such as generating samples from a general multivariate normal distribution and non-linear optimization.
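As a minimal sketch of the sampling application (the mean mu and covariance Sigma below are made-up illustrative values): if \(\Sigma=LL^T\) is the Cholesky decomposition of a covariance matrix, then \(\boldsymbol{\mu}+L\boldsymbol{z}\), with \(\boldsymbol{z}\) a vector of independent standard normal variables, is a sample from \(N(\boldsymbol{\mu},\Sigma)\).

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4., -2], [-2, 10]])    # a symmetric positive-definite covariance

L = np.linalg.cholesky(Sigma)             # Sigma = L L^T, L lower triangular
z = rng.standard_normal((2, 10000))       # i.i.d. standard normal samples
samples = mu[:, None] + L @ z             # each column is a sample from N(mu, Sigma)

print(np.cov(samples))                    # approximately Sigma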

Exercise 3.3-4

Solve the following linear system using Cholesky decomposition:

\[\begin{equation*} \begin{bmatrix} 4 & -6 & 2 \\ -6 & 34 & 12 \\ 2 & 12 & 14 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -24 \\ 126 \\ 46 \end{bmatrix}. \end{equation*}\]

Conjugate gradient method#

Definition 11

Let \(A\) be an \(n \times n\) symmetric positive definite matrix. Two vectors \(\boldsymbol{u}\in \mathbb{R}^{n}\) and \(\boldsymbol{v}\in \mathbb{R}^{n}\) are A-orthogonal or A-conjugate if \(\boldsymbol{u}^TA\boldsymbol{v}=0\).

Note that the symmetric positive definite matrix \(A\) induces an inner product on \(\mathbb{R}^n\), \(\langle\boldsymbol{u}, \boldsymbol{v} \rangle_A=\boldsymbol{u}^TA\boldsymbol{v}\).

Definition 12

Let \(V\) be a vector space. An inner product on \(V\) is a function \(\langle \cdot, \cdot \rangle\),

\[\begin{equation*} V \times V \rightarrow \mathbb{F} \,\,(\mathbb{R} \text{ or } \mathbb{C}) \end{equation*}\]

such that the following properties hold:

  1. For any \(\boldsymbol{u}\in V\), \(\langle\boldsymbol{u}, \boldsymbol{u}\rangle \ge 0\), with equality if and only if \(\boldsymbol{u}=\boldsymbol{0}\)

  2. For any \(\boldsymbol{u}, \boldsymbol{v}, \boldsymbol{w}\in V\), \(a, b\in \mathbb{F}\), \(\langle a\boldsymbol{u}+b\boldsymbol{v}, \boldsymbol{w} \rangle = a\langle \boldsymbol{u}, \boldsymbol{w} \rangle + b\langle \boldsymbol{v}, \boldsymbol{w} \rangle\)

  3. For any \(\boldsymbol{u}, \boldsymbol{v}\in V\), \(\langle\boldsymbol{u}, \boldsymbol{v}\rangle = \overline{\langle\boldsymbol{v}, \boldsymbol{u}\rangle}\), where \(\overline{\boldsymbol{u}}\) denotes the conjugate of \(\boldsymbol{u}\).

Now consider solving the linear system

(23)#\[A\boldsymbol{x} = \boldsymbol{b}\]

where \(A\in\mathbb{R}^{n \times n}\) and \(\boldsymbol{b}\in\mathbb{R}^{n}\). Suppose \(\boldsymbol{u}_1,\boldsymbol{u}_2, \dots, \boldsymbol{u}_n\) is an A-orthogonal basis of \(\mathbb{R}^{n}\). Let \(\boldsymbol{x}^*\) be the solution to Eqn. (23). Then \(\boldsymbol{x}^*\) can be written as a linear combination of the basis as

(24)#\[\boldsymbol{x}^* = a_1\boldsymbol{u}_1 + a_2\boldsymbol{u}_2 + \cdots + a_n\boldsymbol{u}_n.\]

The coefficients \(a_i\), \(1\le i\le n\), can be obtained by

(25)#\[a_i = \frac{\langle \boldsymbol{b},\boldsymbol{u}_i \rangle}{\langle \boldsymbol{u}_i,\boldsymbol{u}_i \rangle_A} = \frac{\boldsymbol{b}^T\boldsymbol{u}_i}{\boldsymbol{u}_i^TA\boldsymbol{u}_i}.\]

Before we get to the conjugate gradient algorithm, we first review the well-known Gram-Schmidt orthogonalization algorithm.

Theorem 18

Given a set of linearly independent vectors \(\boldsymbol{v}_1, \boldsymbol{v}_2, \cdots, \boldsymbol{v}_n\in \mathbb{R}^n\), the following process generates a set of orthogonal vectors:

  1. \(\boldsymbol{u}_1 = \boldsymbol{v}_1\)

  2. \(\boldsymbol{u}_2 = \boldsymbol{v}_2 - \frac{\langle \boldsymbol{v}_2, \boldsymbol{u}_1 \rangle}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle} \boldsymbol{u}_1 = \boldsymbol{v}_2 - \frac{\boldsymbol{v}_2 \cdot \boldsymbol{u}_1}{||\boldsymbol{u}_1||_2^2}\boldsymbol{u}_1\)

  3. \(\boldsymbol{u}_3 = \boldsymbol{v}_3 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_1 \rangle}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle} \boldsymbol{u}_1 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_2 \rangle}{\langle \boldsymbol{u}_2, \boldsymbol{u}_2 \rangle} \boldsymbol{u}_2 = \boldsymbol{v}_3 - \frac{\boldsymbol{v}_3 \cdot \boldsymbol{u}_1}{||\boldsymbol{u}_1||_2^2}\boldsymbol{u}_1 - \frac{\boldsymbol{v}_3 \cdot \boldsymbol{u}_2}{||\boldsymbol{u}_2||_2^2}\boldsymbol{u}_2\)

  4. Similarly, \(\boldsymbol{u}_4, \dots, \boldsymbol{u}_n\) can be obtained, and \(\boldsymbol{u}_1,\dots, \boldsymbol{u}_n\) is a set of orthogonal vectors.

Example 30

Orthogonalize the following set of vectors

\[\begin{equation*} \boldsymbol{v}_1 = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}, \boldsymbol{v}_2 = \begin{bmatrix} -1 \\ 2 \\ 3 \end{bmatrix}, \boldsymbol{v}_3 = \begin{bmatrix} 3 \\ 0 \\ 3 \end{bmatrix}. \end{equation*}\]

Solution.

  1. \[\begin{equation*} \boldsymbol{u_1} = \boldsymbol{v}_1 = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \end{equation*}\]
  2. \[\begin{equation*} \boldsymbol{u}_2 = \boldsymbol{v}_2 - \frac{\langle \boldsymbol{v}_2, \boldsymbol{u}_1 \rangle}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle} \boldsymbol{u}_1 = \begin{bmatrix} -1 \\ 2 \\ 3 \end{bmatrix} - \frac{[-1,2,3] \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} }{1^2+2^2+2^2} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \\ 1 \end{bmatrix}. \end{equation*}\]
  3. \[\begin{equation*} \boldsymbol{u}_3 = \boldsymbol{v}_3 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_1 \rangle}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle} \boldsymbol{u}_1 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_2 \rangle}{\langle \boldsymbol{u}_2, \boldsymbol{u}_2 \rangle} \boldsymbol{u}_2 = \begin{bmatrix} 3 \\ 0 \\ 3 \end{bmatrix} - \frac{[3,0,3] \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} }{1^2+2^2+2^2} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} - \frac{[3,0,3] \begin{bmatrix} -2 \\ 0 \\ 1 \end{bmatrix} }{(-2)^2+0^2+1^2} \begin{bmatrix} -2 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{4}{5} \\ -2 \\ \frac{8}{5} \end{bmatrix}. \end{equation*}\]
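As a quick numerical check of this example (the helper gram_schmidt below is a direct transcription of Theorem 18, not library code):

import numpy as np

def gram_schmidt(V):
    """Orthogonalize the columns of V with the process of Theorem 18."""
    U = np.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float)
        for i in range(j):
            u -= (V[:, j] @ U[:, i]) / (U[:, i] @ U[:, i]) * U[:, i]
        U[:, j] = u
    return U

V = np.array([[1., -1, 3], [2, 2, 0], [2, 3, 3]])   # columns are v1, v2, v3
U = gram_schmidt(V)
print(U)          # columns: [1, 2, 2], [-2, 0, 1], [4/5, -2, 8/5]
print(U.T @ U)    # the off-diagonal entries are (numerically) zero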

Similarly, we can generate a set of A-orthogonal vectors from a set of linearly independent vectors by slightly modifying the algorithm above.

Theorem 19

Given a set of linearly independent vectors \(\boldsymbol{v}_1, \boldsymbol{v}_2, \cdots, \boldsymbol{v}_n\in \mathbb{R}^n\), the following process generates a set of A-orthogonal vectors:

  1. \(\boldsymbol{u}_1 = \boldsymbol{v}_1\)

  2. \(\boldsymbol{u}_2 = \boldsymbol{v}_2 - \frac{\langle \boldsymbol{v}_2, \boldsymbol{u}_1 \rangle_A}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle_A} \boldsymbol{u}_1 = \boldsymbol{v}_2 - \frac{\boldsymbol{v}_2^T A \boldsymbol{u}_1}{\boldsymbol{u}_1^TA\boldsymbol{u}_1}\boldsymbol{u}_1\)

  3. \(\boldsymbol{u}_3 = \boldsymbol{v}_3 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_1 \rangle_A}{\langle \boldsymbol{u}_1, \boldsymbol{u}_1 \rangle_A} \boldsymbol{u}_1 - \frac{\langle \boldsymbol{v}_3, \boldsymbol{u}_2 \rangle_A}{\langle \boldsymbol{u}_2, \boldsymbol{u}_2 \rangle_A} \boldsymbol{u}_2 = \boldsymbol{v}_3 - \frac{\boldsymbol{v}_3^T A \boldsymbol{u}_1}{\boldsymbol{u}_1^TA\boldsymbol{u}_1}\boldsymbol{u}_1 - \frac{\boldsymbol{v}_3^T A \boldsymbol{u}_2}{\boldsymbol{u}_2^TA\boldsymbol{u}_2}\boldsymbol{u}_2\)

  4. Similarly, \(\boldsymbol{u}_4, \dots, \boldsymbol{u}_n\) can be obtained, and \(\boldsymbol{u}_1,\dots, \boldsymbol{u}_n\) is a set of A-orthogonal vectors.
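As a small numerical illustration (a sketch, separate from the conjugate gradient code developed below), we can build an A-orthogonal basis of \(\mathbb{R}^2\) from the standard basis using Theorem 19 and recover the solution of a small SPD system via Eqns. (24) and (25):

import numpy as np

A = np.array([[4., -2], [-2, 10]])
b = np.array([4., 34.])

# A-orthogonalize the standard basis (Theorem 19)
v1, v2 = np.array([1., 0]), np.array([0., 1])
u1 = v1
u2 = v2 - (v2 @ A @ u1) / (u1 @ A @ u1) * u1
print(u1 @ A @ u2)             # 0: u1 and u2 are A-orthogonal

# Recover the solution from Eqns. (24) and (25)
x = (b @ u1) / (u1 @ A @ u1) * u1 + (b @ u2) / (u2 @ A @ u2) * u2
print(x)                       # [3. 4.]
print(np.linalg.solve(A, b))   # same solution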

Now we continue the discussion of using a set of A-orthogonal vectors to find the solution to Eqn. (23). Since an A-orthogonal basis is not available at the outset, our goal is to generate it on the fly. Suppose we start with an initial guess of the solution \(\boldsymbol{x}_0=\boldsymbol{0}\). The corresponding residual is \(\boldsymbol{u}_0 = \boldsymbol{r}_0=\boldsymbol{b}-A\boldsymbol{x}_0=\boldsymbol{b}\), which serves as the first vector in the set of \(n\) A-orthogonal vectors. Now, we update the estimate of the solution by Eqns. (24) and (25):

\[\begin{equation*} \boldsymbol{x}_1 = \boldsymbol{x}_0 + a_0 \boldsymbol{u}_0 \end{equation*}\]

where

\[\begin{equation*} a_0 = \frac{\boldsymbol{u}_0^T \boldsymbol{b}}{\boldsymbol{u}_0^TA\boldsymbol{u}_0} = \frac{\boldsymbol{u}_0^T \boldsymbol{r}_0}{\boldsymbol{u}_0^TA\boldsymbol{u}_0}. \end{equation*}\]

In the second step, update the residual:

\[\begin{equation*} \boldsymbol{r}_1 = \boldsymbol{b}-A\boldsymbol{x}_1. \end{equation*}\]

Note that \(\boldsymbol{u}_0\) and \(\boldsymbol{r}_1\) are not A-orthogonal in general, so we apply the A-orthogonal Gram-Schmidt step of Theorem 19 to \(\boldsymbol{r}_1\):

\[\begin{equation*} \boldsymbol{u}_1 = \boldsymbol{r}_1 - \frac{\boldsymbol{r}_1^TA\boldsymbol{u}_0}{\boldsymbol{u}_0^TA\boldsymbol{u}_0}\boldsymbol{u}_0. \end{equation*}\]

Now \(\boldsymbol{u}_0\) and \(\boldsymbol{u}_1\) are a set of two A-orthogonal vectors. Again, update the estimated solution by Eqns. (24) and (25):

\[\begin{equation*} \boldsymbol{x}_2 = \boldsymbol{x}_1 + a_1 \boldsymbol{u}_1 \end{equation*}\]

where

\[\begin{equation*} a_1 = \frac{\boldsymbol{u}_1^T \boldsymbol{b}}{\boldsymbol{u}_1^TA\boldsymbol{u}_1} = \frac{\boldsymbol{u}_1^T \boldsymbol{r}_1}{\boldsymbol{u}_1^TA\boldsymbol{u}_1}. \end{equation*}\]

The second equality holds since

\[\begin{equation*} \boldsymbol{u}_1^T \boldsymbol{r}_1 = \boldsymbol{u}_1^T(\boldsymbol{b}-A\boldsymbol{x}_1) = \boldsymbol{u}_1^T\boldsymbol{b} - \boldsymbol{u}_1^TA\boldsymbol{x}_1. \end{equation*}\]

Note that \(\boldsymbol{x}_1\) only involves vectors that are A-orthogonal to \(\boldsymbol{u}_1\), so the last term in the equation above vanishes.

The algorithm proceeds in this manner and converges in at most \(n\) steps, once a complete set of \(n\) A-orthogonal vectors (directions) has been found. In addition, it is not necessary to start from a zero initial guess \(\boldsymbol{x}_0=\boldsymbol{0}\). If \(\boldsymbol{x}_0\ne\boldsymbol{0}\), the algorithm actually solves the linear system \(A\hat{\boldsymbol{x}}=\boldsymbol{b}-A\boldsymbol{x}_0\), and \(\hat{\boldsymbol{x}}+\boldsymbol{x}_0\) is the solution to the original linear system.

The algorithm can be summarized as follows.

Algorithm 1

  1. Initialize \(\boldsymbol{x}_0\) as any vector. Set \(\boldsymbol{r}_0=\boldsymbol{b}-A\boldsymbol{x}_0\) and \(\boldsymbol{u}_0=\boldsymbol{r}_0\).

  2. For \(k=0, 1, \dots, n-1\):

    1. \(a_k=\frac{\boldsymbol{u}_k^T \boldsymbol{r}_k}{\boldsymbol{u}_k^TA\boldsymbol{u}_k}\)

    2. \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + a_k \boldsymbol{u}_k\)

    3. \(\boldsymbol{r}_{k+1} = \boldsymbol{b}-A\boldsymbol{x}_{k+1}\)

    4. if (\(||\boldsymbol{r}_{k+1}|| < \epsilon\)):

      1. break

    5. \(\boldsymbol{u}_{k+1} = \boldsymbol{r}_{k+1} - \sum_{i\le k}\frac{\boldsymbol{r}_{k+1}^TA\boldsymbol{u}_{i}}{\boldsymbol{u}_{i}^TA\boldsymbol{u}_{i}}\boldsymbol{u}_{i}\)

  3. return \(\boldsymbol{x}_{k+1}\).

Here \(\epsilon\) is a tolerance: if the norm of the residual falls below it, the algorithm is considered to have converged. The algorithm is implemented in the following Python function ConjGrad_v1.

# Conjugate Gradient Method
def ConjGrad_v1(A, b, x0, eps):
    """
    Perform conjugate gradient method, the first version.
    A: a symmetric positive definite matrix
    b: the right-hand side
    x0: the initial guess
    eps: the tolerance for stopping the algorithm
    """
    
    if not np.array_equal(A.T, A):
        print('Error: the matrix A is not symmetric!')
        return
        
    n = A.shape[0]
    r = np.zeros((n,n+1)) # The ith column stores r_i  
    u = np.zeros((n,n+1)) # The ith column stores u_i
    r[:,0] = b-np.dot(A, x0)
    if np.linalg.norm(r[:,0]) < eps:
        return x0
    u[:,0] = r[:,0]
    x = x0.copy()
    
    for k in range(n):
        a = np.dot(u[:,k], r[:,k])/np.dot(u[:,k], np.dot(A,u[:,k]))
        x += a*u[:,k]
        r[:,k+1] = b - np.dot(A, x)
        if np.linalg.norm(r[:,k+1]) < eps:
            return x
        # A-orthogonalize r_{k+1} against all previous directions (Gram-Schmidt)
        proj = np.zeros(n)
        for i in range(k+1):
            proj += np.dot(r[:,k+1], np.dot(A, u[:,i]))/np.dot(u[:,i], np.dot(A, u[:,i]))*u[:,i]
        u[:,k+1] = r[:,k+1] - proj
        
    return x

Example 31

Use Conjugate Gradient to solve the following \(2\times 2\) linear system:

\[\begin{equation*} \begin{bmatrix} 4 & -2 \\ -2 & 10 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 4 \\ 34 \end{bmatrix}. \end{equation*}\]
A = np.array([[4., -2], [-2, 10]])
b = np.array([4., 34])
x0 = np.zeros(2)
eps = 1e-9

sol = ConjGrad_v1(A, b, x0, eps)
print(sol)
[3. 4.]

We take a closer look at the process using the following animation. Both the \(\boldsymbol{u}\) and \(\boldsymbol{r}\) directions are shown, along with the CG iterate and the true solution.

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
# Conjugate Gradient Method for the animation
def ConjGrad(A, b, x0, nstep):
    """
    Run a fixed number of conjugate gradient steps (a variant used for the animation).
    A: a symmetric positive definite matrix
    b: the right-hand side
    x0: the initial guess
    nstep: the number of steps of conjugate gradient
    """
    
    if not np.array_equal(A.T, A):
        print('Error: the matrix A is not symmetric!')
        return
        
    n = A.shape[0]
    r_old = b-np.dot(A, x0)
    r_new = r_old.copy()
    u = r_old
    x = x0.copy()
    
    for k in range(nstep):
        a = np.dot(r_old, r_old)/np.dot(u, np.dot(A,u))
        x += a*u
        r_new = r_old - a*np.dot(A, u)
        # renamed beta to avoid shadowing the right-hand side b
        beta = np.dot(r_new, r_new)/np.dot(r_old, r_old)
        u = r_new + beta*u
        r_old = r_new.copy()
    a = np.dot(r_old, r_old)/np.dot(u, np.dot(A,u))
    return x, r_new, u, a

# Animation for the conjugate gradient method
A = np.array([[4., -2], [-2, 10]])
b = np.array([4., 34])
x0 = np.array([0.0, 0.0])
true_sol = np.linalg.solve(A, b)
fig, ax = plt.subplots(figsize=(12,5))
def animate(i):
    ax.cla()
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    x, r, u, a = ConjGrad(A, b, x0, i)
    ax.plot(x[0], x[1], 'o', ms=8, label='CG')
    ax.quiver([x[0]], [x[1]], [a*u[0]], [a*u[1]], angles='xy', scale_units='xy', scale=1, 
              color='r', label='u direction')
    ax.quiver([x[0]], [x[1]], [r[0]], [r[1]], angles='xy', 
              color='c', label='r direction')
    ax.plot(true_sol[0], true_sol[1], 's', color='green', ms=8, label='True solution')
    ax.axhline()
    ax.axvline()
    ax.set_title('Conjugate gradient method: iteration step '+str(i), size=16)
    ax.set_xlabel('$x_1$', size=16)
    ax.set_ylabel('$x_2$', size=16)
    ax.legend(loc='upper right')
plt.close()
anim = FuncAnimation(fig, animate, frames=3, interval=2000)
HTML(anim.to_jshtml())

Example 32

Solve the linear system shown in Example 29 using the conjugate gradient method.

Solution.

A = np.array([[1., -3, -2], [-3, 10, 9], [-2, 9, 29]])
b = np.array([0., -5, -47])
x0 = np.zeros(3)
eps = 1e-9

sol = ConjGrad_v1(A, b, x0, eps)
print(sol)
[-1.  1. -2.]

Note that the above conjugate gradient algorithm requires storage for all the vectors \(\boldsymbol{r}_k\) and \(\boldsymbol{u}_k\), which is very inefficient for large \(n\). The following modification of the conjugate gradient algorithm keeps only the most recent vectors and thus avoids this storage. It can be shown that the two algorithms are equivalent.

Algorithm 2 (Conjugate Gradient)

  1. Initialize \(\boldsymbol{x}_0\) as any vector. Set \(\boldsymbol{r}_0=\boldsymbol{b}-A\boldsymbol{x}_0\) and \(\boldsymbol{u}_0=\boldsymbol{r}_0\).

  2. For \(k=0, 1, \dots, n-1\):

    1. \(a_k=\frac{\boldsymbol{r}_k^T \boldsymbol{r}_k}{\boldsymbol{u}_k^TA\boldsymbol{u}_k}\)

    2. \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + a_k \boldsymbol{u}_k\)

    3. \(\boldsymbol{r}_{k+1} = \boldsymbol{r}_{k}-a_kA\boldsymbol{u}_{k}\)

    4. if (\(||\boldsymbol{r}_{k+1}|| < \epsilon\)):

      1. break

    5. \(b_k = \frac{\boldsymbol{r}_{k+1}^T\boldsymbol{r}_{k+1}}{\boldsymbol{r}_{k}^T\boldsymbol{r}_{k}}\)

    6. \(\boldsymbol{u}_{k+1} = \boldsymbol{r}_{k+1} + b_k\boldsymbol{u}_{k}\)

  3. return \(\boldsymbol{x}_{k+1}\).

We implement the algorithm in the following Python function ConjGrad_v2.

# Conjugate Gradient Method
def ConjGrad_v2(A, b, x0, eps):
    """
    Perform conjugate gradient method, the second version.
    A: a symmetric positive definite matrix
    b: the right-hand side
    x0: the initial guess
    eps: the tolerance for stopping the algorithm
    """
    
    if not np.array_equal(A.T, A):
        print('Error: the matrix A is not symmetric!')
        return
        
    n = A.shape[0]
    r_old = b-np.dot(A, x0)
    r_new = r_old.copy()
    if np.linalg.norm(r_old) < eps:
        return x0
    u = r_old
    x = x0.copy()
    
    for k in range(n):
        a = np.dot(r_old, r_old)/np.dot(u, np.dot(A,u))
        x += a*u
        r_new = r_old - a*np.dot(A, u)
        if np.linalg.norm(r_new) < eps:
            return x
        # b_k in Algorithm 2; named beta to avoid shadowing the right-hand side b
        beta = np.dot(r_new, r_new)/np.dot(r_old, r_old)
        u = r_new + beta*u
        r_old = r_new.copy()
        
    return x

Applying the new code to Example 32 generates the same result.

A = np.array([[1., -3, -2], [-3, 10, 9], [-2, 9, 29]])
b = np.array([0., -5, -47])
x0 = np.zeros(3)
eps = 1e-9

sol = ConjGrad_v2(A, b, x0, eps)
print(sol)
[-1.  1. -2.]

Exercise 3.3-5

Show that \(\boldsymbol{u}^TA\boldsymbol{v}\) in Definition 11 defines an inner product.

Exercise 3.3-6

Prove Eqn. (25).