The CG method was originally designed to minimize convex quadratic functions but, through several variations, has been extended to the general case . The first iteration in CG is the same as in SD, but successive directions are constructed so that they form a set of mutually conjugate vectors with respect to the (positive-definite) Hessian of a general convex quadratic function . Whereas the rate of convergence for SD depends on the ratio of the extremal eigenvalues of , the convergence properties of CG depend on the entire matrix spectrum. Faster convergence is expected when the eigenvalues are clustered. In exact arithmetic, convergence is obtained in at most steps. In particular, if has distinct eigenvalues, convergence to a solution requires iterations.
For example, when the bound on convergence measures the size of with respect to the -norm,
we have 
Clearly rapid convergence is expected when , as for SD. Further estimates of convergence bounds can only be derived when certain properties about the eigenvalue distribution are known (e.g., large eigenvalues and small eigenvalues clustered in a region ) .
When one refers to the CG method, one often means the linear Conjugate Gradient; that is, the implementation for the convex quadratic form. In this case, minimizing is equivalent to solving the linear system . Consequently, the conjugate directions , as well as the step lengths , can be computed in closed form. Below we sketch such an algorithm from a given . We define the residual vector and use the vectors below to denote the CG search vectors.
Note here that only a few vectors are stored, the product is required but not knowledge (or storage) of per se, and the cost only involves several scalar and vector operations. The value of the step length can be derived in the closed form above by minimizing the quadratic function as a function of (producing ) and then using the conjugacy relation: for all .
See the linear algebra chapter for further details and examples.
Minimization performance of CG is shown in Figures 10 and 14 for Rosenbrock's and Beale's functions, respectively. Note that performance for both functions with CG is better than SD, as expected, but the paths are characteristically more `random'. For Rosenbrock's function, for example, a large step is taken between the fourth and fifth iterates, where a distant minimum was detected in the line search. The step length varies considerably from one iteration to the next. Similarly, for Beale's function, the step lengths vary from to 800.