Sparse.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Sparse matrices in R</title>
  </head>

  <body>
    <h1>RFC on Sparse matrices in R</h1>
    <p>
      <a href="mailto:rkoenker@uiuc.edu">Roger Koenker</a> and <a
      href="mailto:Pin.Ng@NAU.edu">Pin Ng</a> have provided a sparse
      matrix implementation for R in the <a
      href="http://www.econ.uiuc.edu/~roger/research/sparse/sparse.html">SparseM</a>
      package, which is based on Fortran code in <a
      href="http://www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html">sparskit</a>
      and a modified version of the sparse Cholesky factorization
      written by Esmond Ng and Barry Peyton.  The modified version is
      distributed as part of PCx by Czyzyk, Mehrotra, Wagner, and
      Wright and is copywrite by the University of Chicago.
    </p>
    <p>
      Recently I become very interested in certain sparse matrix
      calculations myself and have looked at some of the available
      Open Source software for the sparse Cholesky decomposition.
      While I certainly appreciate the work that Roger and Pin have
      done I will propose a slightly different implementation.
    </p>
    <h2>Representations of sparse matrices</h2>
    <p>
      Conceptually, the simplest representation of a sparse matrix is
      as a triplet of an integer vector <i>i</i> giving the row
      numbers, an integer vector <i>j</i> giving the column numbers,
      and a numeric vector <i>x</i> giving the non-zero values in the
      matrix.  An S4 class definition might be
    </p>
    <pre>
# Sparse general matrix in triplet format
setClass("tripletMatrix",
         representation(i = "integer", j = "integer", x = "numeric",
                        Dim = "integer"))
    </pre>
    <p>
      The triplet representation is row-oriented if elements in
      the same row were adjacent and column-oriented if elements in the
      same column were adjacent.  The compressed sparse row (csr)
      (or compressed sparse column - csc) representation is
      similar 
      to row-oriented triplet (column-oriented triplet) except that
      <i>i</i> (<i>j</i>) just stores the index of the first element
      in the row (column).  (There are a couple of other details but
      that is the gist of it.)  These compressed representations
      remove the redundant row (column) indices and provide faster
      access to a given location in the matrix because you only need
      to check one row (column).
    </p>
    <p>
      The preferred representation of sparse matrices in the SparseM
      package is csr.  <a href="http://www.mathworks.com/">Matlab</a>
      uses csc.  We hope that <a
      href="http://www.octave.org/">Octave</a> will also use this
      representation. There are certain advantages to csc in systems
      like R and Matlab where dense matrices are stored in
      column-major order.  For example, Sivan Toledo's <a
      href="http://www.tau.ac.il/~stoledo/taucs">TAUCS</a> library and
      Tim Davis's <a
      href="http://www.cise.ufl.edu/research/sparse/umfpack">UMFPACK</a>
      library are both based on csc and can both use level-3 BLAS in
      certain sparse matrix computations.
    </p>
    <p>
      I feel that compatibility with Matlab (and, we hope, Octave), the
      ability to use level-3 BLAS, and the availability of the
      csc-based TAUCS,
      UMFPACK, and <a
	href="http://www.cise.ufl.edu/research/sparse/amd">AMD</a>
      libraries favors csc as the preferred sparse matrix representation.
    </p>
    <h2>Applications of sparse matrices</h2>
    <p>
      I imagine that the main applications of sparse matrices in R will
      be for parameter estimation in very large linear models and for
      large sparse contingency tables.
    </p>
    <p>
      As Roger and Pin have pointed out, the key to estimating
      parameters in large linear models quickly and with minimal
      storage requirements will be in providing a way for
      <code>model.matrix</code> to generate a sparse model matrix
      <code>X</code> or a sparse symmetric representation of
      <code>X'X</code> and <code>X'y</code>.
    </p>
    <p>
      Assuming that we have a sparse representation of the model
      matrix as <code>mm</code> and a sparse or dense representation
      of the response as <code>y</code>, the coefficients can be
      estimated as
    </p>
    <pre>
solve(crossprod(mm), crossprod(mm, y))
    </pre>
    <p>
      I think that the multifrontal sparse Cholesky in the TAUCS
      library is one of the best currently available ways to do this
      and have implemented a solution based on that in the
      <code>taucs</code> package for R.  I use the approximate minimal
      degree ordering determined by Tim Davis's AMD library to reduce
      fill-in.
    </p>
    <p>
      For statistical analysis of a linear model we probably also want
      at least the standard errors of the coefficient estimates which
      means we want an inverse of the Cholesky factor.  TAUCS has an
      inverse factorization routine <code>taucs_ccs_factor_xxt</code>
      that can provide a sparse representation of the inverse.  I
      think that we want to use that for a linear model analysis.  We
      can use the multifrontal solver if we only want coefficients.
    </p>
    <p>
      When working with linear models there will be a tradeoff between
      the speed boost available by reordering rows and columns and the
      statistical information available in the original ordering of
      the rows and columns.  For example, the simplest way to
      determine the sequential sums of squares of the terms in the
      model is to maintain the column ordering in <code>X</code> but
      that could result in dramatic amounts of fill-in for the sparse
      Cholesky and especially for the inverse factorization.  I think
      it is best to compromise and obtain the inverse factorization of
      the reordered matrix.  This can provide standard errors and
      correlations of coefficients but not the sequential sums of
      squares.  (At least I don't know how to get them from the
      reordered matrix.)
    </p>
    <p>
      Sparse contingency tables can be easily constructed and
      manipulated.  I understand that <a
	href="mailto:kurt.hornik@r-project.org">Kurt Hornik</a> would
      like to use them but I'm not sure exactly what operations he needs.
    </p>
    <p>
      I have a hybrid application involving large linear mixed models
      with <a
      href="http://www.stat.wisc.edu/~bates/reports/MixedComp.pdf">partially
      crossed grouping factors</a>.  For these I need to manipulate
      both sparse contingency tables and some associated sparse positive
      definite matrices.
    </p>
    <h2>Utilities for sparse matrices</h2>
    <p>
      TAUCS has a convenient C struct for the csc representation of a
      matrix.  I have written functions to transfer from an S4 object
      to the TAUCS struct and back.  TAUCS also has routines for
      multiplication by dense matrices and for symmetric permutation
      of rows and columns (needed in the Cholesky factorization
      routines).
    </p>
    <p>
      UMFPACK is a set of routines for solving unsymmetric sparse
      linear systems with the Unsymmetric MultiFrontal method.  It has
      a couple of very convenient routines for switching between csc
      and a triplet representation.  The triplet to csc converter is
      quite general in that it allows redundant triplet
      representations (more than one entry for the same position -
      multiple entries have their values summed) and arbitrary
      ordering.  This allows convenient creation of sparse contingency
      tables (build up the triplet representation then compress it).
      As described in the UMFPACK documentation, it also allows simple
      ways to write operations like transposition of matrices (convert
      csc to triplet, interchange <i>i</i> and <i>j</i>, convert back
      to csc).
    </p>
    <p>
      As a side note, it appears that the UMFPACK/AMD form of the csc
      representation is more strict than the TAUCS representation.  If
      one applies <code>taucs_ccs_permute_symmetrically</code> to a
      csc matrix (in TAUCS these are called ccs) the result does not
      have the rows in increasing order within each column.  AMD
      doesn't like this and I find it confusing when trying to examine
      the matrix.  Again the csc to triplet to csc conversion can be
      used to remove this problem.
    </p>
    <h2>Licenses</h2>
    <p>
      TAUCS is available under the LGPL.  UMFPACK is <a
      href="http://www.gnu.org/directory/science/math/umfpack.html">official
      GNU software</a> covered by what is described as a 2-clause BSD
      license.  I'm not sure exactly what that means.  AMD is covered
      by a similar license and I think may be considered part of
      UMFPACK in the sense of also being official GNU software.  The
      tarball for UMFPACK contains AMD as you need AMD to be able to
      use UMFPACK.
    </p>
    <h2>Proposed plan</h2>
    <ul>
      <li>
	Build a library using S4 classes and methods for the csc
	representation and based on the TAUCS code for the sparse
	Cholesky plus AMD ordering.  I would include at least the csc
	to triplet utility routines from UMFPACK.  Depending upon
	whether there was substantial demand for unsymmetric (LU)
	sparse matrix factorization and solution of systems, we could
	bring in the whole of UMFPACK or look at the unsymmetric
	solvers in TAUCS.
      </li>
      <li>
	Modify <code>model.matrix</code> to produce a sparse
	representation of <code>X</code> or of <code>X'X</code> and
	<code>X'y</code>.  It would be convenient to get a sparse
	representation of the model matrix but the big payoff would be
	in getting the crossproduct matrix.  However, it is difficult
	to decide where the non-zero elements in the crossproduct
	matrix are when you are sequentially examining the rows of
	<code>X</code>.  It may be possible to perform the operation
	on chunks of rows and use the csc to triplet to csc
	transformation if new non-zero elements are found while
	processing a chunk.
      </li>
    </ul>


    <hr>
    <address><a href="mailto:bates@stat.wisc.edu">Douglas Bates</a></address>
<!-- Created: Tue Oct 21 13:45:49 CDT 2003 -->
<!-- hhmts start -->
Last modified: Tue Oct 21 16:11:58 CDT 2003
<!-- hhmts end -->
  </body>
</html>