Doc adjustments #169

Merged · 11 commits · Nov 9, 2021
40 changes: 23 additions & 17 deletions docs/src/affine.md
# Affine Transformations

It's very common for measures to use parameters `μ` and `σ`, for example as in `Normal(μ=3, σ=4)` or `StudentT(ν=1, μ=3, σ=4)`. In this context, `μ` and `σ` need not always refer to the mean and standard deviation (the `StudentT` measure specified above is equivalent to a [Cauchy](https://en.wikipedia.org/wiki/Cauchy_distribution) measure, so both mean and standard deviation are undefined).

In general, `μ` is a "location parameter", and `σ` is a "scale parameter". Together these parameters determine an affine transformation.

```math
f(z) = σ z + μ
```

Starting with the above definition, we'll use ``z`` to represent an "un-transformed" variable, typically coming from a measure which has neither a location nor a scale parameter, for example `Normal()`.
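As a quick sketch of what this transformation does (the particular values `μ = 3`, `σ = 4` are just for illustration), applying ``f(z) = σ z + μ`` to standard-normal draws shifts the location to `μ` and stretches the spread by `σ`:

```julia
using Random, Statistics

Random.seed!(42)
μ, σ = 3.0, 4.0
z = randn(100_000)   # "un-transformed" draws, as from Normal()
x = σ .* z .+ μ      # draws from a Normal with location 3 and scale 4

println(mean(x))     # close to μ = 3
println(std(x))      # close to σ = 4
```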

Affine transformations are often mistakenly referred to as "linear". In fact, an affine transformation is ["the composition of two functions: a translation and a linear map"](https://en.wikipedia.org/wiki/Affine_transformation#Representation). For a function ``f`` to be linear requires ``f(ax + by) = a f(x) + b f(y)`` for scalars ``a`` and ``b``. For an affine function ``f(z) = σ z + μ``, linearity holds only if ``μ = 0``.
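This is easy to check numerically (the particular values of `a`, `b`, `x`, and `y` below are arbitrary):

```julia
f(z; μ, σ) = σ * z + μ

a, b, x, y = 2.0, 5.0, 1.0, -3.0

# With μ ≠ 0, the linearity identity fails:
lhs = f(a * x + b * y; μ = 3.0, σ = 4.0)
rhs = a * f(x; μ = 3.0, σ = 4.0) + b * f(y; μ = 3.0, σ = 4.0)
println(lhs == rhs)    # false

# With μ = 0, f is a linear map and the identity holds:
lhs0 = f(a * x + b * y; μ = 0.0, σ = 4.0)
rhs0 = a * f(x; μ = 0.0, σ = 4.0) + b * f(y; μ = 0.0, σ = 4.0)
println(lhs0 == rhs0)  # true
```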


## Cholesky-based parameterizations

If the "un-transformed" `z` is univariate, things are relatively simple. But it's important our approach handle the multivariate case as well.

In the literature, it's common for a multivariate normal distribution to be parameterized by a mean `μ` and covariance matrix `Σ`. This is mathematically convenient, but less ideal for efficient computation.

While MeasureTheory.jl includes (or will include) a parameterization using `Σ`, we prefer to work in terms of its Cholesky decomposition ``σ``.

The relationship between the computationally efficient ``σ`` and the more familiar parameterization `Σ` can be seen as follows. Let ``σ`` be a lower-triangular matrix satisfying

```math
σ σᵗ = Σ
```

Then given a (multivariate) standard normal ``z``, the covariance matrix of ``σ z + μ`` is

```math
𝕍[σ z + μ] = Σ
```
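A small hand-rolled check of this relationship (the covariance matrix below is made up for illustration):

```julia
using LinearAlgebra, Random, Statistics

Σ = [4.0 2.0; 2.0 3.0]    # a positive-definite covariance matrix
σ = cholesky(Σ).L         # its lower-triangular Cholesky factor
@assert σ * σ' ≈ Σ        # σ σᵗ = Σ

Random.seed!(1)
μ = [1.0, -2.0]
z = randn(2, 100_000)     # columns are standard-normal draws
x = σ * z .+ μ            # transformed draws
println(cov(x; dims = 2)) # sample covariance, close to Σ
```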

The one-dimensional case where we have

```math
𝕍[σ z + μ] = σ²
```

shows that the lower Cholesky factor of the covariance generalizes the concept of standard deviation, completing the link between ``σ`` and `Σ`.

## The "Cholesky precision" parameterization

The ``(μ,σ)`` parameterization is especially convenient for random sampling. A sample `z ~ Normal()` determines an `x ~ Normal(μ,σ)` through the affine transformation

```math
x = σ z + μ
```

The log-density computation for a `Normal` with parameters ``μ`` and ``σ`` is not quite so direct. Starting with an ``x``, we need to find ``z`` using

```math
z = σ⁻¹ (x - μ)
```

leading to

```julia
logdensity(d::Normal{(:μ,:σ)}, x) = logdensity(d.σ \ (x - d.μ)) - logdet(d.σ)
```

Here the `- logdet(σ)` is the "log absolute Jacobian", required to account for the stretching of the space.
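In the scalar case, the Jacobian correction reduces to `-log(σ)`. Here is a hand-rolled sketch of the computation (the values of `μ`, `σ`, and `x` are arbitrary, and `logdensity_std` is a stand-in for the standard-normal log-density, not MeasureTheory.jl's implementation):

```julia
logdensity_std(z) = -z^2 / 2 - log(2π) / 2   # standard normal log-density

μ, σ, x = 3.0, 4.0, 5.0
z = σ \ (x - μ)                    # = (x - μ) / σ in the scalar case
ld = logdensity_std(z) - log(σ)    # log-density of Normal(μ, σ) at x

# Compare against the closed-form Normal(μ, σ) log-density:
ld_closed = -((x - μ)^2) / (2σ^2) - log(σ) - log(2π) / 2
@assert ld ≈ ld_closed
```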

The above requires solving a linear system, which adds some overhead. Even with the convenience of a lower triangular system, it's still not quite as efficient as multiplication.

In addition to the covariance ``Σ``, it's also common to parameterize a multivariate normal by its _precision matrix_, defined as the inverse of the covariance matrix, ``Ω = Σ⁻¹``. Similar to our use of ``σ`` for the lower Cholesky factor of `Σ`, we'll use ``ω`` for the lower Cholesky factor of ``Ω``.

This parameterization enables more efficient calculation of the log-density using only multiplication and addition,

```julia
logdensity(d::Normal{(:μ,:ω)}, x) = logdensity(d.ω * (x - d.μ)) + logdet(d.ω)
```
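A hand-rolled sketch comparing the two parameterizations (the matrices are made up, and note we standardize with the transpose ``ωᵗ`` so that ``‖ωᵗ(x - μ)‖²`` recovers the quadratic form ``(x - μ)ᵗ Ω (x - μ)``; conventions for which factor to apply vary):

```julia
using LinearAlgebra

logdensity_std(z) = -dot(z, z) / 2 - length(z) * log(2π) / 2

Σ = [4.0 2.0; 2.0 3.0]
σ = cholesky(Σ).L
ω = cholesky(inv(Σ)).L   # lower Cholesky factor of the precision Ω

μ = [1.0, -2.0]
x = [0.5, 0.5]

ld_σ = logdensity_std(σ \ (x - μ)) - logdet(σ)   # triangular solve
ld_ω = logdensity_std(ω' * (x - μ)) + logdet(ω)  # multiplication only
@assert ld_σ ≈ ld_ω
```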

## `AffineTransform`

Transforms like ``z → σ z + μ`` and ``z → ω \ z + μ`` can be specified in MeasureTheory.jl using an `AffineTransform`. For example,

```julia
julia> f = AffineTransform((μ=3.,σ=2.))
julia> f(1.0)
5.0
```
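To make the idea concrete, here is a minimal sketch of what such a callable transform could look like. This is illustrative only; `MyAffineTransform` is a hypothetical name, not MeasureTheory.jl's actual definition:

```julia
# A hypothetical stand-in for AffineTransform, parameterized by its
# named-tuple keys so dispatch can select the right formula.
struct MyAffineTransform{N,T}
    par::NamedTuple{N,T}
end

(f::MyAffineTransform{(:μ, :σ)})(z) = f.par.σ * z + f.par.μ
(f::MyAffineTransform{(:μ, :ω)})(z) = f.par.ω \ z + f.par.μ

f = MyAffineTransform((μ = 3.0, σ = 2.0))
println(f(1.0))   # 5.0

g = MyAffineTransform((μ = 3.0, ω = 2.0))
println(g(1.0))   # 3.5
```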

In the univariate case this is relatively simple to invert. But if `σ` is a matrix, inverting the transform naively would require a matrix inversion, which is costly and, when `σ` is singular, does not even exist.

With multiple parameterizations of a given family of measures, we can work around these issues. The inverse transform of a ``(μ,σ)`` transform will be in terms of ``(μ,ω)``, and vice-versa. So

```julia
julia> f⁻¹ = inv(f)
```
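The round trip can be sketched by hand (matrix values made up; this mimics the idea, not MeasureTheory.jl's code). Inverting ``x = σ z + μ`` gives ``z = σ⁻¹(x - μ)``, which is itself affine in ``(μ, ω)`` form with ``ω = σ``, so only triangular solves are needed:

```julia
using LinearAlgebra

σ = LowerTriangular([2.0 0.0; 1.0 3.0])
μ = [3.0, -1.0]

f(z) = σ * z + μ      # a (μ, σ) transform

# Its inverse, written as a (μ, ω) transform with ω = σ:
ω = σ
μ′ = -(ω \ μ)
f⁻¹(x) = ω \ x + μ′   # triangular solves only, no explicit matrix inverse

z = [0.5, -0.5]
@assert f⁻¹(f(z)) ≈ z
```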