The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.
Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.
For $k = 1, \dots, K$, let $z_k = f_k(z_{k-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \dots, f_K$ should be invertible, i.e. the inverse function $f_k^{-1}$ exists. The final output $z_K$ models the target distribution. By the change of variables, the log likelihood of $z_K$ is
$$\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log\left|\det\frac{\partial f_k(z_{k-1})}{\partial z_{k-1}}\right|.$$
To efficiently compute the log likelihood, the functions $f_1, \dots, f_K$ should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions $f_k$ are modeled using deep neural networks and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,[4] RealNVP,[5] and Glow.[6]
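As an illustration (not taken from the cited references), the following sketch composes two toy invertible maps, an elementwise affine map and a leaky ReLU, and accumulates their log-Jacobian determinants under a standard normal base distribution:

```python
# Illustrative sketch: change-of-variables log-likelihood for a composed flow.
# The two layers below are toy stand-ins, not architectures from the cited papers.
import numpy as np
from scipy.stats import multivariate_normal

def affine_fwd(z, a, b):
    """Elementwise affine map x = a*z + b (a nonzero); returns (x, log|det J|)."""
    return a * z + b, np.sum(np.log(np.abs(a)))

def leaky_relu_fwd(z, s=0.1):
    """Invertible elementwise nonlinearity; returns (x, log|det J|)."""
    x = np.where(z >= 0, z, s * z)
    return x, np.sum(np.where(z >= 0, 0.0, np.log(s)))

def flow_log_likelihood(z0, layers):
    """log p_K(z_K) = log p_0(z_0) - sum_k log|det J_k|, with a standard normal p_0."""
    z, logdet_sum = z0, 0.0
    for layer in layers:
        z, logdet = layer(z)
        logdet_sum += logdet
    log_p0 = multivariate_normal(mean=np.zeros(len(z0))).logpdf(z0)
    return z, log_p0 - logdet_sum   # sample z_K and its model log-density

z0 = np.array([0.3, -1.2])
layers = [lambda z: affine_fwd(z, a=np.array([2.0, 0.5]), b=np.array([1.0, -1.0])),
          leaky_relu_fwd]
x, log_px = flow_log_likelihood(z0, layers)
# (to score a given data point x instead, one would apply the inverse maps to recover z0)
```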
As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting $p_\theta$ the model's likelihood and $p^*$ the target distribution to learn, the (forward) KL-divergence is
$$D_{KL}\bigl[p^*(x)\,\|\,p_\theta(x)\bigr] = -\mathbb{E}_{p^*(x)}\bigl[\log p_\theta(x)\bigr] + \mathbb{E}_{p^*(x)}\bigl[\log p^*(x)\bigr].$$
The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which only leaves the expectation of the negative log-likelihood under the target distribution to minimize. This intractable term can be approximated with a Monte Carlo estimate: if we have a dataset $\{x_i\}_{i=1}^{N}$ of samples drawn independently from the target distribution $p^*(x)$, then this term can be estimated as
$$-\hat{\mathbb{E}}_{p^*(x)}\bigl[\log p_\theta(x)\bigr] = -\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i).$$
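In code, this estimate is just an average over the dataset; the sketch below assumes a hypothetical routine `log_likelihood(x, theta)` that evaluates $\log p_\theta(x)$ for the chosen flow architecture:

```python
import numpy as np

def nll_loss(data, log_likelihood, theta):
    """Monte-Carlo estimate of -E_{p*}[log p_theta(x)]; `log_likelihood` is hypothetical."""
    return -np.mean([log_likelihood(x, theta) for x in data])
```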
The planar flow is the earliest example.[9] Fix some activation function $h$, and let $\theta = (u, w, b)$ with the appropriate dimensions; then
$$x = f_\theta(z) = z + u\, h(\langle w, z\rangle + b).$$
The inverse $f_\theta^{-1}$ has no closed-form solution in general.
The Jacobian determinant is $\left|\det\bigl(I + h'(\langle w, z\rangle + b)\,u w^T\bigr)\right| = \left|1 + h'(\langle w, z\rangle + b)\,\langle u, w\rangle\right|$.
For the map to be invertible everywhere, the Jacobian determinant must be nonzero everywhere. For example, $h = \tanh$ with $\langle u, w\rangle > -1$ satisfies the requirement.
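A minimal sketch of a single planar-flow layer with $h = \tanh$, returning the transformed point together with $\log\lvert\det J\rvert$; the parameter choice below simply enforces $\langle u, w\rangle > -1$:

```python
import numpy as np

def planar_forward(z, u, w, b):
    """x = z + u*tanh(<w,z> + b); returns (x, log|det J|), det J = 1 + h'(.)<u,w>."""
    a = np.dot(w, z) + b
    x = z + u * np.tanh(a)
    h_prime = 1.0 - np.tanh(a) ** 2
    return x, np.log(abs(1.0 + h_prime * np.dot(u, w)))

rng = np.random.default_rng(0)
z, w, b = rng.normal(size=4), rng.normal(size=4), 0.1
u = 0.5 * w / np.dot(w, w)          # <u, w> = 0.5 > -1, so the layer is invertible
x, logdet = planar_forward(z, u, w, b)
```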
The Real Non-Volume Preserving model generalizes the NICE model by:[5]
$$x = f_\theta(z) = \bigl(z_1,\; e^{s_\theta(z_1)}\odot z_2 + m_\theta(z_1)\bigr),$$
where $z = (z_1, z_2)$ is split in the middle, $\odot$ denotes elementwise multiplication, and $s_\theta, m_\theta$ are neural networks.
Its inverse is $z_1 = x_1,\; z_2 = e^{-s_\theta(x_1)}\odot\bigl(x_2 - m_\theta(x_1)\bigr)$, and its Jacobian determinant is $\prod_i e^{s_\theta(z_1)_i}$. The NICE model is recovered by setting $s_\theta = 0$. Since the Real NVP map keeps the first and second halves of the vector separate, a permutation of the two halves is usually added after every Real NVP layer.
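A minimal sketch of one such coupling layer, with trivial stand-in functions for the scale and translation networks $s_\theta$ and $m_\theta$:

```python
import numpy as np

def s(z1): return 0.5 * np.tanh(z1)      # stand-in for the scale network s_theta
def m(z1): return z1 ** 2                # stand-in for the translation network m_theta

def coupling_forward(z):
    """x = (z1, exp(s(z1))*z2 + m(z1)); returns (x, log|det J|) with log|det J| = sum s(z1)."""
    z1, z2 = np.split(z, 2)
    return np.concatenate([z1, np.exp(s(z1)) * z2 + m(z1)]), np.sum(s(z1))

def coupling_inverse(x):
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, np.exp(-s(x1)) * (x2 - m(x1))])

z = np.array([0.3, -1.0, 2.0, 0.7])
x, logdet = coupling_forward(z)
assert np.allclose(coupling_inverse(x), z)
```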
In the generative flow (Glow) model,[6] each layer has 3 parts:
channel-wise affine transform $y_{cij} = s_c(x_{cij} + b_c)$, with Jacobian $\prod_c s_c^{HW}$;
invertible 1x1 convolution $z_{cij} = \sum_{c'} K_{cc'}\, y_{c'ij}$, with Jacobian $(\det K)^{HW}$. Here $K$ is any invertible matrix;
Real NVP, with Jacobian as described in Real NVP.
The idea of using the invertible 1x1 convolution is to mix all channels with a general (learned) permutation-like transform, instead of merely swapping the first and second halves, as in Real NVP.
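A minimal sketch of the invertible 1x1 convolution applied to a $C\times H\times W$ array; the kernel $K$ below is an arbitrary random orthogonal matrix, chosen only so that invertibility is guaranteed:

```python
import numpy as np

def invertible_1x1_conv(x, K):
    """Mix channels at every pixel: z_{cij} = sum_d K_{cd} x_{dij}; log|det J| = HW log|det K|."""
    C, H, W = x.shape
    z = np.einsum('cd,dhw->chw', K, x)
    return z, H * W * np.log(abs(np.linalg.det(K)))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
K = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # random orthogonal matrix, so log|det K| = 0
z, logdet = invertible_1x1_conv(x, K)
```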
Instead of constructing a flow by function composition, another approach is to formulate the flow as a continuous-time dynamic.[12][13] Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:
$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\,dt,$$
where $f$ is an arbitrary function and can be modeled with e.g. neural networks. The inverse function is then naturally
$$z_0 = F^{-1}(x) = x - \int_0^T f(z_t, t)\,dt,$$
and the log-likelihood of $x$ can be found as
$$\log p(x) = \log p(z_0) - \int_0^T \operatorname{Tr}\!\left[\frac{\partial f}{\partial z_t}\right]dt.$$
Since the trace depends only on the diagonal of the Jacobian $\partial_{z_t} f$, this allows a "free-form" Jacobian.[14] Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be only upper- or lower-triangular, so that the Jacobian determinant can be evaluated efficiently.
The trace can be estimated by "Hutchinson's trick":[15][16]
Given any matrix $W\in\mathbb{R}^{n\times n}$, and any random $u\in\mathbb{R}^n$ with $E[uu^T] = I$, we have $E[u^T W u] = \operatorname{tr}(W)$. (Proof: expand the expectation directly.)
Usually, the random vector $u$ is sampled from $N(0, I)$ (normal distribution) or $\{\pm 1\}^n$ (Rademacher distribution).
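A minimal sketch of the estimator, compared against the exact trace of a small test matrix; both the Rademacher and the Gaussian choice of $u$ satisfy $E[uu^T] = I$:

```python
import numpy as np

def hutchinson_trace(W, n_samples=10_000, rademacher=True, seed=0):
    """Estimate tr(W) as the average of u^T W u over random probe vectors u."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    u = (rng.choice([-1.0, 1.0], size=(n_samples, n)) if rademacher
         else rng.normal(size=(n_samples, n)))
    return np.mean(np.einsum('si,ij,sj->s', u, W, u))

W = np.arange(16, dtype=float).reshape(4, 4)
print(hutchinson_trace(W), np.trace(W))   # estimate close to the exact value 30.0
```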
When $f$ is implemented as a neural network, neural ODE methods[17] would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODEs.
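The sketch below sidesteps the neural-network part by using a simple linear vector field $f(z, t) = Az$, for which $\operatorname{Tr}(\partial f/\partial z_t) = \operatorname{Tr}(A)$ is exact, and integrates the state and the log-density jointly with an off-the-shelf ODE solver:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

A = np.array([[0.0, -1.0],
              [0.5, -0.2]])      # linear vector field f(z, t) = A z (illustrative)
T = 1.0

def dynamics(t, state):
    """state = (z, log p); dz/dt = f(z, t), d(log p)/dt = -Tr(df/dz) = -Tr(A)."""
    z = state[:2]
    return np.concatenate([A @ z, [-np.trace(A)]])

z0 = np.array([1.0, -0.5])
logp0 = multivariate_normal(mean=np.zeros(2)).logpdf(z0)
sol = solve_ivp(dynamics, (0.0, T), np.concatenate([z0, [logp0]]), rtol=1e-8)
x, logpx = sol.y[:2, -1], sol.y[2, -1]   # data-space sample and its log-density
```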
There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, thus preserving orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by a continuous deformation of space, and it is impossible to turn a sphere inside out, or to undo a knot). The other is that the learned flow might be ill-behaved, due to degeneracy: there are an infinite number of possible $f$ that all solve the same problem.
By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".[18]
To regularize the flow $f$, one can impose regularization losses. The paper [15] proposed the following regularization loss based on optimal transport theory:
$$\lambda_K \int_0^T \lVert f(z_t, t)\rVert^2\,dt \;+\; \lambda_J \int_0^T \lVert \nabla_z f(z_t, t)\rVert_F^2\,dt,$$
where $\lambda_K, \lambda_J > 0$ are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.
When a probabilistic flow transforms a distribution on an $m$-dimensional smooth manifold embedded in $\mathbb{R}^n$, where $m < n$, and where the transformation is specified as a function $\mathbb{R}^n\to\mathbb{R}^n$, the scaling factor between the source and transformed PDFs is not given by the naive computation of the determinant of the Jacobian (which is zero), but instead by the determinant(s) of one or more suitably defined matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al. (2023),[20] where the more general case of non-isometrically embedded Riemannian manifolds is also treated. Here we restrict attention to isometrically embedded manifolds.
As running examples of manifolds with smooth, isometric embedding in $\mathbb{R}^n$ we shall use:
the probability simplex, $\Delta^{n-1} = \{x\in\mathbb{R}^n : x_i > 0,\ \sum_{i=1}^n x_i = 1\}$, and
the unit sphere, $\mathbb{S}^{n-1} = \{x\in\mathbb{R}^n : \lVert x\rVert = 1\}$.
As a first example of a spherical manifold flow transform, consider the normalized linear transform, which radially projects onto the unit sphere the output of an invertible linear transform, parametrized by the invertible matrix $M\in\mathbb{R}^{n\times n}$:
$$f(x) = \frac{Mx}{\lVert Mx\rVert}.$$
In full Euclidean space, $f$ is not invertible, but if we restrict the domain and co-domain to the unit sphere, then $f$ is invertible (more specifically it is a bijection, a homeomorphism and a diffeomorphism), with inverse $f^{-1}(y) = \frac{M^{-1}y}{\lVert M^{-1}y\rVert}$. The Jacobian of $f$ at $x\in\mathbb{S}^{n-1}$, with $y = f(x)$, is $F = \frac{1}{\lVert Mx\rVert}\bigl(I_n - yy^T\bigr)M$, which has rank $n-1$ and determinant of zero; while, as explained in the subsections below, the factor relating source and transformed densities is: $R(x) = \dfrac{\lvert\det M\rvert}{\lVert Mx\rVert^{\,n}}$.
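A minimal numerical check, with an arbitrary random $M$, that the closed-form Jacobian above matches finite differences and indeed has rank $n-1$ and zero determinant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))
x = rng.normal(size=n); x /= np.linalg.norm(x)        # a point on the unit sphere

f = lambda v: M @ v / np.linalg.norm(M @ v)
y = f(x)
J = (np.eye(n) - np.outer(y, y)) @ M / np.linalg.norm(M @ x)   # closed-form Jacobian

eps = 1e-6
J_fd = np.column_stack([(f(x + eps * e) - y) / eps for e in np.eye(n)])
assert np.allclose(J, J_fd, atol=1e-4)
print(np.linalg.matrix_rank(J), np.linalg.det(J))     # n-1 and (numerically) zero
```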
For $1 \le m < n$, let $\mathcal{M}$ be an $m$-dimensional manifold with a smooth, isometric embedding into $\mathbb{R}^n$. Let $f$ be a smooth flow transform with range restricted to $\mathcal{M}$. Let $x\in\mathcal{M}$ be sampled from a distribution with density $p_X$. Let $y = f(x)$, with resultant (pushforward) density $p_Y$. Let $A\subset\mathcal{M}$ be a small, convex region containing $x$ and let $B = f(A)$ be its image, which contains $y$; then by conservation of probability mass:
$$p_X(x)\operatorname{vol}(A) \approx p_Y(y)\operatorname{vol}(B),$$
where volume (for very small regions) is given by the Lebesgue measure in the $m$-dimensional tangent space. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, which we term the differential volume ratio, $R(x) = \operatorname{vol}(B)/\operatorname{vol}(A)$.
To obtain concrete formulas for volume on the $m$-dimensional manifold, we construct $A$ by mapping an $m$-dimensional rectangle in (local) coordinate space to the manifold via a smooth embedding function, $E:\mathbb{R}^m\to\mathbb{R}^n$. At very small scale, the embedding function becomes essentially linear, so that $A$ is a parallelotope (the multidimensional generalization of a parallelogram). Similarly, the flow transform $f$ becomes linear, so that the image $B = f(A)$ is also a parallelotope. In $\mathbb{R}^m$, we can represent an $m$-dimensional parallelotope with an $m\times m$ matrix whose column vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The volume is given by the absolute value of the determinant of this matrix. If, more generally (as is the case here), an $m$-dimensional parallelotope is embedded in $\mathbb{R}^n$, it can be represented with a (tall) $n\times m$ matrix, say $P$. Denoting the parallelotope as $\langle P\rangle$, its volume is then given by the square root of the Gram determinant:
$$\operatorname{vol}\langle P\rangle = \sqrt{\det(P^T P)}.$$
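A minimal sketch of this volume formula, applied to a unit square isometrically embedded in $\mathbb{R}^3$:

```python
import numpy as np

def parallelotope_volume(P):
    """Volume of the parallelotope spanned by the columns of the tall matrix P."""
    return np.sqrt(np.linalg.det(P.T @ P))

P = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # a unit square embedded in R^3
print(parallelotope_volume(P))   # 1.0
```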
In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.
As a first example, we develop expressions for the differential volume ratio of a simplex flow, $f:\Delta^{n-1}\to\Delta^{n-1}$, where the manifold dimension is $m = n-1$. Define the embedding function:
$$E(\tilde{x}) = \Bigl(\tilde{x}_1, \dots, \tilde{x}_{n-1},\; 1 - \sum_{i=1}^{n-1}\tilde{x}_i\Bigr),$$
which maps a conveniently chosen, $(n-1)$-dimensional representation, $\tilde{x}\in\mathbb{R}^{n-1}$, to the embedded manifold. The Jacobian is $E' = \begin{bmatrix} I_{n-1} \\ -\mathbf{1}^T\end{bmatrix}$. To define $\operatorname{vol}(A)$, the differential volume element at the transformation input ($x = E(\tilde x)$), we start with a rectangle in $\tilde x$-space, having (signed) differential side-lengths $d\tilde x_1,\dots,d\tilde x_{n-1}$, from which we form the square diagonal matrix $D = \operatorname{diag}(d\tilde x_1,\dots,d\tilde x_{n-1})$, the columns of which span the rectangle. At very small scale, we get $A = \langle E'D\rangle$, with:
$$\operatorname{vol}(A) = \sqrt{\det\bigl(D\,E'^T E'\,D\bigr)} = \sqrt{\det(E'^T E')}\;\lvert\det D\rvert = \sqrt{n}\,\prod_{i=1}^{n-1}\lvert d\tilde x_i\rvert.$$
For the 1-simplex (blue) embedded in $\mathbb{R}^2$, when we pull back the Lebesgue measure from the tangent space (parallel to the simplex) via the embedding $E$, with Jacobian $E'$, a scaling factor of $\sqrt{2}$ results.
To understand the geometric interpretation of the factor $\sqrt{n}$, see the example for the 1-simplex in the diagram at right.
The differential volume element at the transformation output ($y = f(x)$) is the parallelotope $B = \langle F E' D\rangle$, where $F$ is the Jacobian of $f$ at $x$. Its volume is:
$$\operatorname{vol}(B) = \sqrt{\det\bigl(D\,E'^T F^T F E'\,D\bigr)} = \sqrt{\det\bigl(E'^T F^T F E'\bigr)}\;\lvert\det D\rvert,$$
so that the factor $\lvert\det D\rvert$ cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the representation function, $r:\Delta^{n-1}\to\mathbb{R}^{n-1}$, which simply extracts the first $n-1$ components and has Jacobian $r' = [\,I_{n-1}\;\;\mathbf{0}\,]$. Observe that, since $E\circ r$ is the identity on the simplex, we have $f\circ E = E\circ\tilde f$ with $\tilde f = r\circ f\circ E$, and the chain rule for function composition gives: $F E' = E'\tilde F$, where $\tilde F = r'FE'$ is the Jacobian of $\tilde f$. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor $\sqrt{\det(E'^T E')} = \sqrt{n}$, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, $\tilde f = r\circ f\circ E$:
$$R(x) = \lvert\det\tilde F\rvert = \bigl\lvert\det(r'\,F\,E')\bigr\rvert,$$
which, if the density $p_X$ on the simplex is known, can be used to derive the pushforward density after a change of variables, $y = f(x)$:
$$p_Y(y) = \frac{p_X(x)}{R(x)}, \qquad x = f^{-1}(y).$$
This formula is valid only because the simplex is flat and the Jacobian, $E'$, is constant. The more general case for curved manifolds is discussed below, after we present two concrete examples of simplex flow transforms.
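Before turning to those examples, here is a minimal numerical check of the simplification above, using as $f$ the illustrative simplex-to-simplex map $f(x) = (w\odot x)/(w^T x)$ (not one of the transforms discussed in this article) and a finite-difference approximation of the ambient Jacobian $F$:

```python
import numpy as np

n = 4
w = np.array([0.5, 1.0, 2.0, 1.5])
f = lambda x: w * x / np.dot(w, x)                        # illustrative simplex flow

Ep = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])     # Jacobian E' of the embedding
rp = np.hstack([np.eye(n - 1), np.zeros((n - 1, 1))])     # Jacobian r' of the representation

x = np.array([0.1, 0.2, 0.3, 0.4])
eps = 1e-6
F = np.column_stack([(f(x + eps * e) - f(x)) / eps for e in np.eye(n)])

gram_ratio = np.sqrt(np.linalg.det(Ep.T @ F.T @ F @ Ep) / np.linalg.det(Ep.T @ Ep))
sandwich = abs(np.linalg.det(rp @ F @ Ep))
print(gram_ratio, sandwich)        # the two agree up to finite-difference error
```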
A calibration transform, $C_{a,b}:\Delta^{n-1}\to\Delta^{n-1}$, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic $n$-class classifier,[21][22] uses the softmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For $x\in\Delta^{n-1}$ and with parameters $a > 0$ and $b\in\mathbb{R}^n$, the transform can be specified as:
$$y = C_{a,b}(x) = \operatorname{softmax}(a\log x + b),$$
where the log is applied elementwise. After some algebra the differential volume ratio can be expressed as:
$$R_{C_{a,b}}(x) = a^{\,n-1}\prod_{i=1}^{n}\frac{y_i}{x_i}, \qquad y = C_{a,b}(x).$$
This result can also be obtained by factoring the density of the SGB distribution,[23] which is obtained by sending Dirichlet variates through $C_{a,b}$.
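A minimal numerical check of this closed-form volume ratio against the sandwiched Jacobian determinant, for arbitrary illustrative parameter values:

```python
import numpy as np
from scipy.special import softmax

n, a, b = 4, 2.5, np.array([0.2, -0.1, 0.0, 0.3])          # illustrative parameters
C = lambda x: softmax(a * np.log(x) + b)                   # the calibration transform C_{a,b}

x = np.array([0.1, 0.2, 0.3, 0.4])
y = C(x)

Ep = np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])
rp = np.hstack([np.eye(n - 1), np.zeros((n - 1, 1))])
eps = 1e-6
F = np.column_stack([(C(x + eps * e) - y) / eps for e in np.eye(n)])

print(abs(np.linalg.det(rp @ F @ Ep)))          # numerical differential volume ratio
print(a ** (n - 1) * np.prod(y) / np.prod(x))   # closed form; the two agree
```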
While calibration transforms are most often trained as discriminative models, the reinterpretation here as a probabilistic flow allows also the design of generative calibration models based on this transform.
The above calibration transform can be generalized to $C_{M,b}$, with parameters $b\in\mathbb{R}^n$ and invertible $M\in\mathbb{R}^{n\times n}$:[24]
$$y = C_{M,b}(x) = \operatorname{softmax}(M\log x + b),$$
where the condition that $M$ has the all-ones vector $\mathbf{1}$ as an eigenvector, say $M\mathbf{1} = \lambda\mathbf{1}$, ensures invertibility by sidestepping the information loss due to the invariance: $\operatorname{softmax}(z + c\mathbf{1}) = \operatorname{softmax}(z)$. Note in particular that $M = aI_n$ is the only allowed diagonal parametrization, in which case (for $n = 2$) we recover $C_{a,b}$, while (for $n > 2$) generalization is possible with non-diagonal matrices. The inverse is:
$$x = C_{M,b}^{-1}(y) = \operatorname{softmax}\bigl(M^{-1}(\log y - b)\bigr).$$
The differential volume ratio is:
$$R_{C_{M,b}}(x) = \frac{\lvert\det M\rvert}{\lvert\lambda\rvert}\prod_{i=1}^{n}\frac{y_i}{x_i}, \qquad y = C_{M,b}(x),$$
where $\lambda$ is the eigenvalue of $M$ associated with the eigenvector $\mathbf{1}$.
If $C_{M,b}$ is to be used as a calibration transform, a further constraint could be imposed that $M$ be positive definite, so that $u^T M u > 0$ for every $u\ne 0$, which avoids direction reversals. (This condition is the generalization of $a > 0$ in the parameter of $C_{a,b}$.)
If $n = 2$ and $M$ is positive definite, then $C_{a,b}$ and $C_{M,b}$ are equivalent in the sense that in both cases the map from the input log-odds, $\log\frac{x_1}{x_2}$, to the output log-odds, $\log\frac{y_1}{y_2}$, is a straight line, the (positive) slope and offset of which are functions of the transform parameters.
Consider a flow, $f$, on a curved manifold, for example $\mathbb{S}^{n-1}$, which we equip with the embedding function $E$ that maps a set of $n-1$ angular spherical coordinates, $\tilde x$, to $\mathbb{S}^{n-1}\subset\mathbb{R}^n$. The Jacobian of $E$ is non-constant and we have to evaluate it at both input ($\tilde x$) and output ($\tilde y$). The same applies to $r$, the representation function that recovers spherical coordinates from points on $\mathbb{S}^{n-1}$, for which we need the Jacobian at the output ($y = f(x)$). The differential volume ratio now generalizes to:
$$R(x) = \frac{\sqrt{\det\bigl(E'(\tilde y)^T E'(\tilde y)\bigr)}}{\sqrt{\det\bigl(E'(\tilde x)^T E'(\tilde x)\bigr)}}\;\Bigl\lvert\det\bigl(r'(y)\,F(x)\,E'(\tilde x)\bigr)\Bigr\rvert.$$
For geometric insight, consider $\mathbb{S}^2$, where the spherical coordinates are the co-latitude, $\theta\in[0,\pi]$, and the longitude, $\phi\in[0,2\pi)$, so that $E(\theta,\phi) = (\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta)$. At $\tilde x = (\theta,\phi)$, we get $\sqrt{\det(E'^T E')} = \sin\theta$, which gives the radius of the circle at that latitude (compare e.g. a polar circle to the equator). The differential volume (surface area on the sphere) is: $\sin\theta\,\lvert d\theta\,d\phi\rvert$.
The above derivation for $R(x)$ is fragile in the sense that, when using fixed functions $E$ and $r$, there may be places where they are not well-defined, for example at the poles of the 2-sphere, where longitude is arbitrary. This problem is sidestepped (using standard manifold machinery) by generalizing to local coordinates (charts), where in the vicinities of $x$ and $y$ we map from local $m$-dimensional coordinates to $\mathcal{M}$ and back using the respective function pairs $(E_x, r_x)$ and $(E_y, r_y)$. We continue to use the same notation for the Jacobians of these functions ($E'(\tilde x)$, $E'(\tilde y)$ and $r'(y)$), so that the above formula for $R(x)$ remains valid.
We can however choose our local coordinate system in a way that simplifies the expression for $R(x)$, and indeed also its practical implementation.[20] Let $\pi$ be a smooth idempotent projection ($\pi\circ\pi = \pi$) from the projectible set, $U\subseteq\mathbb{R}^n$, onto the embedded manifold. For example:
The positive orthant of $\mathbb{R}^n$ is projected onto the simplex as: $\pi_\Delta(x) = \dfrac{x}{\sum_{i=1}^n x_i}$.
Non-zero vectors in $\mathbb{R}^n$ are projected onto the unit sphere as: $\pi_S(x) = \dfrac{x}{\lVert x\rVert}$.
For every $x\in\mathcal{M}$, we require of $\pi$ that its Jacobian, $\Pi(x)$, has rank $m$ (the manifold dimension), in which case $\Pi(x)$ is an idempotent linear projection onto the local tangent space (orthogonal for the unit sphere: $\Pi(x) = I_n - xx^T$; oblique for the simplex: $\Pi(x) = I_n - x\mathbf{1}^T$). The columns of $\Pi(x)$ span the $m$-dimensional tangent space at $x$. We use the notation $T(x)$ for any $n\times m$ matrix with orthonormal columns ($T(x)^T T(x) = I_m$) that span the local tangent space. Also note: $\Pi(x)\,T(x) = T(x)$. We can now choose our local coordinate embedding function, $E_x$:
$$E_x(\tilde x) = \pi\bigl(x + T(x)\,\tilde x\bigr), \qquad E_x(\mathbf{0}) = x, \qquad E'_x(\mathbf{0}) = \Pi(x)\,T(x) = T(x).$$
Since the Jacobian $E'_x(\mathbf{0}) = T(x)$ is injective (full rank: $m$), a local (not necessarily unique) left inverse, say $r_x$ with Jacobian $r'_x$, exists such that $r_x(E_x(\tilde x)) = \tilde x$ and $r'_x(x)\,E'_x(\mathbf{0}) = I_m$. In practice we do not need the left inverse function itself, but we do need its Jacobian, for which the above equation does not give a unique solution. We can however enforce a unique solution for the Jacobian by choosing the left inverse as:
$$r_x(x') = T(x)^T(x' - x), \qquad\text{so that}\qquad r'_x(x) = T(x)^T.$$
We can now finally plug $E'_x(\mathbf{0}) = T(x)$ and $r'_y(y) = T(y)^T$ into our previous expression for $R(x)$, the differential volume ratio, which, because of the orthonormal Jacobians, simplifies to:[25]
$$R(x) = \bigl\lvert\det\bigl(T(y)^T F(x)\,T(x)\bigr)\bigr\rvert.$$
For learning the parameters of a manifold flow transformation, we need access to the differential volume ratio, $R(x)$, or at least to its gradient w.r.t. the parameters. Moreover, for some inference tasks, we need access to $R(x)$ itself. Practical solutions include:
Sorrenson et al. (2023)[20] give a solution for computationally efficient stochastic approximation of the parameter gradient of $\log R(x)$.
For some hand-designed flow transforms, $R(x)$ can be analytically derived in closed form, for example for the above-mentioned simplex calibration transforms. Further examples are given below in the section on simple spherical flows.
On a software platform equipped with linear algebra and automatic differentiation, $R(x) = \lvert\det(T(y)^T F(x)\,T(x))\rvert$ can be automatically evaluated, given access to only $f$.[26] But this is expensive for high-dimensional data, with computational cost at least cubic in $n$. Even then, the slow automatic solution can be invaluable as a tool for numerically verifying hand-designed closed-form solutions.
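A minimal sketch of such an automatic evaluation on the unit sphere, using orthonormal tangent bases obtained from SciPy's null_space and a finite-difference Jacobian, compared here against the closed-form factor $\lvert\det M\rvert / \lVert Mx\rVert^n$ of the normalized linear transform introduced above:

```python
import numpy as np
from scipy.linalg import null_space

def tangent_basis(x):
    """Columns form an orthonormal basis T(x) of the tangent space of the sphere at x."""
    return null_space(x[None, :])

def volume_ratio(f, x, eps=1e-6):
    """R(x) = |det(T(y)^T F(x) T(x))| with a finite-difference ambient Jacobian F."""
    y = f(x)
    F = np.column_stack([(f(x + eps * e) - y) / eps for e in np.eye(len(x))])
    return abs(np.linalg.det(tangent_basis(y).T @ F @ tangent_basis(x)))

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
x = rng.normal(size=n); x /= np.linalg.norm(x)
f = lambda v: M @ v / np.linalg.norm(M @ v)

print(volume_ratio(f, x))
print(abs(np.linalg.det(M)) / np.linalg.norm(M @ x) ** n)   # closed form; the two agree
```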
In the machine learning literature, various complex spherical flows formed by deep neural network architectures may be found.[20] In contrast, this section compiles from the statistics literature the details of three very simple spherical flow transforms, with simple closed-form expressions for their inverses and differential volume ratios. These flows can be used individually, or chained, to generalize distributions on the unit sphere, $\mathbb{S}^{n-1}$. All three flows are compositions of an invertible affine transform in $\mathbb{R}^n$, followed by radial projection back onto the sphere. The flavours we consider for the affine transform are: pure translation, pure linear, and general affine. To make these flows fully functional for learning, inference and sampling, the tasks are:
To derive the inverse transform, with suitable restrictions on the parameters to ensure invertibility.
To derive in simple closed form the differential volume ratio, $R(x)$.
An interesting property of these simple spherical flows is that they don't make use of any non-linearities apart from the radial projection. Even the simplest of them, the normalized translation flow, can be chained to form perhaps surprisingly flexible distributions.
The normalized translation flow, $f_t:\mathbb{S}^{n-1}\to\mathbb{S}^{n-1}$, with parameter $t\in\mathbb{R}^n$, is given by:
$$y = f_t(x) = \frac{x + t}{\lVert x + t\rVert}.$$
The inverse function may be derived by considering, for $\alpha = \lVert x + t\rVert > 0$, that $x = \alpha y - t$, and then using $\lVert x\rVert = 1$ to get a quadratic equation to recover $\alpha$, which gives:
$$x = f_t^{-1}(y) = \alpha y - t, \qquad \alpha = t^T y + \sqrt{(t^T y)^2 + 1 - \lVert t\rVert^2},$$
from which we see that we need $\lVert t\rVert < 1$ to keep $\alpha$ real and positive for all $y\in\mathbb{S}^{n-1}$. The differential volume ratio is given (without derivation) by Boulerice & Ducharme (1994) as:[27]
$$R_{f_t}(x) = \frac{1 + t^T x}{\lVert x + t\rVert^{\,n}}.$$
This can indeed be verified analytically:
By a laborious manipulation of $\bigl\lvert\det\bigl(T(y)^T F(x)\,T(x)\bigr)\bigr\rvert$.
By setting $M = I_n$ in the differential volume ratio of the more general normalized affine flow, which is given below.
Finally, it is worth noting that $f_t^{-1}$ and $f_{-t}$ do not have the same functional form.
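A minimal sketch of the normalized translation flow, its closed-form inverse, and a finite-difference check of the volume ratio above, with an arbitrary parameter satisfying $\lVert t\rVert < 1$:

```python
import numpy as np
from scipy.linalg import null_space

def fwd(x, t):
    return (x + t) / np.linalg.norm(x + t)

def inv(y, t):
    ty = np.dot(t, y)
    alpha = ty + np.sqrt(ty ** 2 + 1 - np.dot(t, t))   # positive root; real when ||t|| < 1
    return alpha * y - t

rng = np.random.default_rng(0)
n = 4
t = rng.normal(size=n); t *= 0.5 / np.linalg.norm(t)   # ||t|| = 0.5 < 1
x = rng.normal(size=n); x /= np.linalg.norm(x)
y = fwd(x, t)
assert np.allclose(inv(y, t), x)

T = lambda v: null_space(v[None, :])                   # orthonormal tangent basis T(v)
eps = 1e-6
F = np.column_stack([(fwd(x + eps * e, t) - y) / eps for e in np.eye(n)])
print(abs(np.linalg.det(T(y).T @ F @ T(x))))           # numerical R(x)
print((1 + np.dot(t, x)) / np.linalg.norm(x + t) ** n) # closed form; the two agree
```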
The normalized linear flow, $f_M:\mathbb{S}^{n-1}\to\mathbb{S}^{n-1}$, where the parameter $M\in\mathbb{R}^{n\times n}$ is an invertible matrix, is given by:
$$y = f_M(x) = \frac{Mx}{\lVert Mx\rVert}.$$
The differential volume ratio is:
$$R_{f_M}(x) = \frac{\lvert\det M\rvert}{\lVert Mx\rVert^{\,n}}.$$
This result can be derived indirectly via the angular central Gaussian (ACG) distribution,[28] which can be obtained via a normalized linear transform of either Gaussian or uniform spherical variates. The first relationship can be used to derive the ACG density by a marginalization integral over the radius, after which the second relationship can be used to factor out the differential volume ratio. For details, see the ACG distribution.