DPP.jl: a Julia package for sampling Determinantal Point Processes

DPP.jl provides types and functions for sampling from DPPs (and related models).

Brief background

A Determinantal Point Process is a random subset $\X$ of a "ground set" $\Omega$. Here we think of $\Omega$ as a set of items or points, and $\X$ is a random subset selected in a way that preserves some of the "diversity" in $\Omega$.

The traditional definition of a DPP, the one you will see in most of the literature, is based on inclusion probabilities. We say $\X$ is a DPP if there exists a matrix $\bK$ (called the marginal kernel) such that, for all fixed subsets $S \subseteq \Omega$:

\[p(S \subseteq \X) = \det \bK_{S} \]

where $\bK_{S}$ is the sub-matrix of $\bK$ indexed by $S$. For instance, if $S=\{i\}$, a single item, then this says that

\[p(i \in \X) = K_{i,i} \]

so the inclusion probabilities for single items can be read out of the diagonal of $\bK$. If $S =\{ i,j\}$, then we have

\[p(S \subseteq \X) = K_{ii} K_{jj} - K_{ij}^2 \]

Because $K_{ij}^2$ is non-negative, the probability of getting both $i$ and $j$ is less than the product of the individual probabilities. This tells us that a DPP is in general repulsive (compared to a Poisson process).

Here is an example of defining a very simple DPP in DPP.jl, over a set of size 2, using the matrix

\[\bK = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}\]

using DPP
#A 2x2 matrix, so here Ω={1,2}
K = [3 1; 1 3]/4
dpp = MarginalDPP(K)
sample(dpp) #𝒳, a random subset
#estimate prob. that item 1 is in 𝒳
sum([1 ∈ sample(dpp) for _ in 1:1000])/1000
#should be approx. equal to K[1,1]
0.74

The name "MarginalDPP" refers to the fact that the DPP is defined based on its marginal kernel. There is one important constraint when defining DPPs based on the marginal kernel $\bK$; the eigenvalues of $\bK$ need to be in $[0,1]$. Because of this, and for reasons of interpretability, it is often more convenient to work with L-ensembles or extended L-ensembles.

L-ensembles are a (large) subset of DPPs with the following property. We say $\X$ is an L-ensemble if there exists a matrix $\bL$ such that:

\[p(\X=X) \propto \det \bL_{X}\]

This equation relates the likelihood (probability mass function) of the DPP to the determinant of a submatrix. Subsets $X$ such that $\det \bL_{X}$ is large have a high probability of being selected. It is quite natural to define an L-ensemble based on a kernel matrix that measures similarity (i.e. where $L_{ij}$ measures the similarity of points $i$ and $j$). Submatrices of $\bL$ indexed by points that are unlike one another (diverse) are closer to the identity matrix and therefore have a larger determinant.
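To make this concrete, here is a minimal illustration (a sketch) on a ground set of two highly similar items: the unnormalised probability of selecting both is much lower than that of selecting either alone.

using LinearAlgebra
L = [1.0 0.9; 0.9 1.0] #two very similar items
det(L[[1],[1]]) #item 1 alone: 1.0
det(L[[2],[2]]) #item 2 alone: 1.0
det(L) #both items: 1 - 0.81 = 0.19, so diversity is favoured

The next example shows a more realistic use of DPP.jl: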

using LinearAlgebra
x = randn(2,500) #some points in dim 2

#compute a kernel matrix for the points in x
L = [ exp(-norm(a-b)^2) for a in eachcol(x), b in eachcol(x) ]
dpp = EllEnsemble(L) #form an L-ensemble based on the 𝐋 matrix
rescale!(dpp,50) #scale so that the expected size is 50
ind = sample(dpp) #a sample from the DPP (indices)

using Plots
scatter(x[1,:],x[2,:],marker_z = map((v) -> v ∈ ind, 1:size(x,2)),legend=:none,alpha=.75)

The line rescale!(dpp,50) sets the expected size of the L-ensemble to 50. One could also use a fixed-size DPP, and call sample(dpp,50) instead, which always returns a subset of size 50. For more on DPPs, L-ensembles, etc., see Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard (2021).
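For instance, continuing with the dpp object defined above, a fixed-size sample can be drawn as follows (a sketch; sample with a size argument is documented below):

ind = sample(dpp,50)
length(ind) #always 50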

Using other kernels

EllEnsemble requires a symmetric, non-negative definite matrix as input. For kernels more exotic than the Gaussian, you can either compute the matrix by hand or use KernelFunctions.jl, which DPP.jl supports.

using KernelFunctions
x = randn(2,100)
L = EllEnsemble(ColVecs(x),ExponentialKernel())
L-Ensemble.
Number of items in ground set : 100. Max. rank : 100. Rescaling constant α=1.0

Here we need to specify whether the $2\times 100$ matrix 'x' should be considered to represent 100 points in dimension 2, or 2 points in dimension 100. ColVecs specifies the former, RowVecs the latter. This mechanism is borrowed from KernelFunctions and used in other places as well.
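The following sketch shows the two interpretations side by side:

using KernelFunctions
x = randn(2,100)
EllEnsemble(ColVecs(x),ExponentialKernel()) #100 items in dimension 2
EllEnsemble(RowVecs(x),ExponentialKernel()) #2 items in dimension 100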

See the documentation of KernelFunctions.jl for a list of available kernels.

Using low-rank matrices for speed

L-ensembles defined using a full-rank matrix are expensive for large $n$, because they require an eigendecomposition of the $\bL$ matrix (at cost $\O(n^3)$). For practical applications with large $n$ it is preferable to use a low-rank ensemble, i.e. one such that $\bL = \bM \bM^t$ with $\bM$ an $n \times m$ matrix, $m \ll n$.

DPP.jl provides a type called "LowRank" that represents a symmetric low-rank matrix efficiently:

Z = randn(5,2)
K = Z*Z' #a rank 2 matrix of size 5x5
K_lr = LowRank(Z)
all(K .≈ K_lr)
true

It provides a specialised implementation of eigen that only returns the non-null eigenvalues. Here is an example:

eigen(K_lr).values
2-element Vector{Float64}:
 2.1106966534426714
 6.892307131610616

Because K has rank 2, there are only two eigenvalues.

The constructor for EllEnsemble accepts matrices of LowRank type as arguments:

EllEnsemble(K_lr)
L-Ensemble.
Number of items in ground set : 5. Max. rank : 2. Rescaling constant α=1.0

The advantage of using LowRank is that the cost of forming the L-ensemble drops from $\O(n^3)$ to $\O(nm^2)$, where $m$ is the rank. Note that the maximum size of the DPP cannot exceed $m$ in this case.
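Here is a small sketch of the size constraint: with a rank-2 LowRank matrix, samples never contain more than two items.

Z = randn(1000,2)
lr2 = EllEnsemble(LowRank(Z)) #eigendecomposition costs O(nm²) here
rescale!(lr2,1)
maximum(length(sample(lr2)) for _ in 1:100) #at most 2, the rank of 𝐋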

The rff function computes a low-rank approximation to a Gaussian kernel matrix using Random Fourier Features, Ali Rahimi, Benjamin Recht (2007):

x = randn(2,1000) #some points in dim 2
L_rff = rff(ColVecs(x),150,.5) #second argument determines rank, third is standard deviation of Gaussian kernel
L_exact = gaussker(ColVecs(x),.5)
lr = EllEnsemble(LowRank(L_rff)) #form a low-rank L-ensemble from the RFF factor
lex = EllEnsemble(L_exact)

The plot shows the approximation of the spectrum of the kernel matrix by the low-rank approximation.
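Such a plot can be produced along the following lines (a sketch, using the λ field that stores the spectrum, as elsewhere in this document):

using Plots
plot(sort(lex.λ,rev=true),label="exact spectrum")
plot!(sort(lr.λ,rev=true),label="RFF approximation")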

Using low-rank representations DPP.jl can scale up to millions of points. Keep in mind that DPPs have good scaling in the size of $\Omega$ (n) but poor scaling in the rank ($m$, number of columns of $\bM$). The overall cost scales as $\O(nm^2)$, so $m$ should be kept in the hundreds at most.

As an alternative to Random Fourier Features, we also provide an implementation of the Nyström approximation Christopher Williams, Matthias Seeger (2001). The function again returns a matrix $\bM$ such that $\bL \approx \bM \bM^t$, but the approximation is formed using a subset of points.

x = randn(3,1000)
M_n =  nystrom_approx(ColVecs(x),SqExponentialKernel(),50) #use 50 points
L = EllEnsemble(ColVecs(x),SqExponentialKernel())
L_n = EllEnsemble(LowRank(M_n))

plot(sort(L.λ,rev=true),label="exact spectrum")
plot!(sort(L_n.λ,rev=true),label="Nyström approximation")

Not all subsets give good Nyström approximations. You can indicate a specific subset to use:

nystrom_approx(ColVecs(x),SqExponentialKernel(),1:50); #use first 50 points

Instead of using low-rank approximations to kernel matrices, you can also design your own features:

x = randn(3,1000)
#some non-linear features we've just made up
feat = (xi,xj,a) -> @. exp(-a*xi*xj)*cos(xi+xj)
ftrs = [ feat(vec(x[i,:]),vec(x[j,:]),a) for i in 1:3, j in 1:3, a in [0,1,2] if i >= j ]
M = reduce(hcat,ftrs)
ll = EllEnsemble(LowRank(M))
rescale!(ll,14)
sample(ll)

A sensible set of features to use are multivariate polynomial features, used here to set up a ProjectionEnsemble (a special case of a low-rank DPP that has fixed sample size, equal to the rank of $\bM$):

x = randn(2,1000)
Lp = polyfeatures(ColVecs(x),10) |> ProjectionEnsemble
ind = sample(Lp)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5)
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1)

Inclusion probabilities

An attractive aspect of DPPs is that inclusion probabilities are easy to compute. An inclusion probability is the probability that a certain item (or items) is included in the random set produced by a DPP.

using StatsBase,Statistics
#sample 1,000 times and compute empirical inclusion frequencies
reps = [StatsBase.counts(sample(Lp),1:Lp.n) for _ in 1:1000];
#compare to theoretical values
scatter(inclusion_prob(Lp),mean(reps))

So far these are just first-order inclusion probabilities. More generally, you can obtain higher-order inclusion probabilities (i.e. the probability that items $i,j,k,\ldots$ are jointly in the set) from the marginal kernel of the DPP, given by "marginal_kernel".

In the next example we compute the empirical inclusion probability of a set of items:

using Statistics
x = randn(2,10)
L = DPP.gaussker(ColVecs(x),.5) |> EllEnsemble
rescale!(L,4)
set = [3,5]

incl = [ length(intersect(set,sample(L)))==length(set) for _ in 1:10000];
#empirical inclusion prob.
emp = mean(incl)
0.1899

The theoretical value is given by

th = det(marginal_kernel(L)[set,set])
0.19209489048181533

Other sampling methods

DPP.jl offers other sampling methods that are based on inter-point distances.

In these algorithms, the initial point is selected uniformly. At each subsequent step, the next point is selected based on its distance to the current set $\X(t)$, defined as $d(x,\X(t)) = \min \{ d(x,x_i) \mid x_i \in \X(t) \}$. The selection probability depends on the method. In farthest-point sampling, which is deterministic, the point selected at each step is the one farthest from all currently selected points. In D²-sampling, Sergei Vassilvitskii, David Arthur (2006), a relaxed stochastic version of farthest-point sampling, points are selected with probability proportional to their squared distance to the current set.

x = rand(2,1000);
ind = distance_sampling(ColVecs(x),40,:farthest)
scatter(x[1,ind],x[2,ind],title="Farthest-point sample")
ind = distance_sampling(ColVecs(x),40,:d2)
scatter(x[1,ind],x[2,ind],title="D² sample")

You can obtain other methods by changing how the prob. of selection depends on the distance. For instance, selecting points uniformly as long as they are more than distance $r$ away from the other points gives a so-called "hard-sphere" sample.

ind = distance_sampling(ColVecs(x),40,(v)-> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hard-sphere sample")

Distance-based sampling is quite general: all it needs is a (pseudo-)distance matrix.

D = [sum(abs.(a-b)) for a in eachcol(x), b in eachcol(x)] #L1 distance
ind = distance_sampling(D,40,(v)-> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hard-sphere sample in L1 dist")

For datasets that are large, it may not be wise (or even possible) to pre-compute and hold a full distance matrix in memory. You can use the LazyDist type, which behaves like a standard matrix, but whose entries are computed on-the-fly and not stored:

D = LazyDist(ColVecs(x),(a,b) -> sum(abs.(a-b)))
D[3,1] #not pre-computed!
ind = distance_sampling(D,40,(v)-> v > .5) #fewer than 40 points, because of the distance constraint
5-element Vector{Int64}:
 494
 610
 622
 769
 784

Extended L-ensembles

Extended L-ensembles are representations of DPPs that extend L-ensembles, introduced in Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard (2021). They are defined by a pair of matrices $\bL$ and $\bV$, such that

\[p(\X=X) \propto \det \begin{pmatrix} \bL_{X} & \bV_{X,:} \\ \bV_{X,:}^t & \mathbf{0} \end{pmatrix}\]

$\bV$ should be thought of as defining "mandatory" features of the DPP, while $\bL$ can be interpreted more or less as a regular kernel; see Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard (2021).

DPP.jl provides some basic support for defining extended L-ensembles. The following is the "default" DPP described in Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard (2021), at order 3:

x = randn(2,1000)
L = [norm(a-b)^3 for a in eachcol(x), b in eachcol(x)]
V = polyfeatures(ColVecs(x),1)
ele = ExtEnsemble(L,V)
rescale!(ele,50)
ind = sample(ele)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5)
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1,legend=:none)

Functions and types

DPP.cardinal - Method
cardinal(L::AbstractLEnsemble)

The size of the set sampled by a DPP is a random variable. This function returns its mean and standard deviation. See also: rescale!, which changes the mean set size.
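A usage sketch (gaussker and rescale! are documented below):

L = EllEnsemble(gaussker(ColVecs(randn(2,200)),.5))
rescale!(L,20)
cardinal(L) #mean ≈ 20, together with the standard deviation of the set size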

DPP.distance_sampling - Function
distance_sampling(x,m,sampling)

Select a random subset of size m from x based on a greedy distance criterion. The initial point is selected uniformly. Then, if sampling == :farthest, at each step, the point selected is one that is farthest from all currently selected points. If sampling == :d2, the algorithm is D²-sampling, Sergei Vassilvitskii, David Arthur (2006), a relaxed stochastic version of farthest-point sampling (selecting points with probability proportional to squared distance).

Example:

x = rand(2,200)
ind = distance_sampling(x,40,:farthest)
scatter(x[1,:],x[2,:],marker_z = map((v) -> v ∈ ind, 1:size(x,2)),legend=:none)

DPP.esp - Function
esp(L::AbstractLEnsemble,approx=false)

Compute the elementary symmetric polynomials of the L-ensemble L, e₁(L) ... eₙ(L). e₁(L) is the trace and eₙ(L) is the determinant. The ESPs determine the distribution of the sample size of a DPP:

$p(|X| = k) = \frac{e_k}{\sum_{i=1}^n e_i}$

The default algorithm uses the Newton equations, but may be numerically unstable for large n. If approx=true, a stable saddle-point approximation (as in Barthelmé et al. (2019)) is used instead for all eₖ with k>5.
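A small sketch, assuming esp returns the vector (e₁, …, eₙ):

using LinearAlgebra
Z = randn(5,2)
L = EllEnsemble(Z*Z')
e = esp(L)
e[1] ≈ tr(Z*Z') #e₁ equals the trace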

DPP.gaussker - Method
gaussker(X,σ)

Compute the Gaussian kernel matrix for points X and parameter σ, i.e. a matrix with entry $i,j$ equal to $\exp(-\frac{\|x_i-x_j\|^2}{2σ^2})$.

If σ is missing, it is set using the median heuristic. If the number of points is very large, the median is estimated on a random subset.

x = randn(2,6)
gaussker(ColVecs(x),.1)

See also: rff, kernelmatrix (from KernelFunctions.jl)

DPP.greedy_subset - Method
greedy_subset(L :: AbstractLEnsemble,k)

Perform greedy subset selection: find a subset $\mathcal{X}$ of size $k$, such that $\det \bL_{\mathcal{X}}$ is high, by building the set item-by-item and picking at each step the item that maximises the determinant. You can view this as finding an (approximate) mode of the DPP with L-ensemble L.

If $k$ is too large relative to the (numerical) rank of L, the problem is not well-defined and the algorithm will stop prematurely.

Example

X = randn(2,100) #100 points in dim 2
L = EllEnsemble(LowRank(polyfeatures(ColVecs(X),2)))
greedy_subset(L,6)
DPP.inclusion_prob - Method
inclusion_prob(L::AbstractLEnsemble,k)

First-order inclusion probabilities in a k-DPP with L-ensemble L. Uses a (typically very accurate) saddlepoint approximation from Barthelmé, Amblard, Tremblay (2019).

DPP.inclusion_prob - Method

inclusion_prob(L::AbstractLEnsemble)

Compute first-order inclusion probabilities, i.e. the probability that each item in 1..n is included in the DPP.

See also: marginal_kernel
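A usage sketch: the first-order inclusion probabilities should match the diagonal of the marginal kernel.

using LinearAlgebra
L = EllEnsemble(gaussker(ColVecs(randn(2,50)),.5))
inclusion_prob(L) ≈ diag(marginal_kernel(L)) #true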

DPP.kl_divergence - Method
kl_divergence(L1::AbstractLEnsemble,L2::AbstractLEnsemble,k::Int;nsamples=100)

Estimate the KL divergence between two (k-)DPPs. The KL divergence is estimated by sampling from the k-DPP with L-ensemble L1 and computing the mean log-ratio of the probabilities. nsamples controls how many samples are taken.
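A usage sketch, comparing two Gaussian-kernel L-ensembles with different bandwidths:

x = randn(2,100)
L1 = EllEnsemble(gaussker(ColVecs(x),.5))
L2 = EllEnsemble(gaussker(ColVecs(x),1.))
kl_divergence(L1,L2,10;nsamples=200) #estimated KL divergence between the two 10-DPPs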

DPP.marginal_kernel - Method
marginal_kernel(L::AbstractLEnsemble)

Compute and return the marginal kernel of a DPP, K. The marginal kernel of a DPP is a (n x n) matrix which can be used to find the inclusion probabilities. For any fixed set of indices ind, the probability that ind is included in a sample from the DPP equals det(K[ind,ind]).
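A usage sketch:

L = EllEnsemble(gaussker(ColVecs(randn(2,50)),.5))
K = marginal_kernel(L)
K[1,1] #inclusion probability of item 1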

DPP.nystrom_approx - Method
nystrom_approx(x :: AbstractVector,ker :: Kernel,ind)

Compute a low-rank approximation of a kernel matrix (with kernel "ker") using the rows and columns indexed by "ind".

using KernelFunctions
x = rand(2,100)
K = kernelmatrix(SqExponentialKernel(),ColVecs(x))
#build a rank 30 approx. to K
V = nystrom_approx(ColVecs(x),SqExponentialKernel(),1:30)
norm(K-V*V') #should be small
DPP.polyfeatures - Method
polyfeatures(x,order)

Compute monomial features up to a certain degree. For instance, if the input consists of vectors in ℝ², then the monomials of degree ≤ 2 are 1, x₁, x₂, x₁x₂, x₁², x₂². Note that the number of monomials of degree at most $r$ in dimension $d$ equals ${ d+r \choose r}$.

If the input points are in dimension d > 1, indicate whether the input x is d × n or n × d using ColVecs or RowVecs.

The output has points along rows and monomials along columns.

Examples

x = randn(2,10) #10 points in dim 2
polyfeatures(ColVecs(x),2) #Output has 10 rows and 6 columns
polyfeatures(RowVecs(x),2) #Output has 2 rows and 66 columns
DPP.rescale! - Function

rescale!(L,k)

$\DeclareMathOperator{\Tr}{Tr}$

Rescale the L-ensemble such that the expected number of samples equals k. The expected number of samples of a DPP equals $\Tr\left[ \mathbf{L}\left( \mathbf{L} + \mathbf{I} \right)^{-1}\right]$. The function rescales $\mathbf{L}$ to $\alpha \mathbf{L}$ such that $\Tr\left[ \alpha \mathbf{L}\left( \alpha \mathbf{L} + \mathbf{I} \right)^{-1}\right] = k$.
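A quick check, using cardinal (documented above) to read off the expected size after rescaling:

L = EllEnsemble(gaussker(ColVecs(randn(2,500)),.5))
rescale!(L,25)
cardinal(L) #the mean set size should now be 25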

DPP.rff - Method
rff(X,m,σ)

Compute Random Fourier Features Ali Rahimi, Benjamin Recht (2007) for the Gaussian kernel matrix with input points X and parameter σ. Returns a random matrix M such that, in expectation $\mathbf{MM}^t = \mathbf{K}$, the Gaussian kernel matrix. M has 2*m columns. The higher m, the better the approximation.

Examples

x = randn(2,10) #10 points in dim 2
rff(ColVecs(x),4,1.0)

See also: gaussker, kernelmatrix

DPP.sample - Method
sample(L::AbstractLEnsemble,k)

Sample a k-DPP, i.e. a DPP with fixed size. k needs to be strictly smaller than the rank of L (if it equals the rank of L, use a ProjectionEnsemble).

The algorithm uses a saddle-point approximation adapted from Barthelmé, Amblard, Tremblay (2019).
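A usage sketch:

L = EllEnsemble(gaussker(ColVecs(randn(2,100)),.5))
ind = sample(L,10) #a random subset of exactly 10 items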

DPP.sample - Method
sample(L::AbstractEnsemble)

Sample from a DPP with L-ensemble L. The return type is a BitSet (indicating which indices are sampled), use collect to get a vector of indices instead.
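A usage sketch:

L = EllEnsemble(gaussker(ColVecs(randn(2,100)),.5))
rescale!(L,10)
collect(sample(L)) #a vector of indices, of random size (10 on average)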

DPP.EllEnsemble - Type

EllEnsemble{T}

This type represents an L-ensemble, a (broad) special case of Determinantal Point Process.

The type parameter corresponds to the type of the entries in the matrix given as input (most likely, double precision floats).

DPP.EllEnsemble - Method

EllEnsemble(X::Matrix{T},k :: Kernel)

Construct a full-rank ensemble from a set of points and a kernel function.

X is a vector of column vectors (ColVecs) or a vector of row vectors (RowVecs). k is a kernel (see the documentation for the KernelFunctions.jl package).

Example: points in 2d along the circle, and an exponential kernel

t = LinRange(-pi,pi,10)'
X = vcat(cos.(t),sin.(t))
using KernelFunctions
L=EllEnsemble(ColVecs(X),ExponentialKernel())
DPP.EllEnsemble - Method
EllEnsemble(L :: AbstractMatrix{T})

Construct an L-ensemble from a (symmetric, non-negative definite) matrix L.

    Z = randn(5,2)
    EllEnsemble(Z*Z') #not very useful, presumably

Note that eigen(L) will be called at construction, which may be computationally costly.

An L-Ensemble can also be constructed based on lazy matrix types, i.e. types that leverage a low-rank representation. In the example above we could also use the LowRank type:

    Z = randn(5,2)
    EllEnsemble(LowRank(Z)) #more efficient than Z*Z'
DPP.LazyDist - Method
LazyDist(x)

Lazy representation of a distance matrix, meaning that the object can be indexed like a matrix, but the values are computed on-the-fly. This is useful whenever an algorithm only requires some of the entries, or when the dataset is very large.


x = randn(2,100)
D = LazyDist(x) #Euclidean dist. by default
D[3,2] #the getindex method is defined
D = LazyDist(x,(a,b) -> sum(abs.(a-b))) #L1 distance
D[3,2]
DPP.MarginalDPP - Method

MarginalDPP(V::Matrix{T})

Construct a DPP from a matrix defining the marginal kernel. Here the matrix must be square and its eigenvalues must be between 0 and 1.

DPP.ProjectionEnsemble - Method

ProjectionEnsemble(V::Matrix{T},orth=true)

Construct a projection ensemble from a matrix of features. Here we assume $\mathbf{L} = \mathbf{V}\mathbf{V}^t$, so that V must be $n \times r$, where n is the number of items and r is the rank. V need not be orthogonal. If orth is set to true (the default), a QR decomposition is performed. If V is already orthogonal, this computation may be skipped, and you can set orth to false.
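A usage sketch, assuming the feature matrix has full column rank (so the sample size equals its number of columns):

x = randn(2,1000)
V = polyfeatures(ColVecs(x),5)
Lp = ProjectionEnsemble(V)
length(sample(Lp)) == size(V,2) #samples have fixed size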

References

rahimi2007random
Ali Rahimi, Benjamin Recht, Random Features for Large-Scale Kernel Machines, In NIPS, 5, 2007.
vassilvitskii2006k
Sergei Vassilvitskii, David Arthur, k-means++: The advantages of careful seeding, In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035, 2006.
williams2001using
Christopher Williams, Matthias Seeger, Using the Nyström method to speed up kernel machines, In Proceedings of the 14th annual conference on neural information processing systems, 682–688, 2001.
tremblay2021extended
Nicolas Tremblay, Simon Barthelmé, Konstantin Usevich, Pierre-Olivier Amblard, Extended L-ensembles: a new representation for Determinantal Point Processes, arXiv preprint arXiv:2107.06345, 2021.