DPP.jl: a Julia package for sampling Determinantal Point Processes
DPP.jl provides some types and functions for sampling from DPPs (and related mod els).
Brief background
A Determinantal Point Process is a random subset $\X$ of a "ground set" $\Omega$. Here we think of $\Omega$ as a set of items or points, and $\X$ is a random subset selected in a way that preserves some of the "diversity" in $\Omega$.
The traditional definition of a DPP you will see in most of the literature is based on inclusion probabilities. We say $\X$ is a DPP if for all fixed subsets $S \subseteq \Omega$ there exists a matrix $\bK$ (called the marginal kernel) such that:
\[p(S \subseteq \X) = \det \bK_{S} \]
where $\bK_{S}$ is the submatrix of $\bK$ indexed by $S$. For instance, if $S=\{i\}$, a single item, then this says that
\[p(i \in \X) = K_{i,i} \]
so the inclusion probabilities for single items can be read out of the diagonal of $\bK$. If $S =\{ i,j\}$, then we have
\[p(S \subseteq \X) = K_{ii} K_{jj}  K_{ij}^2 \]
Because $K_{ij}^2$ is nonnegative, this means that the probability of getting both $i$ and $j$ is less than the product of the invididual probabilities. This tells us that a DPP is in general repulsive (compared to a Poisson process).
Here is an example of defining a very simple DPP in DPP.jl, over a set of size 2, using the matrix
\[\bK = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}\]
using DPP
#A 2x2 matrix, so here Ω={1,2}
K = [3 1; 1 3]/4
dpp = MarginalDPP(K)
sample(dpp) #𝒳, a random subset
#estimate prob. that item 1 is in 𝒳
sum([1 ∈ sample(dpp) for _ in 1:1000])/1000
#should be approx. equal to K[1,1]
0.74
The name "MarginalDPP" refers to the fact that the DPP is defined based on its marginal kernel. There is one important constraint when defining DPPs based on the marginal kernel $\bK$; the eigenvalues of $\bK$ need to be in $[0,1]$. Because of this, and for reasons of interpretability, it is often more convenient to work with Lensembles or extended Lensembles.
Lensembles are a (large) subset of DPPs with the following property. We say $\X$ is a Lensemble if there exists $\bL$ such that:
\[p(\X=X) \propto \det \bL_{X}\]
This equation relates the likelihood (probability mass function) of the DPP to the determinant of a submatrix. Subsets s.t. $\bL_{X}$ is large have a high probability of being selected. It is quite natural to define a Lensemble based on kernel matrix that measures similarity (i.e. where $L_{ij}$ measures the similarity of points $i$ and $j$). Submatrices of $\bL$ with points that are unlike one another (diverse) will be closer to the identity matrix and therefore have a higher determinant. The next example shows a more realistic use of DPP.jl
using LinearAlgebra
x = randn(2,500) #some points in dim 2
#compute a kernel matrix for the points in x
L = [ exp(norm(ab)^2) for a in eachcol(x), b in eachcol(x) ]
dpp = EllEnsemble(L) #form an Lensemble based on the 𝐋 matrix
rescale!(dpp,50) #scale so that the expected size is 50
ind = sample(dpp) #a sample from the DPP (indices)
using Plots
scatter(x[1,:],x[2,:],marker_z = map((v) > v ∈ ind, 1:size(x,2)),legend=:none,alpha=.75) #hide
The line rescale!(dpp,50)
sets the expected size of the Lensemble to 50. One could also decide to use a fixedsize DPP, and call instead sample(dpp,50)
, which always returns a subset of size 50. For more on DPPs, Lensembles, etc. see Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, PierreOlivier Amblard (2021).
Using other kernels
EllEnsemble requires a SPD matrix as input. For more exotic kernels than the Gaussian, you can either do things by hand or use KernelFunctions.jl, which DPP.jl supports.
using KernelFunctions
x = randn(2,100)
L = EllEnsemble(ColVecs(x),ExponentialKernel())
LEnsemble.
Number of items in ground set : 100. Max. rank : 100. Rescaling constant α=1.0
Here we need to specify whether the $2\times 100$ matrix 'x' should be considered to represent 100 points in dimension 2, or 2 points in dimension 100. ColVecs specifies the former, RowVecs the latter. This mechanism is borrowed from KernelFunctions and used in other places as well.
See the documentation of KernelFunctions.jl for a list of available kernels.
Using lowrank matrices for speed
Lensembles defined using a fullrank matrix are expensive in large $n$, because they require an eigendecomposition of the $\bL$ matrix (at cost $\O(n^3)$). For practical applications in large $n$ it is preferable to use a lowrank ensemble, i.e. one such that $\bL = \bM \bM^t$ with $\bM$ a $n$ times $m$ matrix with $m \ll n$.
DPP.jl provides a type called "LowRank" that represents a symmetric lowrank matrix efficiently:
Z = randn(5,2)
K = Z*Z' #a rank 2 matrix of size 5x5
K_lr = LowRank(Z)
all(K .≈ K_lr)
true
It provides a specialised implementation of eigen that only returns the nonnull eigenvalues. Here is an example:
eigen(K_lr).values
2element Vector{Float64}:
2.1106966534426714
6.892307131610616
Because K has rank 2, there are only two eigenvalues.
The constructor for EllEnsemble accepts matrices of LowRank type as arguments:
EllEnsemble(K_lr)
LEnsemble.
Number of items in ground set : 5. Max. rank : 2. Rescaling constant α=1.0
The advantage of using LowRank is that the cost of forming the Lensemble drops from $\O(n^3)$ to $\O(nm^2)$, where $m$ is the rank. Note that the maximum size of the DPP cannot exceed $m$ in this case.
The rff functions computes a lowrank approximation to a Gaussian kernel matrix using Random Fourier Features, Ali Rahimi, Benjamin Recht (2007)
x = randn(2,1000) #some points in dim 2
L_rff = rff(ColVecs(x),150,.5) #second argument determines rank, third is standard deviation of Gaussian kernel
L_exact = gaussker(ColVecs(x),.5)
lr=EllEnsemble(L_rff)
lex = EllEnsemble(L_exact)
The plot shows the approximation of the spectrum of the kernel matrix by the lowrank approximation.
Using lowrank representations DPP.jl can scale up to millions of points. Keep in mind that DPPs have good scaling in the size of $\Omega$ (n) but poor scaling in the rank ($m$, number of columns of $\bM$). The overall cost scales as $\O(nm^2)$, so $m$ should be kept in the hundreds at most.
As an alternative to Random Fourier Features, we also provide an implementation of the Nyström approximation Christopher Williams, Matthias Seeger (2001). The function again returns a matrix $\bM$ such that $\bL \approx \bM \bM^t$, but the approximation is formed using a subset of points.
x = randn(3,1000)
M_n = nystrom_approx(ColVecs(x),SqExponentialKernel(),50) #use 50 points
L = EllEnsemble(ColVecs(x),SqExponentialKernel())
L_n = EllEnsemble(M_n)
plot!(sort(L_n.λ,rev=:true),legend=:false) # hide
Not all subsets give good Nyström approximations. You can indicate a specific subset to use:
nystrom_approx(ColVecs(x),SqExponentialKernel(),1:50); #use first 50 points
Instead of using lowrank approximations to kernel matrices, you can also design your own features
x = randn(3,1000)
#some nonlinear features we've just made up
feat = (xi,xj,a) > @. exp(a*xi*xj)*cos(xi+xj)
ftrs = [ feat(vec(x[i,:]),vec(x[j,:]),a) for i in 1:3, j in 1:3, a in [0,1,2] if i >= j ]
M = reduce(hcat,ftrs)
ll = EllEnsemble(LowRank(M))
rescale!(ll,14)
sample(ll)
A sensible set of features to use are multivariate polynomial features, here used to set up a ProjectionEnsemble (a special case of a lowrank DPP that has fixed sample size, equal to rank of $\bM$)
x = randn(2,1000)
Lp = polyfeatures(ColVecs(x),10) > ProjectionEnsemble
ind = sample(Lp)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5) # hide
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1) # hide
Inclusion probabilities
An attractive aspect of DPPs is that inclusion probabilities are easy to compute. An inclusion probability is the probability that a certain item (or items) is included in the random set produced by a DPP.
using StatsBase,Statistics
#sample 1,000 times and compute empirical inclusion frequencies
reps = [StatsBase.counts(sample(Lp),1:Lp.n) for _ in 1:1000];
#compare to theoretical values
scatter(inclusion_prob(Lp),mean(reps))
So far these are just firstorder inclusion probabilities. More generally, you can obtain higherorder probabilities (ie prob that items i,j,k,... are in the set jointly) from the marginal kernel of the DPP, given by "marginal_kernel"
In the next example we compute the empirical inclusion probability of a set of items:
using Statistics
x = randn(2,10)
L = DPP.gaussker(ColVecs(x),.5) > EllEnsemble
rescale!(L,4)
set = [3,5]
incl = [ length(intersect(set,sample(L)))==length(set) for _ in 1:10000];
#empirical inclusion prob.
emp = mean(incl)
0.1899
The theoretical value is given by
th = det(marginal_kernel(L)[set,set])
0.19209489048181533
Other sampling methods
DPP.jl offers other sampling methods that are based on interpoint distances.
In these algorithms, the initial point is selected uniformly. At all subsequent steps, the next point is selected based on its distance to the current set $\X(t)$, meaning $d_t(x,\X(t)) = \min \{ d(x,x_i)  x_i \in \X(t) \}$. The sampling probability depends on the method. In farthestpoint sampling, which is deterministic, at each step, the point selected is one that is farthest from all currently selected points. In D²sampling Sergei Vassilvitskii, David Arthur (2006), which is a relaxed stochastic version of farthestpoint sampling, points are selected with prob. proportional to squared distance to the current set.
x = rand(2,1000);
ind = distance_sampling(ColVecs(x),40,:farthest)
scatter(x[1,ind],x[2,ind],title="Farthestpoint sample")
ind = distance_sampling(ColVecs(x),40,:d2)
scatter(x[1,ind],x[2,ind],title="D² sample")
You can obtain other methods by changing how the prob. of selection depends on the distance. For instance, selecting points uniformly as long as they are more than distance $r$ away from the other points gives a socalled "hardsphere" sample.
ind = distance_sampling(ColVecs(x),40,(v)> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hardsphere sample")
Distancebased sampling is quite general, all it needs is a (pseudo)distance matrix.
D = [sum(abs.(ab)) for a in eachcol(x), b in eachcol(x)] #L1 distance
ind = distance_sampling(D,40,(v)> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hardsphere sample in L1 dist")
For datasets that are large, it may not be wise (or even possible) to precompute and hold a full distance matrix in memory. You can use the LazyDist type, which behaves like a standard matrix, but whose entries are computed onthefly and not stored:
D = LazyDist(ColVecs(x),(a,b) > sum(abs.(ab)))
D[3,1] #not precomputed!
ind = distance_sampling(D,40,(v)> v >.5);
5element Vector{Int64}:
494
610
622
769
784
Extended Lensembles
Extended Lensembles are representations of DPPs that extend Lensembles, introduced in Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, PierreOlivier Amblard (2021). They are defined by a pair of matrices $\bL$ and $\bV$, such that
\[p(\X=X) \propto \det \begin{pmatrix} \bL_{X} & \bV_{X,:} \\ \bV_{X,:}^t & \mathbf{0} \end{pmatrix}\]
\[\bV\]
should be thought of as defining "mandatory" features of the DPP, while $\bL$ can be interpreted more or less as a regular kernel, see Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, PierreOlivier Amblard (2021).
DPP.jl provides some basic support for defining extended Lensembles. The following is the "default" DPP described in Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, PierreOlivier Amblard (2021), at order 3.
x = randn(2,1000)
L = [norm(ab).^3 for a in eachcol(x), b in eachcol(x)]
V = polyfeatures(ColVecs(x),1)
ele = ExtEnsemble(L,V)
rescale!(ele,50)
ind = sample(ele)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5) # hide
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1,legend=:none) # hide
Functions and types
DPP.cardinal
— Methodcardinal(L::AbstractLEnsemble)
The size of the set sampled by a DPP is a random variable. This function returns its mean and standard deviation. See also: rescale!, which changes the mean set size.
DPP.distance_sampling
— Functiondistance_sampling(x,m,sampling)
Select a random subset of size m from x based on a greedy distance criterion. The initial point is selected uniformly. Then, if sampling == :farthest, at each step, the point selected is one that is farthest from all currently selected points. If sampling == :d2, the algorithm is D²sampling [vassilvitskii2006k](@vassilvitskii2006k), which is a relaxed stochastic version of farthestpoint sampling (selecting points with prob. proportional to squared distance).
@example x = rand(2,200); ind = distance_sampling(x,40,:farthest) scatter(x[1,:],x[2,:],marker_z = map((v) > v ∈ ind, 1:size(x,2)),legend=:none)
DPP.esp
— Functionesp(L::AbstractLEnsemble,approx=false)
Compute the elementary symmetric polynomials of the Lensemble L, e₁(L) ... eₙ(L). e₁(L) is the trace and eₙ(L) is the determinant. The ESPs determine the distribution of the sample size of a DPP:
$p(X = k) = \frac{e_k}{\sum_{i=1}^n e_i}$
The default algorithm uses the Newton equations, but may be unstable numerically for n large. If approx=true, a stable saddlepoint approximation (as in Barthelmé et al. (2019)) is used instead for all eₖ with k>5.
DPP.gaussker
— Methodgaussker(X,σ)
Compute the Gaussian kernel matrix for points X and parameter σ, ie. a matrix with entry i,j equal to $\exp(\frac{(x_ix_j)^2}{2σ^2})$
If σ is missing, it is set using the median heuristic. If the number of points is very large, the median is estimated on a random subset.
x = randn(2,6)
gaussker(ColVecs(x),.1)
See also: rff, KernelMatrix:kernelmatrix
DPP.greedy_subset
— Methodgreedy_subset(L :: AbstractLEnsemble,k)
Perform greedy subset selection: find a subset $\mathcal{X}$ of size $k$, such that $\det{L}_{\mathcal{X}}$ is high, by building the set itembyitem and picking at each step the item that maximises the determinant. You can view this as finding an (approximate) mode of the DPP with Lensemble L.
If $k$ is too large relative to the (numerical) rank of L, the problem is not welldefined as so the algorithm will stop prematurely.
Example
X = randn(2,100) #10 points in dim 2
L = LowRankEnsemble(polyfeatures(X,2))
greedy_subset(L,6)
DPP.inclusion_prob
— Methodinclusion_prob(L::AbstractLEnsemble,k)
Firstorder inclusion probabilities in a kDPP with Lensemble L. Uses a (typically very accurate) saddlepoint approximation from Barthelmé, Amblard, Tremblay (2019).
DPP.inclusion_prob
— Methodinclusion_prob(L::AbstractLEnsemble)
Compute firstorder inclusion probabilities, i.e. the probability that each item in 1..n is included in the DPP.
See also: marginal_kernel
DPP.kl_divergence
— Methodkl_divergence(L1::AbstractLEnsemble,L2::AbstractLEnsemble,k::Int;nsamples=100)
Estimate the KL divergence between two (k)DPPs. The KL divergence is estimated by sampling from the kDPP with Lensemble L1 and computing the mean logratio of the probabilities. nsamples controls how many samples are taken.
DPP.marginal_kernel
— Method marginal_kernel(L::AbstractLEnsemble)
Compute and return the marginal kernel of a DPP, K. The marginal kernel of a DPP is a (n x n) matrix which can be used to find the inclusion probabilities. For any fixed set of indices ind, the probability that ind is included in a sample from the DPP equals det(K[ind,ind]).
DPP.nystrom_approx
— Methodnystrom_approx(x :: AbstractVector,ker :: Kernel,ind)
Compute a lowrank approximation of a kernel matrix (with kernel "ker") using the rows and columns indexed by "ind".
using KernelFunctions
x = rand(2,100)
K = kernelmatrix(SqExponentialKernel(),ColVecs(x))
#build a rank 30 approx. to K
V = nystrom_approx(ColVecs(x),SqExponentialKernel(),1:30)
norm(KV*V') #should be small
DPP.polyfeatures
— Methodpolyfeatures(x,order)
Compute monomial features up to a certain degree. For instance, if the input consists in vectors in ℝ², then the monomials of degree ≤ 2 are 1,x₁,x₂,x₁x₂,x₁²,x₂² Note that the number of monomials of degree r in dimension d equals ${ d+r \choose r}$
If the input points are in d > 1, indicate whether the input x is d * n or n * d using ColVecs or RowVecs.
The output has points along rows and monomials along columns.
Examples
x = randn(2,10) #10 points in dim 2
polyfeatures(ColVecs(x),2) #Output has 10 rows and 6 column
polyfeatures(RowVecs(x),2) #Output has 2 rows and 66 columns
DPP.rescale!
— Functionrescale!(L,k)
$\DeclareMathOperator{\Tr}{Tr}$
Rescale the Lensemble such that the expected number of samples equals k. The expected number of samples of a DPP equals $\Tr \mathbf{L}\left( \mathbf{L} + \mathbf{I} \right)$. The function rescales $\mathbf{L}$ to $\alpha \mathbf{L}$ such that $\Tr \alpha \mathbf{L}\left( \alpha \mathbf{L} + \mathbf{I} \right) = k$
DPP.rff
— Methodrff(X,m,σ)
Compute Random Fourier Features Ali Rahimi, Benjamin Recht (2007) for the Gaussian kernel matrix with input points X and parameter σ. Returns a random matrix M such that, in expectation $\mathbf{MM}^t = \mathbf{K}$, the Gaussian kernel matrix. M has 2*m columns. The higher m, the better the approximation.
Examples
x = randn(2,10) #10 points in dim 2
rff(ColVecs(x),4,1.0)
See also: gaussker, kernelmatrix
DPP.sample
— Methodsample(L::AbstractLEnsemble,k)
Sample a kDPP, i.e. a DPP with fixed size. k needs to be strictly smaller than the rank of L (if it equals the rank of L, use a ProjectionEnsemble).
The algorithm uses a saddlepoint approximation adapted from Barthelmé, Amblard, Tremblay (2019).
DPP.sample
— Method sample(L::AbstractEnsemble)
Sample from a DPP with Lensemble L. The return type is a BitSet (indicating which indices are sampled), use collect to get a vector of indices instead.
DPP.EllEnsemble
— TypeEllEnsemble{T}
This type represents an Lensemble, a (broad) special case of Determinantal Point Process.
The type parameter corresponds to the type of the entries in the matrix given as input (most likely, double precision floats).
DPP.EllEnsemble
— MethodEllEnsemble(X::Matrix{T},k :: Kernel)
Construct a fullrank ensemble from a set of points and a kernel function.
X is vector of column vectors (ColVecs) or a vector of row vectors (RowVecs) k is a kernel (see doc for package KernelFunctions.jl)
Example: points in 2d along the circle, and an exponential kernel
t = LinRange(pi,pi,10)'
X = vcat(cos.(t),sin.(t))
using KernelFunctions
L=EllEnsemble(ColVecs(X),ExponentialKernel())
DPP.EllEnsemble
— MethodEllEnsemble(L :: AbstractMatrix{T})
Construct an Lensemble from a (symmetric, nonnegative definite) matrix L.
Z = randn(5,2)
EllEnsemble(Z*Z') #not very useful, presumably
Note that eigen(L) will be called at construction, which may be computationally costly.
An LEnsemble can also be constructed based on lazy matrix types, i.e. types that leverage a lowrank representation. In the example above we could also use the LowRank type:
Z = randn(5,2)
EllEnsemble(LowRank(Z)) #more efficient than Z*Z'
DPP.LazyDist
— MethodLazyDist(x)
Lazy representation of a distance matrix, meaning that the object can be indexed like a matrix, but the values are computed onthefly. This is useful whenever an algorithm only requires some of the entries, or when the dataset is very large.
x = randn(2,100)
D= LazyDist(x) #Euclidean dist. by default
D[3,2] #the getindex method is defined
D = LazyDist(x,(a,b) > sum(abs.(ab))) #L1 distance
D[3,2]
DPP.MarginalDPP
— MethodMarginalDPP(V::Matrix{T})
Construct a DPP from a matrix defining the marginal kernel. Here the matrix must be square and its eigenvalues must be between 0 and 1.
DPP.ProjectionEnsemble
— MethodProjectionEnsemble(V::Matrix{T},orth=true)
Construct a projection ensemble from a matrix of features. Here we assume $\mathbf{L} = \mathbf{V}\mathbf{V}^t$, so that V must be n \times r, where n is the number of items and r is the rank. V needs not be orthogonal. If orth is set to true (default), then a QR decomposition is performed. If V is orthogonal already, then this computation may be skipped, and you can set orth to false.
References
 rahimi2007random

Ali Rahimi, Benjamin Recht, Random Features for LargeScale Kernel Machines., In NIPS, 5, 2007.
 vassilvitskii2006k

Sergei Vassilvitskii, David Arthur, kmeans++: The advantages of careful seeding, In Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms, 1027–1035, 2006.
 williams2001using

Christopher Williams, Matthias Seeger, Using the Nystr{\"o}m method to speed up kernel machines, In Proceedings of the 14th annual conference on neural information processing systems, 682–688, 2001.
 tremblay2021extended

Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, PierreOlivier Amblard, Extended Lensembles: a new representation for Determinantal Point Processes, arXiv preprint arXiv:2107.06345, 2021.