Rank-based criteria measuring the high-dimensional neighborhood preservation in the low-dimensional embedding from Lee et al. (2009, 2010). These criteria are used in the experiments reported in de Bodt et al. (2020).
A SingleCellExperiment
object.
character()
with the reducedDims
low dimension
embeddings to be assessed. Default is to use all available
with reducedDimNames(object)
.
numeric(1)
defining the maximum number of nearest
neighbours to compute the quality metrics Rx
for. Default is
ceiling(ncol(object)/2)
, i.e. neighborhood sizes of 1 up to
half the number of cells. Setting Kup
to NA will compute the
Rx
metric for all values (i.e. 1 to N-2). This will however
be at a considerable cost in computation time.
An optional BiocParallelParam
instance from the
package BiocParallel
, determining the parallel back-end to
be used during evaluation. Default is
BiocParallel::bpparam()
.
A data.frame
, as produced by drQuality()
.
Specification of colours to be used for plotting. By
default, numeric values from to ncol(x)
are used, which is
however limited to 9 different colours. If more a needed,
specify a broader patette of colours with, for example,
colorspace::diverging_hcl(ncol(x))
.
Additional arguments passed to graphics::matplot()
.
A data.frame
containing the Rx
values for the low
dimensional embeddings defined by dimred
(along the
columns). The number of rows will be be Kup
(is set) or
ncol(object)
if Kup
was NA
(see details). The areas
under the respective curves (AUC) are stored as an attribute,
named "AUC"
.
The drQuality()
function computes the dimensionality reduction
quality assessment criteria \(R_{NX}(K)\) (Rx
) and area under
the curve (AUC
), as defined in Lee et al. (2009, 2010, 2103) and
Lee and Verleysen (2009, 2010) and as used in the experiments
reported in de Bodt et al. (2020).
These criteria measure the neighborhood preservation around the data points from the high-dimensional space to the low-dimensional space.
The plotDrQuality()
function can be used to visualise the
results of the quality assessment produced by drQuality()
.
Based on the high-dimensional and low-dimensional Euclidean distances, the sets \(v_i^K\) (resp. \(n_i^K\)) of the K nearest neighbours of data point i in the high-dimensional space (resp. low-dimensional space) can first be computed.
Their average normalized agreement develops as
$$Q_{NX}(K) = \frac{1}{N} \times \sum_{i=1}^{N} \frac{|v_i^K \cap n_i^K|}{K}$$
where N refers to the number of data points and \(\cap\) to the set intersection operator. \(Q_{NX}(K)\) ranges between 0 and 1; the closer to 1, the better.
As the expectation of \(Q_{NX}(K)\) with random low-dimensional coordinates is equal to \(\frac{K}{N-1}\), which is increasing with K
$$R_{NX}(K) = \frac{(N-1) \times Q_{NX}(K)-K}{N-1-K}$$
enables to more easily compare different neighbourhood sizes K. \(R_{NX}(K)\) ranges between -1 and 1, but a negative value indicates that the embedding performs worse than random. Therefore, \(R_{NX}(K)\) typically lies between 0 and 1. The \(R_{NX}(K)\) values of K ranging from 1 to N-2 can be displayed as a curve with a log scale for K, as closer neighbours typically prevail.
The area under the resulting curve (AUC) is a scalar score that grows with dimensionality reduction quality, quantified at all scales with an emphasis on small ones. The AUC lies between -1 and 1, but a negative value implies performances which are worse than random.
Given a dataset with N cells, the function has \(O(N^2 \times log(N))\)
time complexity. The Kup
parameter sets the maximum neighborhood
size when computing the quality criteria, that is computed only
for the neighborhood sizes of K equal to 1 up to Kup, as opposed
to all possible neighborhood sizes (for Kup
set to NA
). This
implementation reduces the time complexity to \(O(N \times Kup
\times log(N))\). Setting Kup
can hence be
run using much larger dataset, provided that Kup
is small
compared to the total number of cells N.
Note however that when using Kup
, in particlar a small one, one
does not quantify the quality of the low-dimension embedding in
terms of global structure preservation (larger neighborhood
sizes), and instead focuses on local aspects (smaller
neighborhoods), which favors local DR methods over global ones and
misses, parly at least) the advantage of multi-scale
approaches. Experiment in de Bodt et al. (2020) indicate that
multi-scale methods are superior to single-scale ones, on both
moderate-sized databases (N < 10e4; DR quality is quantified using
K = 1, 2, ... N-2) and large-scale data sets (N >> 10e; DR quality is
quantified using K = 1, 2, ... Kup).
Lee, J. A., & Verleysen, M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9), 1431-1443.
Lee, J. A., & Verleysen, M. (2010). Scale-independent quality criteria for dimensionality reduction. Pattern Recognition Letters, 31(14), 2248-2257.
C. de Bodt, D. Mulders, M. Verleysen and J. A. Lee, "Fast Multiscale Neighbor Embedding," in IEEE Transactions on Neural Networks and Learning Systems, 2020, doi: 10.1109/TNNLS.2020.3042807.
Lee, J. A., & Verleysen, M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9), 1431-1443.
Lee, J. A., & Verleysen, M. (2010). Scale-independent quality criteria for dimensionality reduction. Pattern Recognition Letters, 31(14), 2248-2257.
Lee, J. A., Renard, E., Bernard, G., Dupont, P., & Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback–Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112, 92-108.
runFMSSNE()
, runFMSSNE()
, runMSTSNE()
and runMSSNE()
for
the functions that perform the multi-scale stochastic neighbour
embeddings.