Scalable Nyström-Based Kernel Two-Sample Testing with Permutations
Kernel two-sample tests, such as the Maximum Mean Discrepancy (MMD), are powerful tools for detecting distributional differences between samples in high dimensions. Yet, a naive implementation requires computing and storing large Gram matrices, which becomes prohibitive as data sizes grow. The Nyström method offers a principled way to build a compact, low-rank surrogate of the kernel matrix, enabling scalable testing without sacrificing too much statistical power.
Foundations: why kernel two-sample testing matters
At its core, the kernel two-sample test assesses whether two samples X and Y come from the same underlying distribution. The MMD statistic measures differences in mean embeddings in a reproducing kernel Hilbert space (RKHS). When the null hypothesis is true, a permutation procedure provides a user-friendly way to approximate the null distribution of the statistic and compute p-values. This approach is model-free, makes few distributional assumptions, and adapts to complex, nonlinear differences that may be invisible to classical tests.
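To make the quantity concrete, here is a minimal sketch of the exact (unbiased) squared-MMD estimate with a Gaussian kernel, computed on small samples. The function names (`rbf_kernel`, `mmd2_unbiased`) and the NumPy-based implementation are illustrative, not part of any particular library:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    # Pairwise squared Euclidean distances, then Gaussian kernel values.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth):
    # Unbiased estimate of squared MMD: exclude diagonal terms of K_XX, K_YY.
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

This quadratic-time version is exactly what becomes prohibitive at scale: it forms three full Gram blocks, which is the cost the Nyström surrogate below avoids.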
The challenge isn’t just statistical accuracy; it’s achieving it at scale. Nyström approximations let us keep the power of kernel tests while handling millions of observations.
Nyström approximation: how it speeds things up
The Nyström method constructs a low-rank surrogate of the full kernel matrix by selecting a subset of m landmark points from the data. Let X be the full sample with n points and S be the selected landmarks (|S| = m). Compute:
- C = K_{X,S}, the cross-kernel between all points and the landmarks (an n × m matrix),
- W = K_{S,S}, the kernel among the landmarks (an m × m matrix).
Then the kernel matrix K ≈ K̃ = C W^{-1} C^T serves as a low-rank approximation (in practice W^{-1} is replaced by the pseudoinverse W^{+}, since W can be ill-conditioned). This surrogate yields an approximate feature map in a finite-dimensional space, derived from the eigendecomposition of W. With it, the infinite-dimensional RKHS operations reduce to O(n m) kernel evaluations rather than O(n^2), a substantial saving for large n. The result is a scalable path to estimating MMD without ever forming the full Gram matrix.
Permutation testing with Nyström
Permutation testing remains a natural way to obtain p-values under the null hypothesis of equal distributions. With the Nyström surrogate in hand, you:
- Compute the approximate MMD statistic on the original labeling using the Nyström-derived features,
- Pool the two samples, randomly shuffle the labels, split them back into two groups, and recompute the Nyström-based statistic for each permutation (the features need only be computed once on the pooled sample; each permutation just reshuffles the labels),
- Repeat B times to build the empirical null distribution,
- Calculate the p-value as the proportion of permuted statistics at least as large as the observed statistic; the standard (count + 1)/(B + 1) correction keeps the test valid at finite B.
The accuracy of the permutation test hinges on the stability of the Nyström approximation. In practice, a modest number of landmarks often yields robust results, and the permutation procedure helps absorb any residual approximation bias by anchoring the null distribution directly to the data.
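The procedure above can be sketched end to end. This is a self-contained NumPy illustration, not a reference implementation; the function names and defaults (`m=100`, `B=200`) are assumptions for the example. Note the key efficiency: the Nyström features are computed once on the pooled sample, so each permutation only reshuffles labels:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * bandwidth**2))

def nystrom_features(X, landmarks, bandwidth, eps=1e-10):
    C = rbf_kernel(X, landmarks, bandwidth)
    W = rbf_kernel(landmarks, landmarks, bandwidth)
    vals, vecs = np.linalg.eigh(W)
    keep = vals > eps
    return C @ vecs[:, keep] / np.sqrt(vals[keep])

def mmd2_from_features(Phi, labels):
    # Squared distance between mean embeddings in the Nyström feature space.
    mu_x = Phi[labels].mean(axis=0)
    mu_y = Phi[~labels].mean(axis=0)
    return np.sum((mu_x - mu_y) ** 2)

def nystrom_permutation_test(X, Y, m=100, B=200, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    # Landmarks drawn uniformly from the pooled sample.
    landmarks = Z[rng.choice(len(Z), size=min(m, len(Z)), replace=False)]
    Phi = nystrom_features(Z, landmarks, bandwidth)  # computed once, reused
    labels = np.zeros(len(Z), dtype=bool)
    labels[:len(X)] = True
    observed = mmd2_from_features(Phi, labels)
    count = 0
    for _ in range(B):
        perm = rng.permutation(labels)  # shuffle labels, preserve group sizes
        if mmd2_from_features(Phi, perm) >= observed:
            count += 1
    return (count + 1) / (B + 1)  # p-value with the +1 correction
```

Because permutations reuse the precomputed features, B permutations cost O(B n r) for rank r, rather than B full kernel recomputations.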
Practical guidelines for effective deployment
- Choose the landmark count m thoughtfully. A larger m improves the fidelity of the kernel surrogate but increases cost. Start with a small fraction of n (for example, m in the hundreds or low thousands) and monitor the change in the test statistic as m grows.
- Bandwidth selection matters. For Gaussian (RBF) kernels, the median heuristic is a common starting point, but consider cross-validation or domain-informed choices when the data have varying scales.
- Balance efficiency and accuracy. If the two samples differ greatly in size, consider stratified permutation or weighting schemes to avoid inflated type-I error from imbalanced labels.
- Seed reproducibility. Fix random seeds for landmark selection and permutations to ensure consistent results across runs.
- Diagnostics. Compare the Nyström-based statistic against a smaller, exact computation on a subsample to gauge approximation quality.
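For the bandwidth point above, the median heuristic can be sketched as follows; the subsampling cap (`max_points`) is an assumption added here so the heuristic itself stays cheap on large pooled samples:

```python
import numpy as np

def median_heuristic_bandwidth(X, Y, max_points=1000, seed=0):
    # Median of pairwise Euclidean distances on the pooled sample,
    # subsampled so the heuristic stays cheap on large data.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    if len(Z) > max_points:
        Z = Z[rng.choice(len(Z), size=max_points, replace=False)]
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    d = np.sqrt(np.maximum(d2, 0.0))
    # Median over distinct pairs only (exclude the zero diagonal).
    return np.median(d[np.triu_indices(len(Z), k=1)])
```

Treat the result as a starting point to scan around (e.g., powers of two times the heuristic value), not as an optimal choice.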
Extensions and caveats
The Nyström-based approach is compatible with a variety of kernels beyond the standard Gaussian family, including specialized kernels for structured data. When the eigenvalue spectrum of the kernel matrix decays slowly, you may need a larger m to preserve power. Be mindful of potential biases introduced by subsampling landmarks; the permutation framework helps mitigate this, but explicit bias analysis can provide further reassurance in critical applications.
For extremely large-scale problems, consider a two-stage strategy: (1) use Nyström with a moderate m to obtain an initial test, and (2) zoom in on regions or subsets of data where differences appear strongest, applying the same methodology to focused comparisons. This preserves scalability while maintaining interpretability of findings.
Putting it all together
Scalable Nyström-based kernel two-sample testing with permutations marries the power of kernel methods with a practical pathway to handle datasets that were previously out of reach. By trading a controlled amount of approximation for substantial gains in speed, you retain robust detection of distributional differences across high-dimensional spaces while keeping computation within practical bounds. When your goal is to compare large populations, monitor shifts with confidence, and do so efficiently, this approach offers a compelling balance of rigor and scalability.