Title: Kernel hypothesis tests for large-scale data: streaming implementation and optimal kernel choice
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. A linear-time empirical estimate of the MMD is proposed, allowing us to compute the statistic in a streaming manner, without storing the data in memory. The test threshold is likewise computable in linear time, resulting in a hypothesis test suited to large-scale problems and to data streams.
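As an illustration of the streaming computation described above, the following is a minimal sketch and not the authors' implementation. It assumes scalar data, a Gaussian kernel with a fixed bandwidth, the standard linear-time construction that averages the terms h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1) over disjoint pairs of samples, and a Gaussian approximation for the threshold. The names `StreamingLinearMMD`, `gaussian_kernel` and `run_test` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel on scalars (kernel and bandwidth are assumptions)."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


class StreamingLinearMMD:
    """Running mean and variance of the h-terms, using O(1) memory."""

    def __init__(self, kernel=gaussian_kernel):
        self.kernel = kernel
        self.count = 0    # number of h-terms consumed so far
        self.mean = 0.0   # running mean of h: the linear-time MMD^2 estimate
        self.m2 = 0.0     # running sum of squared deviations (Welford's update)

    def update(self, x1, y1, x2, y2):
        """Consume two fresh samples from each stream; nothing is stored."""
        k = self.kernel
        h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
        self.count += 1
        delta = h - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (h - self.mean)

    def threshold(self, alpha=0.05):
        """Level-alpha threshold under an assumed Gaussian approximation."""
        variance = self.m2 / (self.count - 1)
        z = NormalDist().inv_cdf(1.0 - alpha)
        return z * math.sqrt(variance / self.count)

    def reject(self, alpha=0.05):
        """Reject the null hypothesis p = q if the statistic exceeds the threshold."""
        return self.mean > self.threshold(alpha)


def run_test(sample_p, sample_q, n_pairs=5000):
    """Feed n_pairs disjoint pairs from each sampler through the streaming test."""
    test = StreamingLinearMMD()
    for _ in range(n_pairs):
        test.update(sample_p(), sample_q(), sample_p(), sample_q())
    return test


if __name__ == "__main__":
    rng = random.Random(0)
    result = run_test(lambda: rng.gauss(0, 1), lambda: rng.gauss(1, 1))
    print(result.mean, result.threshold(), result.reject())
```

Both the statistic and the threshold are updated once per incoming pair, so the cost is linear in the number of samples and no past data needs to be retained.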
The kernel used in computing the MMD is critical to ensuring that the test has high power, that is, that it correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed, which maximises test power (the probability of correctly rejecting the null hypothesis when p and q differ).
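To make the idea of power-driven parameter selection concrete, the sketch below picks a Gaussian kernel bandwidth from a candidate list. The abstract does not state the selection criterion, so the one used here is an assumption: the ratio of the linear-time MMD estimate to the standard deviation of its h-terms, computed on held-out data. Under a Gaussian approximation, a larger ratio should correspond to higher power. The names `power_criterion` and `select_bandwidth` are hypothetical, and the data are assumed scalar.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel on scalars with the given bandwidth."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def power_criterion(quads, bandwidth, eps=1e-8):
    """Mean of the h-terms divided by their standard deviation.

    `quads` is a list of (x1, y1, x2, y2) tuples drawn from held-out data.
    The ratio is an assumed proxy for test power, not a quoted formula.
    """
    hs = []
    for x1, y1, x2, y2 in quads:
        hs.append(
            gaussian_kernel(x1, x2, bandwidth)
            + gaussian_kernel(y1, y2, bandwidth)
            - gaussian_kernel(x1, y2, bandwidth)
            - gaussian_kernel(x2, y1, bandwidth)
        )
    n = len(hs)
    mean = sum(hs) / n
    variance = sum((h - mean) ** 2 for h in hs) / (n - 1)
    # eps guards against division by zero when all h-terms are identical
    return mean / (math.sqrt(variance) + eps)


def select_bandwidth(quads, candidates):
    """Return the candidate bandwidth with the largest criterion value."""
    return max(candidates, key=lambda bw: power_criterion(quads, bw))


if __name__ == "__main__":
    rng = random.Random(1)
    held_out = [
        (rng.gauss(0, 1), rng.gauss(1, 1), rng.gauss(0, 1), rng.gauss(1, 1))
        for _ in range(2000)
    ]
    print(select_bandwidth(held_out, [0.01, 0.1, 1.0, 10.0]))
```

The bandwidth is chosen on data that is not reused for the final test, so that the selection step does not bias the test threshold. In a strictly streaming setting the list of tuples could be replaced by one pair of running statistics per candidate.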