15:00 – 15:20
A Framework for Parallelizing Hierarchical Clustering Methods (149)
Silvio Lattanzi (Google Zürich), Thomas Lavastida (Carnegie Mellon University), Kefu Lu (Washington and Lee University), Benjamin Moseley (Carnegie Mellon University)
Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center, as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst-case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
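The top-down divisive strategy the abstract builds on can be sketched in a few lines. Below is a minimal *sequential* bisecting k-means (not the paper's distributed algorithm): a toy Lloyd's-style 2-means split, with a sum-of-squared-errors rule for choosing which cluster to bisect next. All function names are illustrative.

```python
import random

def kmeans2(points, iters=20, seed=0):
    """Plain Lloyd's algorithm with k=2 on a list of equal-length tuples.
    Assumes the cluster being split contains at least two points."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Recompute each center as its group's mean; keep the old center
        # if a group happens to be empty.
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return groups

def sse(points):
    """Sum of squared errors of a cluster around its mean."""
    if not points:
        return 0.0
    mean = tuple(sum(v) / len(points) for v in zip(*points))
    return sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in points)

def bisecting_kmeans(points, num_leaves):
    """Top-down divisive clustering: repeatedly bisect the cluster
    with the largest SSE until num_leaves clusters remain."""
    clusters = [points]
    while len(clusters) < num_leaves:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        left, right = kmeans2(worst)
        clusters.extend([left, right])
    return clusters
```

Each bisection step is independent of the others, which is what makes the method a natural candidate for the distributed treatment the paper develops.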

15:20 – 15:40
Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations (327)
Dmitry Kobak (University of Tübingen), George Linderman (Yale University), Stefan Steinerberger (Yale University), Yuval Kluger (Yale University; Yale School of Medicine), Philipp Berens (University of Tübingen)
T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the 'crowding problem' of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom ν, with ν→∞ corresponding to SNE and ν=1 corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that ν<1 can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.
Reproducible Research
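The kernel family in question can be illustrated with the standard Student-t density shape (the authors' exact parameterisation may differ; this is a sketch of the idea, not the paper's implementation). With ν = 1 it reduces to the Cauchy kernel of standard t-SNE, and as ν→∞ it approaches the Gaussian kernel of SNE; smaller ν means heavier tails, i.e. larger similarities at long range.

```python
import math

def kernel(d, nu):
    """Unnormalised t-distribution similarity kernel with nu degrees of
    freedom, as a function of low-dimensional distance d.
    nu = 1   -> Cauchy kernel (standard t-SNE): 1 / (1 + d^2)
    nu -> oo -> Gaussian kernel (SNE):          exp(-d^2 / 2)"""
    if math.isinf(nu):
        return math.exp(-d * d / 2.0)
    return (1.0 + d * d / nu) ** (-(nu + 1) / 2.0)
```

At long range the ordering kernel(d, 0.5) > kernel(d, 1) > kernel(d, ∞) holds, which is the "heavier tails spread clusters further apart" effect the abstract describes.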

14:00 – 14:20
Holistic Assessment of Structure Discovery Capabilities of Clustering Algorithms (678)
Frank Höppner (Ostfalia University of Applied Sciences), Maximilian Jahnke (Ostfalia University of Applied Sciences)
Existing cluster validity indices often possess a similar bias as the clustering algorithm they were introduced for, e.g. to determine the optimal number of clusters. We suggest an efficient and holistic assessment of the structure discovery capabilities of clustering algorithms based on three criteria. We determine the robustness or stability of cluster assignments and interpret it as the confidence of the clustering algorithm in its result. This information is then used to label the data and evaluate the consistency of the stability assessment with the notion of a cluster as an area of dense and separated data. The resulting criteria of stability, structure and consistency provide interpretable means to judge the capabilities of clustering algorithms without the typical biases of prominent indices, including the judgment of a clustering tendency.
Reproducible Research

14:20 – 14:40
k is the Magic Number – Inferring the Number of Clusters Through Nonparametric Concentration Inequalities (811)
Sibylle Hess (Technische Universiteit Eindhoven), Wouter Duivesteijn (Technische Universiteit Eindhoven)
Most convex and non-convex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for non-convex clustering and demonstrate the suitability of our theoretical results. Our SpecialK clustering algorithm automatically determines the appropriate value for k, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it is capable of deciding that the data consists of only a single cluster, which many existing algorithms cannot.
Reproducible Research
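The paper's specific concentration bound is not reproduced here, but the shape of the idea — bounding, from the sample mean and variance alone, the probability that two clusters jointly stem from one distribution — can be illustrated with a much weaker Chebyshev-style check on one-dimensional data. The function name and the i.i.d. simplification are illustrative assumptions, not the authors' method.

```python
def chebyshev_split_bound(a, b):
    """Chebyshev-style upper bound on the probability that cluster `a`'s
    sample mean deviates this far from the mean of the unified cluster
    a + b, if all points were i.i.d. draws from a single distribution
    with that mean and variance. A small bound suggests the two clusters
    are genuinely distinct. Crude illustration only, not the paper's bound.
    """
    u = a + b
    n = len(u)
    mu = sum(u) / n
    var = sum((x - mu) ** 2 for x in u) / n
    na, mean_a = len(a), sum(a) / len(a)
    t = abs(mean_a - mu)
    if t == 0.0:
        return 1.0  # no deviation at all: perfectly consistent with one cluster
    # Var(mean of na i.i.d. draws) = var / na, so Chebyshev's inequality
    # gives P(|mean_a - mu| >= t) <= var / (na * t^2).
    return min(1.0, var / (na * t * t))
```

A wrapper in the spirit of the abstract would run the base clustering for increasing k and accept each additional split only while such a bound stays below a chosen significance level, otherwise keeping the unified cluster.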

14:40 – 15:00
Uncovering Hidden Block Structure for Clustering (337)
Luce le Gorrec (Université de Toulouse), Sandrine Mouysset (Université de Toulouse), Philip A. Knight (STFC Rutherford Appleton Laboratory; CERFACS, Toulouse), Iain S. Duff (University of Strathclyde), Daniel Ruiz (Université de Toulouse)
We present a multistage procedure to cluster directed and undirected weighted graphs by finding the block structure of their adjacency matrices. A central part of the process is to scale the adjacency matrix into a doubly stochastic form, which permits detection of the whole matrix block structure with minimal spectral information (theoretically a single pair of singular vectors suffices). We present the different stages of our method, namely the impact of the doubly stochastic scaling on singular vectors, detection of the block structure by means of these vectors, and details such as cluster refinement and a stopping criterion. Then we test the algorithm's effectiveness by using it on two unsupervised classification tasks: community detection in networks and shape detection in clouds of points in two dimensions. By comparing results of our approach with those of widely used algorithms designed for specific purposes, we observe that our method is competitive (for community detection) if not superior (for shape detection) in comparison with existing methods.
Reproducible Research
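The doubly stochastic scaling at the heart of this pipeline is classically computed with the Sinkhorn–Knopp iteration: alternately rescale rows and columns until both sum to one. A minimal sketch follows (the paper's subsequent singular-vector and refinement stages are not shown, and convergence assumes every row and column has a nonzero entry):

```python
def sinkhorn(a, iters=200):
    """Sinkhorn-Knopp iteration: alternately normalise the rows and the
    columns of a square nonnegative matrix (list of lists) until it is
    approximately doubly stochastic. Assumes no all-zero row or column."""
    m = [row[:] for row in a]  # work on a copy
    n = len(m)
    for _ in range(iters):
        for i in range(n):  # rows sum to 1
            s = sum(m[i])
            m[i] = [x / s for x in m[i]]
        for j in range(n):  # columns sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m
```

On a matrix with hidden block structure, the scaled matrix concentrates its mass inside the blocks, which is what lets a single pair of singular vectors reveal them.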

15:40 – 16:00
Noise-free Latent Block Model for High Dimensional Data (J25)
Charlotte Laclau, Vincent Brault
