Mathematics Data Science Seminar: Prof. Long Nguyen, Hierarchical clustering and mixture modeling fo
2:30 p.m. to 3:30 p.m.
Speaker: Long Nguyen, University of Michigan
Time: Wednesday, Feb 28, 2:30pm-3:30pm
Place: Nelson room
Title: Hierarchical clustering and mixture modeling for heterogeneous data
Abstract:
Agglomerative hierarchical clustering is a well-known method for exploratory
data analysis and visualization, but it has very little theoretical support.
Mixture modeling provides strong theoretical guarantees for learning
heterogeneous data populations, but it requires strong model assumptions
and can be brittle if the model is misspecified or only weakly identifiable.
This work provides a bridge to agglomerative hierarchical clustering
by following a mixture model-based approach. Starting from a finite mixture
model fitted to a heterogeneous data set with more components
than needed, a hierarchical clustering tree (also known as
the dendrogram) is constructed in a way analogous to an agglomerative
hierarchical clustering algorithm that sequentially merges clusters.
The merging rule is derived from an
optimal transport-based theory of convergence of the mixing measures,
where competing atoms that support the estimated mixing measures
are merged via a suitable projection under the L2 Wasserstein metric.
With this algorithm we can consistently select the true number of components
and obtain a pointwise optimal convergence rate for parameter estimation
from the hierarchical tree, even when the model parameters are only weakly identifiable.
In theory, it also explicates the choice of the optimal number of clusters
in hierarchical clustering. In practice, the dendrogram reveals more
information on the hierarchy of subpopulations compared to traditional
ways of summarizing mixture models. Illustrations on simulated data and
a single-cell RNA sequencing data set will be discussed. This work is joint
with Dat Do, Linh Do, Scott McKinley and Jonathan Terhorst.
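
The merging step described in the abstract can be illustrated with a minimal sketch. It is not the speaker's implementation and rests on several assumptions: each atom is a location parameter only, the overfitted mixing measure is given as a list of weights and atoms, and the pair to merge is the one that is cheapest to collapse onto its weighted mean under squared Euclidean transport cost. The names `merge_cost` and `dendrogram_of_atoms` are hypothetical.

```python
import numpy as np


def merge_cost(p_i, th_i, p_j, th_j):
    """Squared-Euclidean transport cost of collapsing two atoms onto
    their weighted mean (an assumed dissimilarity, not from the abstract)."""
    diff = th_i - th_j
    return p_i * p_j / (p_i + p_j) * float(diff @ diff)


def dendrogram_of_atoms(weights, atoms):
    """Sequentially merge atoms of a discrete mixing measure.

    Returns the list of mixing measures at each level (from the
    overfitted one down to a single atom) and the cost of each merge.
    """
    w = [float(x) for x in weights]
    a = [np.asarray(x, dtype=float) for x in atoms]
    levels = [(list(w), [x.copy() for x in a])]
    heights = []
    while len(w) > 1:
        # Find the pair of atoms that is cheapest to merge.
        best = None
        for i in range(len(w)):
            for j in range(i + 1, len(w)):
                c = merge_cost(w[i], a[i], w[j], a[j])
                if best is None or c < best[0]:
                    best = (c, i, j)
        c, i, j = best
        # The merged atom carries the combined mass at the weighted mean.
        new_w = w[i] + w[j]
        new_a = (w[i] * a[i] + w[j] * a[j]) / new_w
        del w[j], a[j]  # j > i, so index i is still valid
        w[i], a[i] = new_w, new_a
        heights.append(c)
        levels.append((list(w), [x.copy() for x in a]))
    return levels, heights


# Toy overfitted fit: four atoms where two well-separated components would do.
levels, heights = dendrogram_of_atoms(
    weights=[0.3, 0.2, 0.25, 0.25],
    atoms=[[0.0], [0.1], [5.0], [5.2]],
)
for (w, a), h in zip(levels[1:], heights):
    print(len(w), "atoms after a merge of cost", round(h, 4))
```

In this toy input the first two merges are cheap and the last one is expensive, which hints at two components. Reading the number of components off such a jump is only a heuristic here; the consistent selection rule and the convergence rates mentioned in the abstract are not reproduced by this sketch.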