[HN Gopher] The CAP theorem of Clustering: Why Every Algorithm M...
___________________________________________________________________
The CAP theorem of Clustering: Why Every Algorithm Must Sacrifice
Something
Author : fagnerbrack
Score : 197 points
Date : 2024-12-26 23:04 UTC (23 hours ago)
(HTM) web link (blog.codingconfessions.com)
(TXT) w3m dump (blog.codingconfessions.com)
| joe_the_user wrote:
| It's an interesting idea, but I don't see it as a matter of
| tradeoffs, since "richness" on its own already sounds undecidable.
| I mean, dividing a set into "things with low Kolmogorov
| complexity and things with high Kolmogorov complexity" is
| definitely undecidable, so "any grouping that might make sense
| for your data" seems impossible even without any other
| requirements.
| ironSkillet wrote:
| The "richness" definition also seemed hand wavey to me so I
| looked at the referenced paper. The actual definition of
| "richness" of an algorithm is that for any arbitrary partition
| P of your original data (singletons, one cluster, etc), there
| is a distance function on the data, which when used in the
| clustering algorithm, produces P.
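|
| A minimal sketch (my own toy construction, not the paper's) of
| why that can hold for a threshold-based, single-linkage-style
| rule: given any target partition, hand-build a distance that
| makes same-cluster pairs close and cross-cluster pairs far.
|
|   # Hypothetical illustration of "richness": for ANY target
|   # partition there is a distance d that the rule below recovers.
|   def distance_realizing(partition):
|       where = {x: i for i, block in enumerate(partition) for x in block}
|       return lambda x, y: 0.0 if x == y else (
|           1.0 if where[x] == where[y] else 10.0)
|
|   def cluster_by_threshold(points, d, cut=5.0):
|       blocks = []
|       for p in points:
|           home = next((b for b in blocks
|                        if any(d(p, q) < cut for q in b)), None)
|           if home is not None:
|               home.add(p)
|           else:
|               blocks.append({p})
|       return blocks
|
|   target = [{1, 2}, {3}, {4, 5}]
|   d = distance_realizing(target)
|   print(cluster_by_threshold([1, 2, 3, 4, 5], d))
|   # -> [{1, 2}, {3}, {4, 5}], i.e. the target partition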
| throw_pm23 wrote:
| A cute and thought-provoking theorem for sure, but arguably none
| of the three given criteria for clustering are well motivated, so
| the result is much less relevant than usually claimed.
|
| - scale-invariance: stretching data along some dimensions should
| not change clustering.
|
| This is clearly not true: . . . (three well-spaced spots) may
| reasonably be seen as three clusters, whereas ||| (three nearby
| elongated bars) may not.
|
| - richness: all groupings must be reachable.
|
| Also not quite true: the two cases (1) every point is its own
| singleton cluster and (2) a single cluster containing all points
| both mean the same thing, namely that no useful cluster structure
| was found. So it is enough if one of these groupings is
| reachable, not both.
|
| - consistency: increasing inter-cluster differences and
| decreasing intra-cluster differences should not change
| clustering.
|
| Also not quite true: suppose we have 9 singleton clusters laid
| out in a grid:
|
|   . . .
|   . . .
|   . . .
|
| Now move the points so that the columns get further apart (which
| only increases inter-cluster distances); at some point we will
| get | | |, where 3 clusters are more reasonable.
| yshklarov wrote:
| Actually, scale-invariance only refers to scaling _all
| dimensions by the same scalar_ (this is specified more clearly
| in the paper linked by the article, page 3). For arbitrary
| scaling of _each_ coordinate you're of course correct: it's
| impossible to have a clustering algorithm that is invariant
| under such transformations (e.g., the 6-point group ::: may look
| like either 2 or 3 clusters, depending on whether it's stretched
| horizontally or vertically).
|
| As for your last two points, I believe I agree! It seems that
| in the counterexample you give for consistency, some notion of
| scale-invariance is implicitly assumed -- perhaps this
| connection plays some role in the theorem's proof (which I
| haven't read).
|
| This reminds me a bit of Arrow's impossibility theorem for
| voting, which similarly has questionable premises.
| hnaccount_rng wrote:
| But in almost all cases that doesn't make any sense, does it?
| Typically the data in different dimensions will have different
| "units", so there isn't any meaning to the scale in the first
| place. How could scaling by a single scalar be "more natural"?
| bubblyworld wrote:
| This is the point, I think - there's no inherent meaning to the
| scaling factor(s) as far as overall structure is concerned
| (they're dimensionless, so the units issue isn't a problem), so
| the outcome of a clustering algorithm should not depend on them.
| amelius wrote:
| What if you first did PCA on your data?
| bubblyworld wrote:
| You tell me? I'm not sure I understand the question.
| amelius wrote:
| Basically scale and rotate your data such that the
| variation in all dimensions becomes equal, so to speak.
| bubblyworld wrote:
| Ah I see. As I understand it, a general linear map like that
| isn't what the linked paper means by "scale-invariance", so it
| wouldn't be considered a violation for a dataset and its PCA to
| be given different clusters by your clustering algorithm. It's
| only the dataset and its scaled-up or scaled-down counterparts
| (i.e. the metric is multiplied by a fixed non-zero constant)
| that are required to get the same clusters for scale-invariance
| to hold.
|
| In fact the paper doesn't assume that your dataset is
| contained in a vector space at all. All you have to give
| a clustering algorithm (as they define it) is a set and a
| metric function on it.
|
| (the paper if you don't have a link:
| https://www.cs.cornell.edu/home/kleinber/nips15.pdf)
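|
| A minimal sketch of what the property demands, using a made-up
| 1-D rule that cuts at gaps larger than half the largest pairwise
| distance: because the rule only depends on ratios of distances,
| f(d) == f(alpha * d) for any alpha > 0.
|
|   # Toy scale-invariant clustering on sortable 1-D points.
|   def cluster(points, d):
|       pts = sorted(points)
|       max_d = max(d(a, b) for a in pts for b in pts if a != b)
|       blocks, current = [], [pts[0]]
|       for prev, nxt in zip(pts, pts[1:]):
|           if d(prev, nxt) > 0.5 * max_d:   # relative, not absolute, cut
|               blocks.append(current)
|               current = [nxt]
|           else:
|               current.append(nxt)
|       blocks.append(current)
|       return blocks
|
|   pts = [0.0, 1.0, 10.0, 11.0]
|   d = lambda a, b: abs(a - b)
|   print(cluster(pts, d))                           # [[0.0, 1.0], [10.0, 11.0]]
|   print(cluster(pts, lambda a, b: 7.0 * d(a, b)))  # same partition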
| yshklarov wrote:
| If different components of the dataset have different units, I
| would argue that it is a prerequisite of clustering to first
| specify the relative importance of each unit (thereby putting
| all units on the same scale). Otherwise, there's no way the
| clustering algorithm could possibly know what to do in certain
| cases (such as the ::: example).
|
| It's true that there is no _intrinsic_ meaning to the scale, but
| you must specify at least a relative scale -- how you want to
| compare (or weigh) different units -- before you can
| meaningfully cluster the data. Clustering can only work on
| dimensionless data.
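|
| A small sketch of what that prerequisite might look like in
| practice (numpy, with made-up example features): either z-score
| the columns, which implicitly says "all units matter equally",
| or apply explicit per-feature weights chosen by the analyst.
|
|   import numpy as np
|
|   X = np.array([[180.0, 70.0],   # e.g. height in cm, weight in kg
|                 [175.0, 90.0],
|                 [160.0, 55.0]])
|
|   # Option 1: z-score -> dimensionless, equal-importance features.
|   Z = (X - X.mean(axis=0)) / X.std(axis=0)
|
|   # Option 2: explicit relative scale chosen by the analyst.
|   weights = np.array([1.0, 2.0])
|   W = X * weights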
| nextaccountic wrote:
| Scale invariance is well motivated in the sense that perhaps
| you already normalized your data (usually to fit [-1, 1] or
| something) and you would be bummed to discover it fucked up
| your clusters
|
| Or likewise: if you have physical data recorded in some units
| (say, meters), it would suck if the clusters changed if you had
| measured stuff in another unit instead
| jpcom wrote:
| Sounds like Richness is an index in SQL terms. Would you agree?
| keithalewis wrote:
| Let S = {1, 2}. Every distance function is determined by d(1,2) =
| a, a >= 0. Define f(d) = {{1,2}} if a = 0 and f(d) = {{1},{2}}
| otherwise. Isn't this a clustering algorithm that is scale
| invariant, rich, and consistent?
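|
| For concreteness, a throwaway sketch of the proposed f (just my
| transcription of the definition above):
|
|   # f(d) = {{1,2}} if d(1,2) == 0, else {{1},{2}}
|   def f(a):                        # a = d(1, 2)
|       return [{1, 2}] if a == 0 else [{1}, {2}]
|
|   print(f(0.0))                    # [{1, 2}]
|   print(f(3.0))                    # [{1}, {2}]
|   print(f(3.0) == f(7.0 * 3.0))    # True: positive rescaling changes nothing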
| bubblyworld wrote:
| Looks that way to me, yeah, though this is obviously a super
| simple case. It's clearly scale invariant, and there are only
| two partitions, both of which your algorithm hits, so it's rich.
| Consistency is trivially satisfied in both cases too.
| n0bra1n wrote:
| I think I found the issue: the paper says the distance function
| is 0 iff the elements are equal, so for this example you cannot
| define d(1,2) to be 0. So it is not rich, since that is the only
| way to get the partition {{1,2}}.
| bubblyworld wrote:
| Oh, I see, it's not a true metric. That's fair enough,
| though I wonder if the result depends critically on that
| assumption.
|
| (you can pass to equivalence classes to recover a true
| metric, and I didn't see anything obviously incompatible
| with that in the paper, but I admit I didn't look very
| deeply)
| piker wrote:
| Interesting that the author neglects to mention the Gap
| statistic [1] as a stopping criterion. It's a bit of a mouthful,
| but it does provide a metric for estimating how many clusters
| the data indicate are present, as compared to a random
| distribution in the same dimensions.
|
| [1] https://academic.oup.com/jrsssb/article-
| abstract/63/2/411/70...
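|
| A rough sketch of the idea, assuming scikit-learn's KMeans,
| inertia_ as the within-cluster dispersion, and uniform reference
| data drawn from the feature bounding box (the paper additionally
| uses the standard error of the reference dispersions to pick k):
|
|   import numpy as np
|   from sklearn.cluster import KMeans
|
|   def gap_statistic(X, k_max=10, n_refs=10, seed=0):
|       rng = np.random.default_rng(seed)
|       lo, hi = X.min(axis=0), X.max(axis=0)
|       gaps = []
|       for k in range(1, k_max + 1):
|           km = KMeans(n_clusters=k, n_init=10, random_state=seed)
|           log_wk = np.log(km.fit(X).inertia_)
|           ref = [np.log(KMeans(n_clusters=k, n_init=10,
|                                random_state=seed)
|                         .fit(rng.uniform(lo, hi, size=X.shape))
|                         .inertia_)
|                  for _ in range(n_refs)]
|           gaps.append(np.mean(ref) - log_wk)
|       # Simplified: return the k with the largest gap; the original
|       # paper picks the smallest k within one SE of the next gap.
|       return int(np.argmax(gaps)) + 1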
| Xcelerate wrote:
| I've never found a use case for clustering where, if the goal is
| accurate prediction, some other technique doesn't work better.
|
| Now if the goal is a quick prototype or to get an intuitive sense
| of the structure of the data, then sure, it's fine.
|
| But of course you're always sacrificing something desirable when
| you try to shoehorn data into a model that doesn't fit.
| zelphirkalt wrote:
| I mean, clustering is for actually finding the structure of the
| data ... it is a collection of structure-discovery methods. If
| one doesn't know the probable structure, then of course one
| starts out with little knowledge.
| monkeyjoe wrote:
| I think there is hidden gold in the linked research paper. In
| the second-to-last paragraph it says that if you are willing to
| discard the trivial partition (each point in its own cluster)
| from Richness, then you *can* have Scale-Invariance and
| Refinement-Consistency as well. To me this suggests an optimal
| (in some sense) class of algorithms, perhaps to be chosen from
| based on computational complexity, cross-validation, etc.
___________________________________________________________________
(page generated 2024-12-27 23:01 UTC)