[HN Gopher] The CAP theorem of Clustering: Why Every Algorithm M...
       ___________________________________________________________________
        
       The CAP theorem of Clustering: Why Every Algorithm Must Sacrifice
       Something
        
       Author : fagnerbrack
       Score  : 197 points
       Date   : 2024-12-26 23:04 UTC (23 hours ago)
        
 (HTM) web link (blog.codingconfessions.com)
 (TXT) w3m dump (blog.codingconfessions.com)
        
       | joe_the_user wrote:
        | It's an interesting idea, but I don't see it as a matter of
        | tradeoffs, since "richness" alone sounds undecidable by itself.
        | I mean, dividing a set into "things having low Kolmogorov
        | complexity and things having high Kolmogorov complexity" is
        | definitely undecidable, so "any grouping that might make sense
        | for your data" seems impossible without any other requirements.
        
         | ironSkillet wrote:
         | The "richness" definition also seemed hand wavey to me so I
         | looked at the referenced paper. The actual definition of
         | "richness" of an algorithm is that for any arbitrary partition
         | P of your original data (singletons, one cluster, etc), there
         | is a distance function on the data, which when used in the
         | clustering algorithm, produces P.
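          | 
          | A toy sketch of how I read that (my own code, assuming SciPy
          | is available): for any target partition P, make distances
          | small inside a block of P and large across blocks, and a
          | fixed-threshold single-linkage clustering returns exactly P.
          | 
          |   import numpy as np
          |   from scipy.spatial.distance import squareform
          |   from scipy.cluster.hierarchy import linkage, fcluster
          |   
          |   def cluster_below_one(D):
          |       # group points connected by distances under 1
          |       cond = squareform(D, checks=False)
          |       return fcluster(linkage(cond, "single"), t=1,
          |                       criterion="distance")
          |   
          |   P = [0, 0, 1, 1, 1]   # target partition {0,1}, {2,3,4}
          |   n = len(P)
          |   # 0.5 inside a block of P, 2.0 across blocks (a valid metric)
          |   D = np.array([[0.0 if i == j else
          |                  (0.5 if P[i] == P[j] else 2.0)
          |                  for j in range(n)] for i in range(n)])
          |   print(cluster_below_one(D))   # recovers P, e.g. [1 1 2 2 2]
          | 
          | (A fixed cut like this buys richness but gives up scale-
          | invariance, which is the kind of trade the theorem is about.)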
        
       | throw_pm23 wrote:
       | A cute and thought-provoking theorem for sure, but arguably none
       | of the three given criteria for clustering are well motivated, so
       | the result is much less relevant than usually claimed.
       | 
       | - scale-invariance: stretching data along some dimensions should
       | not change clustering.
       | 
        | This is clearly not true: . . . (three well-spaced spots) may
        | reasonably be seen as three clusters, whereas ||| (three nearby
        | elongated bars) may not.
       | 
       | - richness: all groupings must be reachable.
       | 
        | Also not quite true: the two cases (1) all clusters are
        | singleton points and (2) a single cluster that contains all
        | points mean the same thing: no useful cluster structure was
        | found. So it is enough if one of these groupings is reachable,
        | not both.
       | 
       | - consistency: increasing inter-cluster differences and
       | decreasing intra-cluster differences should not change
       | clustering.
       | 
        | Also not quite true: suppose we have 9 clusters (a 3x3 grid of
        | points, each its own cluster):
        | 
        |   . . .
        |   . . .
        |   . . .
        | 
        | Now move the points so that the columns get further apart; at
        | some point we will get | | |, where 3 clusters are more
        | reasonable.
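        | 
        | A small numeric version of this (my own sketch, assuming
        | NumPy/SciPy): spreading the columns apart never shrinks any
        | pairwise distance, so it is an allowed "consistency" move when
        | each point is its own cluster, yet on the stretched data the
        | three columns are clearly the natural grouping.
        | 
        |   import numpy as np
        |   from scipy.spatial.distance import pdist
        |   from scipy.cluster.hierarchy import linkage, fcluster
        |   
        |   def cols(xs):
        |       # three points stacked vertically above each x
        |       return np.array([(x, y) for x in xs
        |                        for y in (0, 1, 2)], float)
        |   
        |   grid, stretched = cols((0, 1, 2)), cols((0, 10, 20))
        |   
        |   # no pairwise distance shrinks, and singletons have no
        |   # intra-cluster distances to worry about
        |   assert np.all(pdist(stretched) >= pdist(grid) - 1e-9)
        |   
        |   # single-linkage cut at 3: one blob before, 3 columns after
        |   for name, pts in [("grid", grid), ("stretched", stretched)]:
        |       cut = fcluster(linkage(pts, "single"), t=3,
        |                      criterion="distance")
        |       print(name, cut.max(), "clusters")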
        
         | yshklarov wrote:
          | Actually, scale-invariance only refers to scaling _all
          | dimensions by the same scalar_ (this is more clearly specified
          | in the paper linked by the article, page 3). For arbitrary
          | scaling on _each_ coordinate, of course you're correct: it's
          | impossible to have a clustering algorithm that is invariant
          | under such transformations (e.g., the 6-point group ::: may
          | look like either 2 or 3 clusters, depending on whether it's
          | stretched horizontally or vertically).
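          | 
          | A throwaway sketch of that ::: example (mine, assuming
          | SciPy): the same six points, stretched along one axis or the
          | other, fall into 3 column clusters or 2 row clusters under
          | the same single-linkage cut.
          | 
          |   import numpy as np
          |   from scipy.cluster.hierarchy import linkage, fcluster
          |   
          |   pts = np.array([(x, y) for x in (0, 1, 2)
          |                   for y in (0, 1)], float)
          |   wide = pts * [10, 1]   # stretch x -> 3 column clusters
          |   tall = pts * [1, 10]   # stretch y -> 2 row clusters
          |   
          |   for name, P in [("wide", wide), ("tall", tall)]:
          |       k = fcluster(linkage(P, "single"), t=3,
          |                    criterion="distance").max()
          |       print(name, k)     # wide 3, tall 2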
         | 
         | As for your last two points, I believe I agree! It seems that
         | in the counterexample you give for consistency, some notion of
         | scale-invariance is implicitly assumed -- perhaps this
         | connection plays some role in the theorem's proof (which I
         | haven't read).
         | 
         | This reminds me a bit of Arrow's impossibility theorem for
         | voting, which similarly has questionable premises.
        
           | hnaccount_rng wrote:
           | But in almost all cases that doesn't make any sense?
           | Typically the data in different dimensions will have
           | different "units". So there isn't any meaning in the scale in
           | the first place. How could scaling by a single scalar be
           | "more natural"?
        
             | bubblyworld wrote:
             | This is the point I think - there's no inherent meaning to
             | the scaling factor(s) as far as overall structure is
             | concerned (they're dimensionless, so the units thing isn't
             | a problem), so the outcome of a clustering algorithm should
              | not depend on them.
        
               | amelius wrote:
               | What if you first did PCA on your data?
        
               | bubblyworld wrote:
               | You tell me? I'm not sure I understand the question.
        
               | amelius wrote:
               | Basically scale and rotate your data such that the
               | variation in all dimensions becomes equal, so to speak.
        
               | bubblyworld wrote:
               | Ah I see. As I understand it a general linear map like
               | that isn't what the linked paper means by "scale-
               | invariance", so it wouldn't be considered a violation for
                | a dataset and its PCA to be given different clusters by
               | your clustering algorithm. It's only the dataset and its
               | scaled up or down counterparts (i.e. the metric is
               | multiplied by a fixed non-zero constant) that are
               | required to get the same clusters for scale-invariance to
               | hold.
               | 
               | In fact the paper doesn't assume that your dataset is
               | contained in a vector space at all. All you have to give
               | a clustering algorithm (as they define it) is a set and a
               | metric function on it.
               | 
               | (the paper if you don't have a link:
               | https://www.cs.cornell.edu/home/kleinber/nips15.pdf)
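                | 
                | A tiny sketch of that reading (mine, assuming SciPy):
                | a clustering that only sees the distance matrix, with
                | a cut defined relative to the largest distance, gives
                | the same labels when every distance is multiplied by
                | the same constant.
                | 
                |   import numpy as np
                |   from scipy.spatial.distance import pdist, squareform
                |   from scipy.cluster.hierarchy import linkage, fcluster
                |   
                |   def cluster(D):
                |       # only the metric is used; the cut is relative
                |       # (half the largest distance), so multiplying
                |       # every distance by the same c > 0 changes nothing
                |       cond = squareform(D, checks=False)
                |       return fcluster(linkage(cond, "single"),
                |                       t=0.5 * cond.max(),
                |                       criterion="distance")
                |   
                |   X = np.random.default_rng(0).normal(size=(12, 2))
                |   D = squareform(pdist(X))
                |   assert (cluster(D) == cluster(7.3 * D)).all()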
        
             | yshklarov wrote:
             | If different components of the dataset have different
             | units, I would argue that it is a prerequisite of
             | clustering to first specify the relative importance of each
             | particular unit (thereby putting all units on the same
             | scale). Otherwise, there's no way the clustering algorithm
              | could possibly know what to do in certain cases (such as the
             | ::: example).
             | 
             | It's true that there is no _intrinsic_ meaning to the
             | scale, but you must specify at least a relative scale --
             | how you want to compare (or weigh) different units --
             | before you can meaningfully cluster the data. Clustering
             | can only work on dimensionless data.
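              | 
              | In code terms that usually means something like this (a
              | generic sketch of mine, not from the paper): pick a
              | scale per feature, e.g. its standard deviation or an
              | explicit weight, so that distances are computed on
              | dimensionless numbers.
              | 
              |   import numpy as np
              |   
              |   def make_dimensionless(X, weights=None):
              |       # one standard deviation per feature becomes the
              |       # unit; weights encode relative importance
              |       Z = (X - X.mean(axis=0)) / X.std(axis=0)
              |       return Z if weights is None else Z * weights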
        
         | nextaccountic wrote:
         | Scale invariance is well motivated in the sense that perhaps
         | you already normalized your data (usually to fit [-1, 1] or
         | something) and you would be bummed to discover it fucked up
         | your clusters
         | 
         | Or likewise: if you have physical data recorded in some units
         | (say, meters), it would suck if the clusters changed if you had
         | measured stuff in another unit instead
        
       | jpcom wrote:
       | Sounds like Richness is an index in SQL terms. Would you agree?
        
       | keithalewis wrote:
       | Let S = {1, 2}. Every distance function is determined by d(1,2) =
       | a, a >= 0. Define f(d) = {{1,2}} if a = 0 and f(d) = {{1},{2}}
       | otherwise. Isn't this a clustering algorithm that is scale
       | invariant, rich, and consistent?
        
         | bubblyworld wrote:
         | Looks that way to me, yeah, though this is obviously a super
         | simple case. It's clearly scale invariant and there are only
         | two partitions, which your algorithm hits, so it's rich.
          | Consistency is trivially satisfied in both cases too.
        
           | n0bra1n wrote:
            | I think I found the issue: the paper says the distance
            | function is 0 IFF the elements are equal. So for this
            | example, you cannot define d(1,2) as equal to 0. So it is
            | not rich, as this is the only way to get the partition
            | {{1,2}}.
        
             | bubblyworld wrote:
             | Oh, I see, it's not a true metric. That's fair enough,
             | though I wonder if the result depends critically on that
             | assumption.
             | 
             | (you can pass to equivalence classes to recover a true
             | metric, and I didn't see anything obviously incompatible
             | with that in the paper, but I admit I didn't look very
             | deeply)
        
       | piker wrote:
       | Interesting the author neglects to mention the Gap statistic[1]
       | for the stopping criteria. It's a bit of a mouthful, but it does
       | provide a metric for determining how many clusters the data
       | indicates are present as compared to a random distribution in the
       | same dimensions.
       | 
       | [1] https://academic.oup.com/jrsssb/article-
       | abstract/63/2/411/70...
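        | 
        | Roughly: for each k, compare the log of the within-cluster
        | dispersion on your data with its average on reference data
        | drawn uniformly over the data's bounding box, and keep the
        | smallest k where the gap stops improving. A rough sketch of
        | mine (assuming scikit-learn; not the paper's reference code):
        | 
        |   import numpy as np
        |   from sklearn.cluster import KMeans
        |   
        |   def log_wk(data, k, seed=0):
        |       # within-cluster dispersion W_k (k-means inertia)
        |       km = KMeans(n_clusters=k, n_init=10,
        |                   random_state=seed).fit(data)
        |       return np.log(km.inertia_)
        |   
        |   def gap_statistic(X, k_max=10, n_refs=20, seed=0):
        |       rng = np.random.default_rng(seed)
        |       lo, hi = X.min(axis=0), X.max(axis=0)
        |       gaps, sks = [], []
        |       for k in range(1, k_max + 1):
        |           refs = np.array([
        |               log_wk(rng.uniform(lo, hi, size=X.shape), k)
        |               for _ in range(n_refs)])
        |           gaps.append(refs.mean() - log_wk(X, k))
        |           sks.append(refs.std() * np.sqrt(1 + 1 / n_refs))
        |       # smallest k with Gap(k) >= Gap(k+1) - s_(k+1)
        |       for k in range(1, k_max):
        |           if gaps[k - 1] >= gaps[k] - sks[k]:
        |               return k
        |       return k_max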
        
       | Xcelerate wrote:
        | I've never found a use case for clustering where, if the goal is
       | accurate prediction, some other technique doesn't work better.
       | 
       | Now if the goal is a quick prototype or to get an intuitive sense
       | of the structure of the data, then sure, it's fine.
       | 
       | But of course you're always sacrificing something desirable when
       | you try to shoehorn data into a model that doesn't fit.
        
         | zelphirkalt wrote:
         | I mean, clustering is for actually finding the structure of the
          | data ... It is a collection of structure-discovering methods.
          | If one doesn't know the probable structure, of course one
          | starts with little knowledge.
        
       | monkeyjoe wrote:
       | I think there is hidden gold in the linked research paper. In the
       | second to last paragraph, it says if you are willing to discard
       | the trivial partition (each point on its own) from Richness, then
       | you *can* have Scale-Invariance and Refinement-Consistency as
       | well. To me this suggests an optimal (in some sense) class of
       | algorithms, perhaps to be chosen from based on computational
       | complexity, cross-validation, etc.
        
       ___________________________________________________________________
       (page generated 2024-12-27 23:01 UTC)