[HN Gopher] Introduction to K-Means Clustering
___________________________________________________________________
Introduction to K-Means Clustering
Author : gk1
Score : 168 points
Date : 2022-03-14 15:12 UTC (7 hours ago)
(HTM) web link (www.pinecone.io)
(TXT) w3m dump (www.pinecone.io)
| jdeaton wrote:
| After implementing k-means ~5 times in undergraduate
| classes, I found it interesting to learn that k-means is just a
| special case of expectation maximization.
| petters wrote:
| ...which in turn is just a special case of block coordinate
| descent. One set of variables is fixed, while another set is
| optimized and vice versa.
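| A minimal numpy sketch of that alternating view (hold the
| centroids fixed and optimize the assignments, then hold the
| assignments fixed and optimize the centroids); the names and
| defaults here are my own:
|
|   import numpy as np
|
|   def kmeans(X, k, n_iter=100, seed=0):
|       rng = np.random.default_rng(seed)
|       # Initialize centroids as k randomly chosen data points.
|       centroids = X[rng.choice(len(X), size=k, replace=False)]
|       for _ in range(n_iter):
|           # Block 1 / "E-step": assign each point to its nearest centroid.
|           dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
|           labels = dists.argmin(axis=1)
|           # Block 2 / "M-step": recompute each centroid as its cluster mean.
|           new_centroids = np.array([
|               X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
|               for j in range(k)
|           ])
|           if np.allclose(new_centroids, centroids):
|               break
|           centroids = new_centroids
|       return centroids, labels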
| amelius wrote:
| How do you benchmark a clustering algorithm?
| mistrial9 wrote:
| Times are changing online -- I definitely recall a two-volume,
| hardbound, >$100 USD edition of "Clustering Methods" in the
| university bookstore, which I did not buy. A non-native-English
| colleague of mine, with a Masters in Computer Science from a
| great (non-English) university, mocked me about whether I could
| even read that work. Now I mainly look at the scikit-learn docs.
| Good work, pinecone dot IO.
| monkeybutton wrote:
| I think it's now turned into somewhat of a competition. If you
| want your algorithm/package adopted, you have to make it open
| source and provide good documentation like this:
| https://lifelines.readthedocs.io/en/latest/Survival%20Analys...
| https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.h...
| danuker wrote:
| Props to Lifelines. It arranged my thoughts on how to handle
| survival data by requiring the data to be in a clear format.
| oofbey wrote:
| K-means has a lot going for it, but it's not very good at
| recovering "natural" clusters. Like if you have a dataset that
| you know actually is grouped in a certain way, k-means very often
| will not discover those clusters. Other more complex clustering
| algorithms will do a more reliable job. If you have to implement
| clustering yourself for some reason, k-means' simplicity is
| great. But how often do you need to write your own algorithm
| these days?
| [deleted]
| notafraudster wrote:
| One thing that's nice about k-means is that it can quite
| literally be grasped intuitively by, say, a 10-year-old.
| k-nearest neighbors is similarly graspable. I think these are
| wonderful techniques to introduce kids to data science without
| having to get into the math. Obviously you're going to have to
| abstract away high-dimensionality and a lot of the underlying
| math of "how" -- but that's ok!
|
| I'm not sure I've found either to be massively useful on real-
| world data, but that's OK too!
| globular-toast wrote:
| One thing not mentioned about k-means is that it requires your
| data to be in a Euclidean space. You can't just come up with a
| distance function and do clustering with it. That's one reason the
| other techniques often make sense. In fact, those techniques often
| don't care what the data actually is; they only need the
| distances, and even then they don't need all of them.
|
| Still, k-means is enormously practical despite its shortcomings.
| n4r9 wrote:
| An alternative I've used is k-medoids:
|
| https://en.m.wikipedia.org/wiki/K-medoids
|
| This is especially useful in cases where the distance between
| points is given by a matrix, and cannot be derived from the
| points themselves. For example, the drive time between
| locations on a road network.
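| For illustration, a small from-scratch sketch of the k-medoids
| idea on a precomputed distance matrix (this is the simple
| alternating "Voronoi iteration" variant, not full PAM; the names
| are made up):
|
|   import numpy as np
|
|   def k_medoids(D, k, n_iter=100, seed=0):
|       """D is an (n, n) matrix of pairwise distances, e.g. drive times."""
|       rng = np.random.default_rng(seed)
|       n = D.shape[0]
|       medoids = rng.choice(n, size=k, replace=False)
|       for _ in range(n_iter):
|           # Assign every point to its nearest medoid.
|           labels = D[:, medoids].argmin(axis=1)
|           # Within each cluster, pick the member with the smallest total
|           # distance to the other members as the new medoid.
|           new_medoids = medoids.copy()
|           for j in range(k):
|               members = np.where(labels == j)[0]
|               if len(members):
|                   within = D[np.ix_(members, members)].sum(axis=1)
|                   new_medoids[j] = members[within.argmin()]
|           if set(new_medoids) == set(medoids):
|               break
|           medoids = new_medoids
|       return medoids, D[:, medoids].argmin(axis=1)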
| whiplash451 wrote:
| Not sure why this was downvoted.
|
| K-means in its standard form does make a Euclidean assumption
| (it does not minimize an arbitrary distance; it minimizes the
| within-cluster variance).
|
| It is not correct to use arbitrary distances, because k-means
| may stop converging with other distance functions.
|
| See this thread for more info:
| https://stats.stackexchange.com/questions/81481/why-does-k-m...
| isoprophlex wrote:
| That's a really pedantic take; nothing precludes you from
| using whatever distance function you want to get the distance
| to the k centroids and acting accordingly.
| amilios wrote:
| Losing convergence guarantees is somewhat important!
| isoprophlex wrote:
| I totally agree with your point, but sometimes you can
| get away with being handwavy about it and making sure the
| centroids aren't skipping all over the place... and
| that's good enough.
| isoprophlex wrote:
| But you can. Get the distance to all your centroids, assign to
| closest centroid, update centroid location.
|
| The distance function can be anything.
| globular-toast wrote:
| What's a centroid? It's the point that minimises the sum of
| squared Euclidean distances between itself and every point in
| the cluster.
| isoprophlex wrote:
| It's a tuple of floats; for each data point you see, you move
| the closest centroid slightly towards that point, optionally
| with some rate of forgetting.
|
| You shuffle your big-ass wad of data and loop over each
| element doing the above, optionally decreasing the
| forgetting rate.
|
| You track the location of your centroids, and if your data /
| loss fn isn't too pathological, they won't misbehave too
| badly and you have a handwavy assurance of convergence...
| Definitions be damned.
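| A rough numpy sketch of the streaming update being described
| (the closest centroid gets nudged towards each incoming point,
| with a decaying rate); treat it as a sketch of this variant, not
| the textbook batch algorithm:
|
|   import numpy as np
|
|   def online_kmeans(X, k, epochs=5, lr0=0.5, seed=0):
|       rng = np.random.default_rng(seed)
|       X = np.asarray(X, dtype=float).copy()
|       centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
|       counts = np.zeros(k)
|       for _ in range(epochs):
|           rng.shuffle(X)  # shuffle the wad of data
|           for x in X:
|               # Any distance function could be swapped in here.
|               j = np.linalg.norm(centroids - x, axis=1).argmin()
|               counts[j] += 1
|               lr = lr0 / counts[j]                     # decaying "forgetting" rate
|               centroids[j] += lr * (x - centroids[j])  # nudge towards the point
|       return centroids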
| civilized wrote:
| So, you made up your own algorithm that is kinda like
| k-means, this algorithm might sometimes work, and this
| shows that k-means works in non-Euclidean spaces?
| CardenB wrote:
| What prevents you from using a different distance than
| Euclidean distance?
| danuker wrote:
| Suppose you use prices and square meters to cluster
| apartments. Suppose you also use a simple metric, like:
| distance <- sqrt(d_price^2 + d_area^2)
|
| Without normalizing the units somehow, so that they have a
| meaningfully similar numeric range, you are essentially
| only clustering by price, which is in the tens or hundreds
| of thousands.
|
| You use very little information about the area, because that
| is in the hundreds, at most the thousands.
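| A tiny numpy illustration of that effect, with made-up numbers:
| nearly all of the squared Euclidean distance comes from the
| price coordinate unless the features are rescaled first.
|
|   import numpy as np
|
|   a = np.array([100_000.0, 50.0])    # price in EUR, area in m^2
|   b = np.array([101_000.0, 120.0])   # similar price, very different area
|
|   contrib = (a - b) ** 2
|   print(contrib / contrib.sum())     # ~[0.995, 0.005]: price dominates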
| nerdponx wrote:
| This has nothing to do with using non-Euclidean distance,
| though. This is a separate issue related to having features
| on different scales.
| jacquesm wrote:
| That is not a proper distance function.
| danuker wrote:
| What do you mean? Is it not the Euclidean distance?
|
| https://en.wikipedia.org/wiki/Euclidean_distance
| [deleted]
| amilios wrote:
| The way k-means is constructed is not based on distances.
|
| K-means minimizes the within-cluster variance. If you look at the
| definition of variance, it is the same as the sum of squared
| Euclidean distances from the centroid, up to a factor of the
| cluster size.
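| A quick numpy check of that relationship: the sum of squared
| distances to the centroid equals the number of points times the
| summed per-coordinate variance.
|
|   import numpy as np
|
|   X = np.random.default_rng(0).normal(size=(100, 3))
|   centroid = X.mean(axis=0)
|   ssd = ((X - centroid) ** 2).sum()   # the k-means within-cluster objective
|   print(np.isclose(ssd, len(X) * X.var(axis=0).sum()))  # True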
| nerdponx wrote:
| Do you lose convergence guarantees if you minimize "sum of
| squared distances from the centroid" using your other-than-
| Euclidean distance metric?
| fmakunbound wrote:
| It's been ages since I've used K-means, but I thought you had a
| lot of freedom in how you define your distance function.
| tus666 wrote:
| Why not just do a proper course in data science?
| nerdponx wrote:
| Why read articles about computer science when you could take a
| proper course in computer science?
| tus666 wrote:
| Because it is far more generic and likely to be actually
| useful.
|
| Could you imagine someone doing K-means outside of a specific
| data science application?
| magicalhippo wrote:
| > Could you imagine someone doing K-means outside of a
| specific data science application?
|
| I implemented k-means clustering to reduce noise in a path
| tracer. I've never done a data science course or anything
| like that.
| lichtenberger wrote:
| That was one of the first algorithms I learned at university --
| thanks for posting :-)
| Bostonian wrote:
| Mixtures of multivariate normals can be regarded as a method of
| fuzzy clustering, where each point has a probability of coming
| from each of the component Gaussians. The EM (expectation
| maximization) algorithm can be used to fit mixture models.
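| A minimal scikit-learn sketch of that soft-clustering view, on
| toy data of my own; predict_proba gives each point's probability
| of belonging to each Gaussian component, fitted via EM:
|
|   import numpy as np
|   from sklearn.mixture import GaussianMixture
|
|   rng = np.random.default_rng(0)
|   X = np.vstack([rng.normal(0, 1, size=(200, 2)),
|                  rng.normal(4, 1, size=(200, 2))])
|
|   gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
|   probs = gmm.predict_proba(X)   # (400, 2) soft memberships per component
|   hard = gmm.predict(X)          # argmax of the above, if hard labels are needed
|   print(probs[:3].round(3))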
| sideproject wrote:
| Simplicity wins in many cases, and that is true of K-Means
| clustering. There are many clustering algorithms far superior
| to K-Means, but its simplicity and easy-to-apply approach to
| most datasets let pretty much anyone with programming skills
| understand it, implement it quickly, get results, and have that
| feeling of "oh, hey, that's my first data science thingy".
| Simplicity is very powerful.
| jacquesm wrote:
| K-Means clustering is one of those tools that, once you understand
| it, becomes more and more useful over time. You keep finding new
| ways to apply it; it's a bit like Quicksort, Bresenham, or Divide
| and Conquer in that sense.
| csunbird wrote:
| It is also super intuitive. The logic makes sense immediately
| the first time you see it.
| Helmut10001 wrote:
| Working in spatial data science, I rarely find applications
| where k-means is the best tool. The problem is that it is
| difficult to know how many clusters to expect on maps. Is
| it 5, 500, or 10,000? Here HDBSCAN [1] shines, because it will
| cluster _and_ select the most suitable number of clusters by
| cutting the single-linkage cluster tree.
|
| [1]: https://github.com/scikit-learn-contrib/hdbscan
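| For reference, a minimal usage sketch with the hdbscan package
| linked above (the synthetic blobs and the min_cluster_size value
| are just examples):
|
|   import numpy as np
|   import hdbscan
|
|   rng = np.random.default_rng(0)
|   X = np.vstack([rng.normal(0, 1, (200, 2)),   # stand-in for projected coordinates
|                  rng.normal(6, 1, (200, 2))])
|
|   clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
|   labels = clusterer.fit_predict(X)            # -1 marks noise points
|   n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
|   print(n_clusters)                            # chosen by the algorithm, not by us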
| falkenb0t wrote:
| I found myself having to do some clustering a few months back
| and learned about mean-shift clustering [1]. I found it to be
| a nice clustering algorithm for when you don't know the number
| of clusters ahead of time.
|
| That said, it's worth noting that there are algorithms out
| there for determining the optimal number of clusters for k-means
| (though personally I found them to be costly and prone to
| overfitting), such as the silhouette coefficient or the elbow
| method [2].
|
| [1]: https://www.geeksforgeeks.org/ml-mean-shift-clustering/
| [2]: https://towardsdatascience.com/k-means-clustering-how-
| it-wor...
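| A hedged sketch of the silhouette approach with scikit-learn,
| brute-forcing over candidate values of k (the data here is just
| a placeholder):
|
|   import numpy as np
|   from sklearn.cluster import KMeans
|   from sklearn.metrics import silhouette_score
|
|   X = np.random.default_rng(0).normal(size=(300, 4))   # substitute real features
|
|   scores = {}
|   for k in range(2, 11):
|       labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
|       scores[k] = silhouette_score(X, labels)
|
|   best_k = max(scores, key=scores.get)
|   print(best_k, scores[best_k])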
| CuriouslyC wrote:
| A bigger problem with k-means than figuring out the number
| of clusters is the fact that the model assumes multivariate
| Gaussian spheres. That's a bad model for most non-toy data.
| Royi wrote:
| It is not only limited to Gaussian spheres, it also assumes
| they are isotropic (unless the data is preprocessed or the
| distance is defined by the Mahalanobis distance).
| jacquesm wrote:
| Can you explain this please?
| nerdponx wrote:
| Note also that specifically for one-dimensional data, there
| is a globally optimal solution to the k-means clustering
| problem. There is an R package that implements it using a
| C++ core implementation [1], and also a Python wrapper [2].
|
| This implementation is also surprisingly fast, so you can
| use it to brute-force check many different numbers of
| clusters and check using silhouette distance. The advantage
| over traditional k-means is that you don't need to check
| multiple initializations for any given number of clusters,
| because the algorithm is deterministic and guaranteed to
| find the global optimum.
|
| [1]: https://cran.r-project.org/package=Ckmeans.1d.dp
|
| [2]: https://github.com/djdt/ckwrap
| Royi wrote:
| Is there a paper or a post which describes the solution?
| nerdponx wrote:
| Yes, the R package was written by the authors of the
| paper, and the CRAN info page references the paper.
| civilized wrote:
| The elbow method exists in many fields. It essentially
| boils down to "hope that an obviously good choice exists"
| jacquesm wrote:
| Yes, that is a good point: if you don't have a good idea of
| how many subdivisions you want, then K-Means won't help you
| and HDBSCAN is better.
|
| I always liken the latter to 'automatic feature detection'.
| NicoJuicy wrote:
| I'd be interested to hear some of those examples, actually. I
| know what it is, but I haven't used it in practice (yet).
| jacquesm wrote:
| Just one example from recent practice: automatically
| collecting a number of texts (PDFs) into a set of groups for
| easier subsequent processing (summarization). Without K-Means
| you'd have a pretty hard time figuring out, for each
| subsequent document, which group you want to add it to or
| whether you need to make a new group. With K-Means,
| given a fairly simple 'distance' function, all of the grouping
| is automatic. In fact, once you have K-Means coded up, the
| distance function _is_ the problem to solve. If you can't
| meaningfully define a distance then you probably won't be
| able to use it.
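| A hedged sketch of that kind of pipeline with scikit-learn;
| TF-IDF is a stand-in for whatever representation/distance was
| actually used, and 'docs' is a hypothetical list of extracted
| PDF texts:
|
|   from sklearn.feature_extraction.text import TfidfVectorizer
|   from sklearn.cluster import KMeans
|
|   docs = [  # hypothetical texts extracted from the PDFs
|       "invoice for cloud hosting services rendered in march",
|       "invoice for consulting services rendered in april",
|       "minutes of the quarterly board meeting",
|       "agenda and minutes of the annual shareholder meeting",
|   ]
|
|   vec = TfidfVectorizer(stop_words="english")
|   X = vec.fit_transform(docs)
|
|   km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
|   print(km.labels_)   # which group each existing document landed in
|
|   # Routing a subsequent document into one of the existing groups:
|   print(km.predict(vec.transform(["invoice for services rendered in may"])))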
| NicoJuicy wrote:
| Do you have an example of the groups (= output) and what
| you are using as variables for input?
|
| E.g., wouldn't KNN be easier if you want to classify new
| groups later on? I'm probably missing something, but
| defining the groups upfront seems more labor-intensive
| (note: novice in this area).
| dilyevsky wrote:
| Not the parent, but I've tried using k-means for a similar task,
| as well as LDA, and found LDA to be better, particularly if your
| documents have a mixture of topics.
| flashfaffe2 wrote:
| It depends on the task, but using K-means implies you have an
| idea of the clusters you might have (unsupervised
| learning). KNN assumes the groups already exist.
| jacquesm wrote:
| Well, you don't so much define the groups up front as define
| the _number_ of groups up front; the distance function is what
| will eventually result in the mapping of inputs to groups, if
| you use the groups (for instance, sample texts in those groups)
| as part of the scoring system.
| nerdponx wrote:
| This is probably the biggest (only?) advantage of k-means
| in real problems: it's easy to put new data points into
| existing clusters. You can do this with mean-shift
| clustering as well.
|
| It's not obvious how to do this with HDBSCAN or
| hierarchical clustering. Maybe you'd have to draw some kind
| of boundary around each cluster after fixing the
| parameters.
| anonymousDan wrote:
| You mean to update the clusters incrementally as new data
| arrives?
| importantbrian wrote:
| Interesting. Did you try any other techniques, like HDBSCAN?
| I've found HDBSCAN to produce more meaningful clusters for
| a similar workflow.
| time_to_smile wrote:
| My experience has been the opposite: K-Means clustering is
| generally pretty useless in the real world with real data.
|
| In general clustering algorithms tend not to be very useful
| because they are ultimately just very poorly defined latent
| variable models. Most people starting out with clustering don't
| even understand that they're making the assumption that K
| latent variables are responsible for generating the observed
| data.
|
| K-means in particular has loads of problems for practical
| applications: the final location of the means is heavily
| dependent on the initial locations; the final location can also
| be heavily dependent on the particular dataset used (try
| bootstrap resampling to get the variance of your means); it
| assumes the data rests on a Euclidean surface; and making
| geometric assumptions about high-dimensional spaces is almost
| always a bad idea (closeness in Euclidean terms in
| high-dimensional spaces often misses things that are nearly
| identical except for one feature that is quite different).
|
| With over a decade of modeling experience and building data
| science products, I have never seen a product ship that relies
| on k-means, and never seen an analysis that relies on it that
| didn't come from someone junior in the field.
|
| K-means is a great demo of the expectation maximization
| algorithm, which is a useful tool to understand, but generally I
| would avoid k-means, and clustering in general, unless you have
| an explicit understanding of why you are doing it. Even then,
| there is probably a more stable and appropriate latent variable
| model you could use.
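| A small scikit-learn sketch of two of those points, on toy data
| of my own: probing sensitivity to the initialization, and
| bootstrap resampling to estimate the variance of the fitted
| means.
|
|   import numpy as np
|   from sklearn.cluster import KMeans
|
|   rng = np.random.default_rng(0)
|   X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(3, 1, (150, 2))])
|
|   # Refit with different seeds (one init each) to see how much the means move.
|   for seed in range(3):
|       km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
|       print(np.sort(km.cluster_centers_[:, 0]))
|
|   # Bootstrap resampling to estimate the variance of the fitted means.
|   centers = []
|   for _ in range(200):
|       sample = X[rng.integers(0, len(X), size=len(X))]
|       km = KMeans(n_clusters=2, n_init=5, random_state=0).fit(sample)
|       centers.append(np.sort(km.cluster_centers_[:, 0]))
|   print(np.std(centers, axis=0))   # spread of the estimated means across resamples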
| nerdponx wrote:
| > With over a decade of modeling experience and building data
| > science products, I have never seen a product ship that relies
| > on k-means, and never seen an analysis that relies on it that
| > didn't come from someone junior in the field.
|
| I used it exactly once, and it worked quite well :) But that
| was because I was clustering 1-dimensional data, which has a
| globally-optimal solution and there was a fast implementation
| for it available (see my post
| https://news.ycombinator.com/item?id=30675594). So it was
| okay in that instance to brute-force check a lot of different
| cluster numbers and then post-process the results with some
| heuristics.
| anonymousDan wrote:
| The more I learn about machine learning, the more it seems to me
| that the real key is to understand optimization algorithms (both
| analytical and numerical) and how/when they are likely to
| converge. Am I wrong? Are there good books summarizing the
| relationship between all the different approaches to optimization
| used in ML?
| cinntaile wrote:
| You're basically trying to fit data to some model and to help
| with that you have a cost function that you try to optimize.
| Any decent machine learning curriculum will have an
| optimization course.
___________________________________________________________________
(page generated 2022-03-14 23:00 UTC)