[HN Gopher] BERTopic: The Future of Topic Modeling
___________________________________________________________________
BERTopic: The Future of Topic Modeling
Author : gk1
Score : 89 points
Date : 2022-05-11 15:19 UTC (7 hours ago)
(HTM) web link (www.pinecone.io)
(TXT) w3m dump (www.pinecone.io)
| tbonza wrote:
| What happens on a slightly different task where domain experts
| have tried to create a set of topics, not all domain experts talk
| to each other, and so we instead need a way to merge existing
| topics? I continue to see benchmarks where human expertise
| significantly outperforms AI on common sense reasoning tasks
| (most recently https://arxiv.org/abs/2112.11446).
|
| What about an approach using directed acyclic graphs and
| entities?
| whakim wrote:
| In traditional qualitative research, you'd usually have a bunch
| of experts get together and figure out a set of topics (or
| import and adapt a set of topics from similar work) _before_
| you go about classifying the bulk of your data.
| visarga wrote:
| Next step: automatically naming clusters using few-shot GPT-3.
| Cluster naming is a non-trivial problem.
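|
| A minimal sketch of the few-shot idea: build a prompt from hand-named
| example clusters and ask the model to complete the last line. The
| example keywords, names, and prompt format below are all made up; in
| practice you'd send the string to a completion API such as GPT-3.

```python
# Hypothetical sketch: naming a topic cluster with a few-shot prompt.
# All example data and the prompt layout are invented for illustration.
def build_naming_prompt(top_words, examples):
    """Assemble a few-shot prompt ending in an unfinished 'Topic name:'."""
    blocks = []
    for words, name in examples:
        blocks.append(f"Keywords: {', '.join(words)}\nTopic name: {name}")
    # The cluster we actually want named goes last, left incomplete.
    blocks.append(f"Keywords: {', '.join(top_words)}\nTopic name:")
    return "\n\n".join(blocks)

few_shot = [
    (["goal", "striker", "penalty", "league"], "Soccer"),
    (["gpu", "cuda", "tensor", "backprop"], "Deep learning frameworks"),
]
prompt = build_naming_prompt(["etf", "dividend", "portfolio", "stocks"], few_shot)
print(prompt)
```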
| vosper wrote:
| I've run into this problem at a previous employer. Do you know
| if anyone's working on it?
| jll29 wrote:
| "topic labeling" papers: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22t...
| visarga wrote:
| There is a recent discussion in r/MachineLearning about it:
|
| https://www.reddit.com/r/MachineLearning/comments/umgdts/p_c...
| malshe wrote:
| Absolutely. This is the point where I think we shift from
| astronomy to astrology!
| victorianoi wrote:
| Take a look at Graphext ( https://www.graphext.com ). It
| automatically creates the clustering embeddings using BERT for
| you, plus great visualization libraries to interpret the clusters
| :D It took us 5 years to build the product.
| CabSauce wrote:
| It sure has to be much, much better than free. Especially if
| your pricing is 'contact sales'.
| m1sta_ wrote:
| I'm not going to contact sales. I have no problem paying
| though.
|
| Part of your value proposition is saving people time, but your
| sales model is time expensive.
| lmeyerov wrote:
| Latest pygraphistry has this flow for free and OSS, just `pip
| install graphistry[umap-learn]` or, for transformers, `pip
| install graphistry[ai]` :)
|
| And per the article, with pluggable sentence transformers ->
| UMAP automatically as part of the auto-featurization:
| graphistry.nodes(accounts_df).umap().plot() :)
|
| We haven't published tutorials yet, just been using with some
| fraud/cyber/misinfo/genomics/sales/gov/etc teams (including
| with RAPIDS gpu accel), so cool to see excitement here already!
| Till then, it should work out-of-the-box with no parameters,
| and then all sorts of fun things to tune:
| https://github.com/graphistry/pygraphistry/blob/21fad42412cc...
| ahoho wrote:
| Color me skeptical on BERTopic. Without human validation, I'm not
| convinced that it's an improvement over existing methods.
|
| I'm an author on a recent paper about automated topic model
| evaluation [1], and we found that current metrics do not line up
| with human judgements as well as previously thought. To my
| knowledge, BERTopic has only been evaluated on these automated
| metrics.
|
| For datasets of under a few hundred thousand documents, Mallet
| (LDA estimated with Gibbs sampling) can produce stable, high-
| quality outputs in minutes on a laptop [2]. Even larger datasets
| remain tractable, although depending on your use case you may be
| better off subsampling.
|
| It's possible that I've missed something, but I'm not clear on
| what benefits BERTopic has that existing methods do not. I don't
| mean to be overly negative---it has a nice API and the approach
| seems reasonable---I'm just wondering what's really new here.
|
| [1]:
| https://proceedings.neurips.cc/paper/2021/hash/0f83556a305d7...
| [2]: https://mimno.github.io/Mallet/ [3]:
| https://maartengr.github.io/BERTopic/faq.html#why-are-the-re...
| toxik wrote:
| How does this compare to LDA? It doesn't seem like there's a huge
| difference here. For good reason perhaps, the BERT part is only
| to embed the sentences.
| berto4 wrote:
| Yeah, exactly my question. LDA is probabilistic and very
| performant if you clean up the documents well. The approach
| using BERT seems pretty powerful given that you can now cluster
| based on semantics, not just word occurrences/frequencies as in
| LDA (though ngrams help). However, using a clustering approach
| would mean that each document is part of a single topic,
| rather than being made up of multiple topics. But this is a
| cool idea nonetheless. [EDIT] Quickly checked it out; it seems
| to use some kind of soft clustering, so documents can occur
| in many clusters (topics).
| a-dub wrote:
| would it make sense to preprocess with a transformer style
| model to produce per document semantic vectors which can then
| be piped into LDA to find topic mixtures of those vectors?
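|
| One wrinkle with that idea: LDA proper expects word counts, so dense
| transformer vectors can't be fed into it directly. A rough analogue
| that does yield per-document "topic mixtures" over embeddings is a
| Gaussian mixture with soft responsibilities. A sketch, with random
| vectors standing in for real transformer document embeddings:

```python
# Soft "topic mixtures" over dense document vectors via a Gaussian
# mixture (an LDA-like soft assignment, not LDA itself).
# Synthetic vectors stand in for transformer embeddings.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic semantic clusters in a 16-dim embedding space.
emb = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 16)),
    rng.normal(2.0, 0.3, size=(50, 16)),
])

gm = GaussianMixture(n_components=2, random_state=0).fit(emb)
# predict_proba returns per-document responsibilities; each row sums to 1,
# analogous to a document's topic mixture.
mixtures = gm.predict_proba(emb)
print(mixtures.shape)
```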
| uoaei wrote:
| Is that not exactly what's happening in TFA?
| a-dub wrote:
| if TFA means "the aforementioned article", i don't think
| so. i'm not convinced that the clusters found and the
| frequencies in those clusters would be the same as what
| LDA computes with gibbs sampling or the variational
| calculus method would find. but i must admit it's been a
| while since i've played with this stuff.
|
| if TFA is some other method, i am unfamiliar and would
| like to know more.
|
| in my experience, while it's true that it's hard to score
| and verify these sorts of models, the hierarchical
| multinomial nature of LDA topic models makes it easy to
| generate data and then verify behavior in the fitting
| process by recovering generative model parameters used by
| the test data generation process. obviously this makes no
| sense for the bert frontend, but a comparison of the
| differing backend clustering methods could be
| interesting.
| uoaei wrote:
| Well, they're not _supposed_ to be the same clusters. The
| reason people develop new methods is to surpass the old
| ones.
|
| I'm just saying that the method described in the link
| seems to be exactly what you are describing: using
| document embedding vectors as input to soft clustering
| mechanisms akin to LDA. Of course it does not interface
| perfectly with the theoretical underpinnings of LDA
| because those are quite constrained to tf-idf (generally
| count-based) inputs.
|
| As an aside, "TFA" translates to "the fucking article"
| and is a reference to the classic Internet acronym "RTFM"
| standing for "read the fucking manual". Both are passive-
| aggressive-cum-colloquial ways to imply that answers are
| in places you would expect to find them, if only you go
| to read the source.
| chaxor wrote:
| I think he _might_ mean term frequency analysis?
| chaxor wrote:
| Nevermind. I should have read what was said.
| uniqueuid wrote:
| It is true that BERTopic is a great tool. It's modern, it's
| modular, and it's pretty performant.
|
| That said, I want to caution against using topic modeling as a
| one-size-fits-all solution. As the author stresses, this is one
| particular approach, which uses a combination of embeddings
| (sentence, or other), UMAP, and HDBSCAN. Both UMAP and HDBSCAN
| can be slow, so it might be worthwhile to check out the
| GPU-enabled versions of both from the cuML package.
|
| In addition, topic models have a huge number of degrees of
| freedom, and the solution you will get depends on many (seemingly
| arbitrary) choices. In other words, these are not _the_ topics,
| they are _some_ topics.
|
| That said, it's awesome, really great work by Maarten
| Grootendorst and a great blog post by James Briggs.
|
| [edit] here is a link to the fast CUDA version of bertopic by
| rapidsai: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT...
| hack_ml wrote:
| It's seamless to accelerate BERTopic on GPUs with cuML now, with
| the latest release (v0.10.0).
|
| Check out the docs at:
| https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...
|
| All you need to do is the following:
|
|     from bertopic import BERTopic
|     from cuml.cluster import HDBSCAN
|     from cuml.manifold import UMAP
|
|     # Create instances of GPU-accelerated UMAP and HDBSCAN
|     umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
|     hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
|
|     # Pass the above models to be used in BERTopic
|     topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
|     topics, probs = topic_model.fit_transform(docs)
| whakim wrote:
| I agree that this is cool. That being said, the results show
| that we have a long, long way to go. The topics are pretty
| incoherent: what are "would", "should", and "use" doing in
| there? The words have no context, so (for example) "self"
| clearly refers to the Python reserved keyword, but you have no
| way of knowing that. Not to mention (as another comment brings
| up), the topics aren't named, so it's pretty hard to actually
| figure out what they're about. If we think about real-world
| usage, this output would be practically useless - it tells you
| that people talk about investing in r/investing and pytorch in
| r/pytorch. If you want meaningful, actionable information about
| what people are talking about in a large corpus of unstructured
| text data, then for the foreseeable future you'll need to
| involve humans in the loop, even if ML assistance plays a big
| part.
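|
| The junk words specifically ("would", "should", "use") are often just
| missing stopword removal at the topic-word-extraction step. BERTopic's
| docs describe passing a custom CountVectorizer for this (its
| vectorizer_model parameter); the filtering itself is shown standalone
| below with scikit-learn only, on made-up example sentences.

```python
# Stripping stopwords from extracted topic words with scikit-learn's
# built-in English stop list. With BERTopic, a vectorizer configured
# like this can be passed as vectorizer_model=vec.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words="english")
vec.fit(["you should use pytorch tensors", "we would use numpy arrays"])

# Stopwords like "should"/"would" are dropped from the vocabulary.
print(sorted(vec.vocabulary_))
```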
| Der_Einzige wrote:
| A huggingface space I wrote to let you play with BERTopic in your
| browser:
|
| https://huggingface.co/spaces/Hellisotherpeople/HF-BERTopic
| leobg wrote:
| Wow. This is great!
___________________________________________________________________
(page generated 2022-05-11 23:01 UTC)