[HN Gopher] BERTopic: The Future of Topic Modeling
       ___________________________________________________________________
        
       BERTopic: The Future of Topic Modeling
        
       Author : gk1
       Score  : 89 points
       Date   : 2022-05-11 15:19 UTC (7 hours ago)
        
 (HTM) web link (www.pinecone.io)
 (TXT) w3m dump (www.pinecone.io)
        
       | tbonza wrote:
       | What happens on a slightly different task where domain experts
       | have tried to create a set of topics, not all domain experts talk
       | to each other, and so we instead need a way to merge existing
       | topics? I continue to see benchmarks where human expertise
       | significantly outperforms AI on common sense reasoning tasks
       | (most recently https://arxiv.org/abs/2112.11446).
       | 
       | What about an approach using directed acyclic graphs and
       | entities?
        
         | whakim wrote:
         | In traditional qualitative research, you'd usually have a bunch
         | of experts get together and figure out a set of topics (or
         | import and adapt a set of topics from similar work) _before_
         | you go about classifying the bulk of your data.
        
       | visarga wrote:
       | Next step: automatically naming clusters using few-shot GPT-3.
       | Cluster naming is a non-trivial problem.
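        | 
        | A rough sketch of how that could look, assuming the 2022-era
        | OpenAI completion API (the model name, prompt format, and
        | example pairs here are made up for illustration):
        | 
        |     import openai  # assumes OPENAI_API_KEY is set in the env
        | 
        |     def name_cluster(keywords, examples):
        |         # Few-shot prompt: hand-written keyword -> label pairs,
        |         # then the cluster we actually want named.
        |         shots = "\n".join(f"Keywords: {kw}\nTopic name: {label}"
        |                           for kw, label in examples)
        |         prompt = (f"{shots}\nKeywords: {', '.join(keywords)}"
        |                   f"\nTopic name:")
        |         resp = openai.Completion.create(model="text-davinci-002",
        |                                         prompt=prompt,
        |                                         max_tokens=10,
        |                                         temperature=0)
        |         return resp.choices[0].text.strip()
        | 
        |     examples = [("bat, ball, inning, pitcher", "Baseball"),
        |                 ("gpu, cuda, kernel, tensor", "GPU computing")]
        |     print(name_cluster(["etf", "dividend", "index", "stocks"],
        |                        examples))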
        
         | vosper wrote:
         | I've run into this problem at a previous employer. Do you know
         | if anyone's working on it?
        
           | jll29 wrote:
           | "topic labeling" papers: https://scholar.google.com/scholar?h
           | l=en&as_sdt=0%2C5&q=%22t...
        
           | visarga wrote:
           | There is a recent discussion in r/machinelearning about it
           | 
            | https://www.reddit.com/r/MachineLearning/comments/umgdts/p_c...
        
         | malshe wrote:
         | Absolutely. This is the point where I think we shift from
         | astronomy to astrology!
        
       | victorianoi wrote:
        | Take a look at Graphext (https://www.graphext.com). It
        | automatically creates the clustering embeddings using BERT for
        | you, plus great visualization libraries to interpret the
        | clusters :D It took us 5 years to build the product.
        
         | CabSauce wrote:
         | It sure has to be much, much better than free. Especially if
         | your pricing is 'contact sales'.
        
         | m1sta_ wrote:
         | I'm not going to contact sales. I have no problem paying
         | though.
         | 
         | Part of your value proposition is saving people time, but your
         | sales model is time expensive.
        
         | lmeyerov wrote:
         | Latest pygraphistry has this flow for free and OSS, just `pip
         | install graphistry[umap-learn]` or, for transformers, `pip
         | install graphistry[ai]` :)
         | 
         | And per the article, with pluggable sentence transformers ->
         | UMAP automatically as part of the auto-featurization:
         | graphistry.nodes(accounts_df).umap().plot() :)
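          | 
          | Spelled out, the whole flow is roughly this (registration
          | details are placeholders for your own server/account, and the
          | CSV is just a stand-in dataframe):
          | 
          |     import graphistry
          |     import pandas as pd
          | 
          |     # placeholder credentials; use your own Graphistry account
          |     graphistry.register(api=3, username="...", password="...")
          | 
          |     accounts_df = pd.read_csv("accounts.csv")  # any dataframe
          | 
          |     # auto-featurize -> embeddings -> UMAP -> interactive plot
          |     graphistry.nodes(accounts_df).umap().plot()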
         | 
          | We haven't published tutorials yet, just been using it with
          | some fraud/cyber/misinfo/genomics/sales/gov/etc teams
          | (including with RAPIDS GPU accel), so it's cool to see
          | excitement here already!
         | Till then, it should work out-of-the-box with no parameters,
         | and then all sorts of fun things to tune:
         | https://github.com/graphistry/pygraphistry/blob/21fad42412cc...
        
       | ahoho wrote:
       | Color me skeptical on BERTopic. Without human validation, I'm not
       | convinced that it's an improvement over existing methods.
       | 
       | I'm an author on a recent paper about automated topic model
       | evaluation [1], and we found that current metrics do not line up
       | with human judgements as well as previously thought. To my
       | knowledge, BERTopic has only been evaluated on these automated
       | metrics.
       | 
       | For datasets of under a few hundred thousand documents, Mallet
       | (LDA estimated with Gibbs sampling) can produce stable, high-
       | quality outputs in minutes on a laptop [2]. Even larger datasets
       | remain tractable, although depending on your use case you may be
       | better off subsampling.
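        | 
        | For reference, a minimal sketch of that classical pipeline
        | (this swaps in gensim's variational LdaModel for Mallet's
        | Gibbs sampler, and the preprocessing is a toy placeholder):
        | 
        |     from gensim import corpora
        |     from gensim.models import LdaModel
        | 
        |     # Real pipelines do stopword removal, lemmatization, etc.
        |     texts = [doc.lower().split() for doc in docs]
        |     dictionary = corpora.Dictionary(texts)
        |     dictionary.filter_extremes(no_below=5, no_above=0.5)
        |     corpus = [dictionary.doc2bow(t) for t in texts]
        | 
        |     lda = LdaModel(corpus, id2word=dictionary,
        |                    num_topics=50, passes=10)
        |     for tid, words in lda.show_topics(num_topics=5, num_words=8,
        |                                       formatted=False):
        |         print(tid, [w for w, _ in words])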
       | 
       | It's possible that I've missed something, but I'm not clear on
       | what benefits BERTopic has that existing methods do not. I don't
       | mean to be overly negative---it has a nice API and the approach
       | seems reasonable---I'm just wondering what's really new here.
       | 
        | [1]: https://proceedings.neurips.cc/paper/2021/hash/0f83556a305d7...
        | [2]: https://mimno.github.io/Mallet/
        | [3]: https://maartengr.github.io/BERTopic/faq.html#why-are-the-re...
        
       | toxik wrote:
        | How does this compare to LDA? It doesn't seem like there's a
        | huge difference here. Perhaps for good reason: the BERT part is
        | only used to embed the sentences.
        
         | berto4 wrote:
          | yeah exactly my question. LDA is probabilistic and very
          | performant if you clean up the documents well. The approach
          | using BERT seems pretty powerful given that you can now
          | cluster based on semantics, not just word
          | occurrences/frequencies as in LDA (though n-grams help).
          | However, using a clustering approach would mean that each
          | document is part of a single topic, rather than being made up
          | of multiple topics. But this is a cool idea nonetheless.
          | [EDIT] quickly checked it out; it seems to use some kind of
          | soft clustering, so documents can occur in many clusters
          | (topics).
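          | 
          | For anyone curious, the soft assignments look roughly like
          | this (a sketch using BERTopic's documented
          | calculate_probabilities option; docs is assumed to be a list
          | of strings):
          | 
          |     from bertopic import BERTopic
          | 
          |     # calculate_probabilities=True makes HDBSCAN emit soft
          |     # memberships: a distribution over all topics per document
          |     topic_model = BERTopic(calculate_probabilities=True)
          |     topics, probs = topic_model.fit_transform(docs)
          | 
          |     # top 3 topics for the first document
          |     top3 = probs[0].argsort()[::-1][:3]
          |     print(top3, probs[0][top3])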
        
           | a-dub wrote:
           | would it make sense to preprocess with a transformer style
           | model to produce per document semantic vectors which can then
           | be piped into LDA to find topic mixtures of those vectors?
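            | 
            | something like this sketch, say (a gaussian mixture stands
            | in for LDA here, since LDA proper expects word counts
            | rather than dense vectors; the model name is just one
            | choice):
            | 
            |     from sentence_transformers import SentenceTransformer
            |     from sklearn.mixture import GaussianMixture
            | 
            |     # transformer embeddings per document
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     vecs = model.encode(docs)           # (n_docs, 384)
            | 
            |     # soft mixture over components ~ a "topic mixture"
            |     gmm = GaussianMixture(n_components=20).fit(vecs)
            |     mixtures = gmm.predict_proba(vecs)  # (n_docs, 20)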
        
             | uoaei wrote:
             | Is that not exactly what's happening in TFA?
        
               | a-dub wrote:
                | if TFA means "the aforementioned article", i don't think
                | so. i'm not convinced that the clusters found and the
                | frequencies in those clusters would be the same as what
                | LDA computes with gibbs sampling or variational
                | inference. but i must admit it's been a while since i've
                | played with this stuff.
               | 
               | if TFA is some other method, i am unfamiliar and would
               | like to know more.
               | 
               | in my experience, while it's true that it's hard to score
               | and verify these sorts of models, the hierarchical
               | multinomial nature of LDA topic models makes it easy to
               | generate data and then verify behavior in the fitting
               | process by recovering generative model parameters used by
               | the test data generation process. obviously this makes no
               | sense for the bert frontend, but a comparison of the
               | differing backend clustering methods could be
               | interesting.
        
               | uoaei wrote:
               | Well, they're not _supposed_ to be the same clusters. The
               | reason people develop new methods is to surpass the old
               | ones.
               | 
               | I'm just saying that the method described in the link
               | seems to be exactly what you are describing: using
               | document embedding vectors as input to soft clustering
                | mechanisms akin to LDA. Of course it does not interface
                | perfectly with the theoretical underpinnings of LDA,
                | because those assume bag-of-words (count-based) inputs.
               | 
               | As an aside, "TFA" translates to "the fucking article"
               | and is a reference to the classic Internet acronym "RTFM"
               | standing for "read the fucking manual". Both are passive-
               | aggressive-cum-colloquial ways to imply that answers are
               | in places you would expect to find them, if only you go
               | to read the source.
        
               | chaxor wrote:
               | I think he _might_ mean term frequency analysis?
        
               | chaxor wrote:
               | Nevermind. I should have read what was said.
        
       | uniqueuid wrote:
        | It is true that BERTopic is a great tool. It's modern, it's
        | modular, and it's pretty performant.
        | 
        | That said, I want to caution against using topic modeling as a
        | one-size-fits-all solution. As the author stresses, this is one
        | particular approach which uses a combination of embeddings
        | (sentence, or other), UMAP, and HDBSCAN. Both UMAP and HDBSCAN
        | can be slow, so it might be worthwhile to check out the GPU-
        | enabled versions of both from the cuML package.
        | 
        | In addition, topic models have a huge number of degrees of
        | freedom, and the solution you will get depends on many
        | (seemingly arbitrary) choices. In other words, these are not
        | _the_ topics, they are _some_ topics.
        | 
        | All that said, it's awesome: really great work by Maarten
        | Grootendorst and a great blog post by James Briggs.
        | 
        | [edit] here is a link to the fast CUDA version of BERTopic by
        | RAPIDS: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT...
        
         | hack_ml wrote:
          | It's seamless to accelerate BERTopic on GPUs with cuML now, as
          | of the latest release (v0.10.0).
          | 
          | Check out the docs at:
          | https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...
          | 
          | All you need to do is the following:
          | 
          |     from bertopic import BERTopic
          |     from cuml.cluster import HDBSCAN
          |     from cuml.manifold import UMAP
          | 
          |     # Create instances of GPU-accelerated UMAP and HDBSCAN
          |     umap_model = UMAP(n_components=5, n_neighbors=15,
          |                       min_dist=0.0)
          |     hdbscan_model = HDBSCAN(min_samples=10,
          |                             gen_min_span_tree=True)
          | 
          |     # Pass the above models to be used in BERTopic
          |     topic_model = BERTopic(umap_model=umap_model,
          |                            hdbscan_model=hdbscan_model)
          |     topics, probs = topic_model.fit_transform(docs)
        
         | whakim wrote:
          | I agree that this is cool. That being said, the results show
         | that we have a long, long way to go. The topics are pretty
         | incoherent: what are "would", "should", and "use" doing in
         | there? The words have no context, so (for example) "self"
         | clearly refers to the python reserved keyword, but you have no
         | way of knowing that. Not to mention (as another comment brings
         | up), the topics aren't named so it's pretty hard to actually
         | figure out what they're about. If we think about real-world
         | usage this output would be practically useless - it tells you
         | that people talk about investing in r/investing and pytorch in
         | r/pytorch. If you want meaningful, actionable information about
         | what people are talking about in a large corpus of unstructured
          | text data, then for the foreseeable future you'll need to
          | involve humans in the loop, even if ML assistance plays a big
          | part.
        
       | Der_Einzige wrote:
       | A huggingface space I wrote to let you play with BERTopic in your
       | browser:
       | 
       | https://huggingface.co/spaces/Hellisotherpeople/HF-BERTopic
        
         | leobg wrote:
         | Wow. This is great!
        
       ___________________________________________________________________
       (page generated 2022-05-11 23:01 UTC)