[HN Gopher] Multilingual transformer and BERTopic for topic mode...
       ___________________________________________________________________
        
       Multilingual transformer and BERTopic for topic modeling: The case
       of Serbian
        
       Author : nikolamilosevic
       Score  : 33 points
       Date   : 2024-02-09 10:18 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | uniqueuid wrote:
       | BERTopic is great, but some people forget that even magic, er,
       | UMAP+HDBSCAN on embeddings cannot solve every problem:
       | 
       | - statistical tools (including LDA and its variants) define
       | topics as coherent latent clusters of words/embeddings. These
       | correspond to a _mixture_ of real-world concepts (events, topics,
       | issues, etc.), so when you apply BERTopic you often get clusters
       | that lump together very different things on a conceptual level
       | 
       | - the end-to-end pipeline is very nice, especially when adding
       | things like LLM-based cluster labeling on top. But we should not
       | forget that this stacks many steps, each with its own implicit
       | errors, on top of each other. It is not easy to get a
       | transparent and robust story for why one clustering solution is
       | better than any other.
       | 
       | - one of the implicit choices is the clustering step: HDBSCAN
       | (run on the UMAP embedding) tends to find very coherent clusters
       | but "throws out" many (up to ~50%) documents into an outlier
       | cluster (-1). Sometimes that's not what we want, and then tuning
       | is needed (e.g. swapping in k-means, which assigns every
       | document to some cluster).
       | 
       | - random footnote: cuML for really fast BERTopic is great, but
       | it seems to produce inferior solutions. Better to test that
       | before putting it into production.
       | 
       | With all that said, I love that now we can use this tool and
       | debate its merits on this level, rather than everyone
       | implementing their own homegrown and probably bug-rich version of
       | it.
        
         | lmeyerov wrote:
         | +1
         | 
         | We do a variant of gpu umap+hdbscan => viz + scores a LOT. We
         | are big fans!
         | 
         | We've grown in two pretty different directions from BERTopic
         | to support our users' investigations over the last few years:
         | 
         | -- At least for our users, before going deep into BERTopic
         | and the tweaks you are suggesting, we find a much earlier and
         | more basic step is to play with which columns to roll in.
         | BERTopic is about text columns, but in practice, that's often
         | just 1-3 of the 10-100 columns our users are working with! For
         | example, in Splunk logs, beyond some message column, we also
         | care about timestamp, risk level, IP address columns, etc.
         | Same thing if you are, say, analyzing transactions or user
         | activity in Databricks or Snowflake: there's a lot of
         | impactful metadata outside the text columns. IMO, much of the
         | beauty of UMAP is its success with 100s and, with GPUs, 1000s
         | of columns.
         | 
         | -- For interactive visual analysis, we have found it super
         | valuable since early on to show the similarity connections
         | that UMAP finds, and to make them interactive for
         | reclustering. Most UMAP visualizers are instead static,
         | basically a scatterplot you can zoom in on. In contrast, being
         | able to filter, recluster, recolor, etc., is a pretty
         | important part of the iteration flow, as it eliminates needing
         | to go back to coding for every little step. By making UMAP's
         | inferred similarity edges 'live', you can treat the result as
         | an interactive similarity graph and filter->recluster
         | on-the-fly. (It also helps with understanding nuance within a
         | cluster, as you can see which edges exist, with what strength,
         | and even get interactive summaries of why they exist.)
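
The filter-then-recluster idea can be sketched on a CPU with a plain k-NN graph standing in for UMAP's fuzzy similarity graph (this is an illustrative toy, not the GPU-backed interactive pipeline the comment describes; all data and thresholds are assumptions):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Toy stand-in for UMAP's fuzzy similarity graph: a k-NN distance graph
X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.4, random_state=1)
G = kneighbors_graph(X, n_neighbors=5, mode="distance")

# "Filter": drop the weakest half of the edges (largest distances) ...
G.data[G.data > np.median(G.data)] = 0.0
G.eliminate_zeros()

# ... then "recluster" on-the-fly by taking connected components
n_comp, labels = connected_components(G, directed=False)
print(n_comp, "clusters after filtering")
```

In an interactive tool the filter step is a slider or lasso selection rather than a median cutoff, but the mechanic is the same: edit the edge set, recompute clusters, no round-trip to code.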
         | 
         | I actually just gave a talk that got recorded on how we used
         | this to win a US Cyber Command AI competition:
         | https://www.youtube.com/live/4kRMI1wEU7I?si=bL-182aA8d-V9m5J...
         | 
         | That talk also deals with the scale problems of extending
         | this to, say, all of your customer data or log data.
         | Especially when supporting more than just some text columns,
         | we need to encode the other columns easily & quickly as well
         | so that UMAP can pick them up. We recently released cu_cat
         | (our GPU fork of dirty_cat) to preprocess all these wild
         | datatypes, and will soon be turning it on by default for
         | pygraphistry's "g.nodes(df).umap().plot()" -- these three have
         | become the lego pieces we use for enabling workflows like in
         | the talk. It's super fun, and for so little code,
         | surprisingly effective!
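
To make the "roll in non-text columns" point concrete, here is a CPU-only sketch of the kind of mixed-column encoding that cu_cat / dirty_cat automate, using plain scikit-learn (the column names and toy log records are invented for illustration); the resulting matrix is what you would hand to UMAP:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy log records: one text column plus metadata columns
df = pd.DataFrame({
    "message": ["login failed", "login ok", "disk full", "login failed"],
    "risk": [3.0, 0.5, 2.0, 3.5],
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
})

enc = ColumnTransformer([
    ("msg", TfidfVectorizer(), "message"),  # text -> sparse tf-idf
    ("risk", StandardScaler(), ["risk"]),   # numeric -> z-scored
    ("ip", OneHotEncoder(), ["src_ip"]),    # categorical -> one-hot
])
X = enc.fit_transform(df)
print(X.shape)  # one row per record, one block of features per column
```

From there, `umap.UMAP(n_components=2).fit_transform(X)` (umap-learn) would give the 2-D layout; the point is that the metadata columns now influence the neighborhoods, not just the text.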
        
       | unhammer wrote:
       | > Serbian [and] other morphologically rich low-resource languages
       | 
       | What term do we use for the ~6,000 languages with even fewer
       | resources (and typically even richer morphology)?
        
         | mcswell wrote:
         | Not sure what you're asking. "Lesser resourced", "less
         | resourced", "low density", and this "low resource" are all
         | terms I've heard. Not to be confused with "less commonly
         | taught" languages, obviously.
         | 
         | If you're asking whether Serbian is really low/less resource,
         | there's no defining line.
         | 
         | And of course there are still unwritten languages.
        
       ___________________________________________________________________
       (page generated 2024-02-10 23:00 UTC)