[HN Gopher] Multilingual transformer and BERTopic for topic mode...
___________________________________________________________________
Multilingual transformer and BERTopic for topic modeling: The case
of Serbian
Author : nikolamilosevic
Score : 33 points
Date : 2024-02-09 10:18 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| uniqueuid wrote:
| BERTopic is great, but some people forget that even magic, er,
| UMAP+HDBSCAN on embeddings cannot solve some problems:
|
| - statistical tools (including LDA and variants) define topics to
| be coherent latent clusters of words/embeddings. These correspond
| to a _mixture_ of real-world concepts, including events, topics,
| issues, etc. So when you apply BERTopic, you often get clusters
| that represent very different things on a conceptual level.
|
| - the end-to-end pipeline is very nice, especially when adding
| things like cluster labeling from LLMs on top. But we should not
| forget that this stacks many steps with implicit errors on top of
| each other. It is not easy to get a transparent and robust story
| why one cluster solution is better than any other.
|
| - one of the implicit choices is picking UMAP, which will tend to
| find very coherent clusters but "throw out" many (up to ~50%)
| cases into an outlier cluster (-1). Sometimes that's not what we
| want, and then tuning is needed (e.g. use kmeans instead).
|
| - random footnote: cuML for really fast BERTopic is great, but
| seems to produce inferior solutions. Better test that before
| putting it into production.
|
| With all that said, I love that now we can use this tool and
| debate its merits on this level, rather than everyone
| implementing their own homegrown and probably bug-rich version of
| it.
| lmeyerov wrote:
| +1
|
| We do a variant of GPU UMAP+HDBSCAN => viz + scores a LOT. We
| are big fans!
|
| We've grown in 2 pretty different directions from BERTopic to
| support our users' investigations over the last few years:
|
| -- At least for our users, before going deep into BERTopic and
| the tweaks you are suggesting, we find a much earlier and more
| basic step is to play with which columns to roll in. BERTopic
| is about text columns, but in practice, that's often just 1-3
| of the 10-100 columns our users are working with! For example,
| in Splunk logs, beyond some message column, we also care about
| the timestamp, risk level, IP address columns, etc. Same thing
| if you are, say, analyzing transactions or user activity in
| Databricks or Snowflake: there's a lot of impactful metadata
| outside of the text columns. IMO, much of the beauty of UMAP is
| its success with 100s and, with GPUs, 1000s of columns.
|
| -- For interactive visual analysis, we found it super valuable
| since early on to show the similarity connections that UMAP
| finds, and make them interactive for reclustering. Most UMAP
| visualizers are instead static, basically a scatterplot you can
| zoom in on. In contrast, being able to filter, recluster,
| recolor, etc., is a pretty important part of the iteration flow,
| as it eliminates needing to go back to coding for every little
| step. By making UMAP's inferred similarity edges 'live', you
| can now treat it as an interactive similarity graph, and
| filter->recluster on-the-fly. (It also helps understand nuance
| within a cluster, as you can see which edges exist, with what
| strength, and even interactive summaries of why they exist.)
|
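The "live similarity edges" idea is easy to prototype headlessly. UMAP exposes its fuzzy k-NN graph as the sparse `reducer.graph_` attribute; the sketch below uses a plain scikit-learn k-NN graph as a dependency-light stand-in, but the extract-edges-then-filter-by-weight step is the same either way (the similarity transform and threshold are arbitrary illustrative choices):

```python
# Sketch: treat a k-NN similarity graph (a stand-in for UMAP's
# sparse reducer.graph_) as an edge list you can filter and
# recluster on, instead of a static scatterplot.
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Weighted k-NN graph; mode="distance" stores distances as edge weights.
g = kneighbors_graph(X, n_neighbors=10, mode="distance")
edges = coo_matrix(g)

# Turn distances into similarities and keep only the strong edges --
# the interactive "filter -> recluster" step, done in code.
sim = np.exp(-edges.data)
strong = sim > np.median(sim)
edge_list = list(zip(edges.row[strong], edges.col[strong], sim[strong]))
print(f"kept {len(edge_list)} of {edges.nnz} edges")
```

Any graph library (or a GPU graph tool like the one described above) can then recluster or lay out just the surviving edges.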
| I actually just gave a talk that got recorded on how we used
| this to win a US Cyber Command AI competition:
| https://www.youtube.com/live/4kRMI1wEU7I?si=bL-182aA8d-V9m5J...
|
| That talk also deals with the scale problems of extending this
| to say all of your customer data or log data. Especially when
| supporting more than just some text columns, we need to easily
| & quickly encode those as well for the UMAP to pick them up. We
| recently released cu_cat (our GPU fork of dirty_cat) to
| preprocess all these wild datatypes, and will soon be turning it
| on by default for pygraphistry's "g.nodes(df).umap().plot()" --
| these three pieces have become the lego pieces we use for
| enabling workflows like in the talk. It's super fun, and for so
| little code, surprisingly effective!
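The "encode the non-text columns too" step can be sketched with plain scikit-learn (cu_cat / dirty_cat automate a fancier version of this): one-hot the categoricals, scale the numerics, and concatenate with the text embeddings before UMAP. The log-like column names and the random stand-in embeddings below are made up for illustration:

```python
# Sketch: encode mixed metadata columns into one numeric matrix that a
# UMAP (or any embedding) step could consume alongside text embeddings.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical log rows.
df = pd.DataFrame({
    "risk_level": ["low", "high", "medium", "high"],
    "bytes_sent": [120, 98000, 430, 51000],
    "hour_of_day": [3, 14, 14, 23],
})

enc = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["risk_level"]),
    ("num", StandardScaler(), ["bytes_sent", "hour_of_day"]),
], sparse_threshold=0)  # force a dense output for easy stacking
meta = enc.fit_transform(df)

# Stand-in for 384-dim sentence embeddings of a message column.
text_emb = np.random.default_rng(0).normal(size=(len(df), 384))

features = np.hstack([meta, text_emb])
print(features.shape)  # 3 one-hot + 2 numeric + 384 text dims
```

The resulting `features` matrix is what you would hand to UMAP, which is exactly where its tolerance for hundreds of columns pays off.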
| unhammer wrote:
| > Serbian [and] other morphologically rich low-resource languages
|
| What term do we use for the ~6000 languages with even fewer
| resources (and typically richer morphology)?
| mcswell wrote:
| Not sure what you're asking. "Lesser resourced", "less
| resourced", "low density", and this "low resource" are all
| terms I've heard. Not to be confused with "less commonly
| taught" languages, obviously.
|
| If you're asking whether Serbian is really low/less resource,
| there's no defining line.
|
| And of course there are still unwritten languages.
___________________________________________________________________
(page generated 2024-02-10 23:00 UTC)