[HN Gopher] Tokens, n-grams, and bag-of-words models (2023)
___________________________________________________________________
Tokens, n-grams, and bag-of-words models (2023)
Author : fzliu
Score : 84 points
Date : 2024-04-04 09:27 UTC (2 days ago)
(HTM) web link (zilliz.com)
(TXT) w3m dump (zilliz.com)
| politelemon wrote:
| I found this useful, so thanks for sharing it; ngrams and bag-of-
| words are terms I've encountered in the past but skipped without
| thinking about them.
|
| It's making me wonder, why are models usually in Python? Could
| these models be implemented in say, Scala, Kotlin, or NodeJS, and
| have there been attempts to do so?
| gentleman11 wrote:
| There are some libraries, numpy, pandas, and sklearn, that are
| written in Python and highly optimized. People use Python
| because those libraries, plus a few other tools, are Python.
| Plus, Python is nice and easy to use on any platform.
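|
| To make that concrete, here's a minimal bag-of-words sketch
| with scikit-learn (the toy sentences are made up; this assumes
| a recent scikit-learn with get_feature_names_out):
|
|     # Bag-of-words and word-level n-grams with CountVectorizer.
|     from sklearn.feature_extraction.text import CountVectorizer
|
|     docs = ["the cat sat on the mat", "the dog sat on the log"]
|
|     # Unigram bag-of-words: one column per distinct word,
|     # values are raw counts.
|     bow = CountVectorizer()
|     counts = bow.fit_transform(docs)
|     print(bow.get_feature_names_out())
|     print(counts.toarray())
|
|     # Word bigrams (n = 2) instead of single words.
|     bigrams = CountVectorizer(ngram_range=(2, 2))
|     print(bigrams.fit(docs).get_feature_names_out())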
| politelemon wrote:
| Oh, that bit does make sense; there's a natural gravitation
| towards it, and the more people use it the better it gets.
| Have there been attempts to recreate a similar ecosystem in
| other languages, though?
| rabbits77 wrote:
| Yes, of course. Things like numpy are far from new and in
| many cases are easy-to-use wrappers around the real
| computational workhorses written in, say, Fortran or C. For
| example, check out LAPACK
| https://hpc.llnl.gov/software/mathematical-software/lapack
| which is still, as far as I know, the gold standard.
|
| Python is widely adopted not really because the language
| itself is any good or particularly performant (it's not at
| all), but because it presents easy-to-use wrapper APIs to
| developers who may have a poor background in Computer
| Science but are stronger in statistics, general data
| analysis, or applied fields like economics.
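|
| A small sketch of what that wrapping looks like from the
| Python side (which LAPACK routine gets dispatched to depends
| on the NumPy build):
|
|     import numpy as np
|
|     a = np.array([[3.0, 1.0], [1.0, 2.0]])
|     b = np.array([9.0, 8.0])
|
|     # Solved by a compiled LAPACK routine (a *gesv variant),
|     # not by Python-level code.
|     x = np.linalg.solve(a, b)
|     print(x)
|
|     # Shows which BLAS/LAPACK build NumPy is linked against.
|     np.show_config()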
| pedrosorio wrote:
| Before Python, I believe Fortran (one of the first
| programming languages) was for many years a key language in
| scientific computing.
|
| MATLAB is a proprietary computing platform with its own
| language that was very widely used (and probably still the
| standard in some fields of engineering). The fact that it
| is proprietary and the language is not great as a general
| programming language were significant drawbacks to wide
| adoption.
|
| As far as ML is concerned, the deep learning revolution
| happened when Python was the dominant language for
| scientific computing (mostly due to NumPy and SciPy), so
| naturally a lot of the ecosystem was built to have Python
| as the main (scripting) language. The rest is history.
|
| As for "attempts to recreate a similar ecosystem":
|
| PyTorch (currently the most popular deep learning
| framework) was originally Torch (initial release: 2002 -
| long before "Deep Learning" was a thing), with Lua as the
| scripting language. Python's momentum in the 2010s meant
| that it was eventually rewritten with Python as the
| scripting language, thus becoming PyTorch.
|
| Julia language is a famous somewhat recent example (first
| release 2012, stable 1.0 in 2018) of a language that was
| built partially to address some of Python's shortcomings
| and to "replace" it as the default for scientific
| computing. It didn't succeed - it's hard to move people
| away from an ecosystem with as much of a head start and
| momentum as Python had in the 2010s.
| jjtheblunt wrote:
| I think you're mixing distinct things: numpy, pandas, and
| sklearn are NOT written primarily in Python, but generally
| in C, which is why they are highly optimized. They rely on
| this underlying C code, with Python bindings on top for
| convenient, ergonomic use from Python.
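|
| A rough way to see that in practice (timings are illustrative
| and vary by machine):
|
|     import timeit
|     import numpy as np
|
|     xs = list(range(1_000_000))
|     arr = np.arange(1_000_000, dtype=np.float64)
|
|     # Same reduction: a Python-level loop vs. NumPy's
|     # compiled C code.
|     py = timeit.timeit(lambda: sum(x * x for x in xs), number=10)
|     nm = timeit.timeit(lambda: float(np.dot(arr, arr)), number=10)
|     print(f"pure Python: {py:.3f}s   numpy: {nm:.3f}s")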
| PaulHoule wrote:
| People aren't going to do this in Java which has a
| xenophobic attitude about libraries. I mean you could, but
| people wouldn't.
| fbdab103 wrote:
| Don't forget the Jupyter notebook ecosystem. Having a nice
| REPL environment where you can quickly iterate on code is a
| huge boon compared to a language which might have a slower
| feedback cycle.
|
| This is incredibly important in an environment where you are
| exploring datasets and generating new hypotheses.
| marginalia_nu wrote:
| "N-gram" is unfortunately an overloaded term.
|
| It's used for sequences of "words", as in this article, but
| also for tuples of characters. The confusion stems from the
| fact that "token" may refer to words, as it does in this
| article (this is common in modern NLP), but older sources
| often use the word "token" to refer to graphemes ("letters")
| and various other breakdowns. The Wikipedia article on
| bigrams[1] is a good example of such usage.
|
| [1] https://en.wikipedia.org/wiki/Bigram
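|
| For instance, both of these get called "bigrams" depending on
| the source (a small sketch):
|
|     # Word-level vs. character-level bigrams of the same text.
|     def ngrams(seq, n):
|         return [tuple(seq[i:i+n]) for i in range(len(seq)-n+1)]
|
|     text = "new york city"
|
|     print(ngrams(text.split(), 2))
|     # [('new', 'york'), ('york', 'city')]
|
|     print(ngrams(text, 2))
|     # [('n', 'e'), ('e', 'w'), ('w', ' '), (' ', 'y'), ...]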
| 3abiton wrote:
| Does it relate to tokenization?
___________________________________________________________________
(page generated 2024-04-06 23:00 UTC)