[HN Gopher] Tokens, n-grams, and bag-of-words models (2023)
       ___________________________________________________________________
        
       Tokens, n-grams, and bag-of-words models (2023)
        
       Author : fzliu
       Score  : 84 points
       Date   : 2024-04-04 09:27 UTC (2 days ago)
        
 (HTM) web link (zilliz.com)
 (TXT) w3m dump (zilliz.com)
        
       | politelemon wrote:
        | I found this useful, so thanks for sharing it; n-grams and bag-
        | of-words are terms I've encountered in the past but skipped
        | without thinking about them.
       | 
        | It's making me wonder: why are models usually in Python? Could
        | these models be implemented in, say, Scala, Kotlin, or NodeJS,
        | and have there been attempts to do so?
        
         | gentleman11 wrote:
          | There are some libraries (numpy, pandas, and sklearn) that are
          | written in Python and highly optimized. People use Python
          | because those libraries, plus a few other tools, are Python.
          | Plus, Python is nice and easy to use on any platform.
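          | 
          | As a minimal illustration of the kind of one-liner that keeps
          | people in Python, here is a bag-of-words model (with word
          | bigrams) built with sklearn's CountVectorizer; the toy
          | documents are made up for the example:
          | 
          |     from sklearn.feature_extraction.text import CountVectorizer
          | 
          |     docs = ["the cat sat on the mat", "the dog sat on the log"]
          | 
          |     # Word-level bag-of-words, counting unigrams and bigrams.
          |     vectorizer = CountVectorizer(ngram_range=(1, 2))
          |     counts = vectorizer.fit_transform(docs)  # sparse doc-term matrix
          | 
          |     print(vectorizer.get_feature_names_out())
          |     print(counts.toarray())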
        
           | politelemon wrote:
           | Oh that bit does make sense, there's a natural gravitation
           | towards it and the more people use it the better it gets.
           | Have there been attempts to recreate a similar ecosystem in
           | other languages though.
        
             | rabbits77 wrote:
             | Yes, of course. Things like numpy are far from new and in
             | many cases are easy to use wrappers around the real
             | computational workhorses written in, say, Fortran or C. For
             | example, check out lapack
             | https://hpc.llnl.gov/software/mathematical-software/lapack
             | which is still, as far as I know, the gold standard.
             | 
             | Python is widely adapted not really because the language
             | itself is any good or particularly performant (it's not at
             | all), but because it presents easy to use wrapper APIs to
             | developers who may have a poor background in Computer
             | Science, but are rather stronger in statistics, general
             | data analysis, or applied fields like economics.
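              | 
              | A minimal sketch of that layering, assuming NumPy and SciPy
              | are installed: np.linalg.solve and a direct call to
              | LAPACK's dgesv routine (exposed via scipy.linalg.lapack)
              | produce the same solution to a linear system.
              | 
              |     import numpy as np
              |     from scipy.linalg import lapack
              | 
              |     rng = np.random.default_rng(0)
              |     a = rng.standard_normal((4, 4))
              |     b = rng.standard_normal((4, 1))
              | 
              |     # NumPy's solver calls LAPACK's gesv under the hood.
              |     x_numpy = np.linalg.solve(a, b)
              | 
              |     # The raw LAPACK routine, via SciPy's f2py wrappers.
              |     lu, piv, x_lapack, info = lapack.dgesv(a, b)
              | 
              |     assert info == 0  # info == 0 means dgesv succeeded
              |     assert np.allclose(x_numpy, x_lapack)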
        
             | pedrosorio wrote:
             | Before Python, I believe Fortran (one of the first
             | programming languages) was for many years a key language in
             | scientific computing.
             | 
              | MATLAB is a proprietary computing platform with its own
              | language that was very widely used (and probably still the
              | standard in some fields of engineering). That it is
              | proprietary, and that its language is not great as a
              | general-purpose programming language, were significant
              | drawbacks to wider adoption.
             | 
             | As far as ML is concerned, the deep learning revolution
             | happened when Python was the dominant language for
             | scientific computing (mostly due to NumPy and SciPy), so
             | naturally a lot of the ecosystem was built to have Python
             | as the main (scripting) language. The rest is history.
             | 
             | As far as "attempts to recreate similar ecosystem":
             | 
              | PyTorch (currently the most popular deep learning
              | framework) was originally Torch (initial release: 2002,
              | long before "Deep Learning" was a thing), with Lua as the
              | scripting language. Python's momentum in the 2010s meant
              | that it was eventually rewritten around Python, thus
              | becoming PyTorch.
             | 
              | The Julia language is a famous, somewhat more recent
              | example (first release 2012, stable 1.0 in 2018) of a
              | language built partly to address some of Python's
              | shortcomings and to "replace" it as the default for
              | scientific computing. It didn't succeed - it's hard to
              | move people away from an ecosystem with as much of a head
              | start and momentum as Python had in the 2010s.
        
           | jjtheblunt wrote:
            | I think you're mixing up distinct things: numpy, pandas, and
            | sklearn are NOT written primarily in Python, but largely in
            | C. That underlying C code is why they are highly optimized;
            | the Python bindings on top just make them ergonomic to use
            | from Python.
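            | 
            | A small timing sketch of that point, assuming only NumPy is
            | installed: the same reduction computed in pure Python and
            | with NumPy's C-backed implementation.
            | 
            |     import timeit
            |     import numpy as np
            | 
            |     data = list(range(1_000_000))
            |     arr = np.arange(1_000_000)
            | 
            |     # Both compute the same sum; NumPy loops in C, not Python.
            |     py_time = timeit.timeit(lambda: sum(data), number=10)
            |     np_time = timeit.timeit(lambda: arr.sum(), number=10)
            | 
            |     print(f"pure Python: {py_time:.3f}s  NumPy: {np_time:.3f}s")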
        
             | PaulHoule wrote:
              | People aren't going to do this in Java, which has a
              | xenophobic attitude toward non-Java libraries. I mean, you
              | could, but people wouldn't.
        
           | fbdab103 wrote:
           | Don't forget the Jupyter notebook ecosystem. Having a nice
           | REPL environment where you can quickly iterate on code is a
           | huge boon compared to a language which might have a slower
           | feedback cycle.
           | 
           | This is incredibly important in an environment where you are
           | exploring datasets and generating new hypotheses.
        
         | marginalia_nu wrote:
          | "N-gram" is, unfortunately, an overloaded term.
          | 
          | It's used for tuples of "words", as in this article, but also
          | for tuples of characters. The confusion stems from the fact
          | that "token" may refer to words, as in this article (this is
          | common in modern NLP), but older sources often use the word
          | token to refer to graphemes ("letters") and various other
          | breakdowns. The Wikipedia article on bigrams[1] is a good
          | example of such usage.
         | 
         | [1] https://en.wikipedia.org/wiki/Bigram
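          | 
          | A minimal sketch of the two senses, in plain Python (helper
          | function and sample text are made up for the example): the
          | same n-gram routine applied once to a word sequence and once
          | to a character sequence.
          | 
          |     def ngrams(seq, n=2):
          |         """All n-grams (as tuples) over any sequence."""
          |         return [tuple(seq[i:i + n])
          |                 for i in range(len(seq) - n + 1)]
          | 
          |     text = "to be or not to be"
          | 
          |     # Word-level bigrams: [('to', 'be'), ('be', 'or'), ...]
          |     word_bigrams = ngrams(text.split())
          | 
          |     # Character-level bigrams: [('t', 'o'), ('o', ' '), ...]
          |     char_bigrams = ngrams(text)
          | 
          |     print(word_bigrams)
          |     print(char_bigrams[:5])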
        
           | 3abiton wrote:
           | Does it relate to tokenization?
        
       ___________________________________________________________________
       (page generated 2024-04-06 23:00 UTC)