[HN Gopher] My NumPy year: Creating a DType for the next generat...
       ___________________________________________________________________
        
       My NumPy year: Creating a DType for the next generation of
       scientific computing
        
       Author : elashri
       Score  : 68 points
       Date   : 2024-10-23 17:44 UTC (5 hours ago)
        
 (HTM) web link (quansight.com)
 (TXT) w3m dump (quansight.com)
        
       | jofer wrote:
       | First off, major kudos and this is very very cool work. It's also
       | a great article and really, everyone should go read it.
       | 
       | Beyond "just" better string arrays, my favorite side effect of
       | this is efficient NaN support in string arrays. The article talks
       | about this a lot, but I had already started this comment before
       | fully reading the article :p
       | 
       | I mean, sure, the old approach was object arrays, and you can do
       | it there because each element is an independent object, but
       | they're super inefficient. This both makes things efficient _and_
       | has a really cool side effect of supporting something that had
       | become common partly as an accident of the old object array
       | approach - NaNs in arrays of strings.
       | 
       | This is really really really useful work and it's _super_ cool!!
        
         | ngoldbaum wrote:
         | Thank you, this really means a lot.
        
       | davidinosauro wrote:
       | The NEP (NumPy Enhancement Proposal) link for those more curious
       | about the details than the story:
       | https://numpy.org/neps/nep-0055-string_dtype.html
        
         | davidinosauro wrote:
         | In particular from the article I was confused by this:
         | 
         | > NumPy doesn't offer a way to store data outside of the array
         | > buffer--there's no concept of "sidecar storage" in NumPy.
         | 
         | But then it goes on to say that the strings are stored on the
         | heap (which clearly is also possible with dtype=object) with an
         | arena allocator. Reading the NEP now.
        
           | ngoldbaum wrote:
           | Well, there _was_ no concept of sidecar storage. Now we have
           | the hack we came up with for StringDType to store data on the
           | DType instance and also make it so StringDType arrays don't
           | share StringDType instances, unless the array is a view.
           | 
           | EDIT: looking back at the NEP, I'm not sure it does a great
           | job explaining exactly how the per-array descriptor works.
           | Ultimately it's powered by a hook in the DType API:
           | https://github.com/numpy/numpy/pull/24988. There is only one
           | spot in NumPy where array buffers are allocated, so we hooked
           | there and made sure any arrays with newly allocated buffers
           | get a new DType instance.
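
The ownership rule described in this comment (string data lives with the DType instance, each newly allocated array gets its own instance, and views share their base's instance) can be illustrated with a toy model. This is a pure-Python sketch, not NumPy's implementation: ToyArena, ToyDescriptor, and ToyStringArray are hypothetical names invented here, and the real arena allocator is written in C and works differently in detail.

```python
class ToyArena:
    """Append-only byte buffer standing in for heap-side string storage."""

    def __init__(self):
        self.buf = bytearray()

    def store(self, text):
        # Append the encoded string and return where it landed.
        data = text.encode("utf-8")
        offset = len(self.buf)
        self.buf += data
        return offset, len(data)

    def load(self, offset, size):
        return self.buf[offset:offset + size].decode("utf-8")


class ToyDescriptor:
    """Hypothetical per-array dtype instance that owns the arena."""

    def __init__(self):
        self.arena = ToyArena()


class ToyStringArray:
    def __init__(self, strings=(), _slots=None, _descriptor=None):
        if _slots is None:
            # Newly allocated buffer: it gets its own descriptor and arena.
            self.descriptor = ToyDescriptor()
            self.slots = [self.descriptor.arena.store(s) for s in strings]
        else:
            # View: shares both the buffer and the descriptor of its base.
            self.descriptor = _descriptor
            self.slots = _slots

    def view(self):
        return ToyStringArray(_slots=self.slots, _descriptor=self.descriptor)

    def __getitem__(self, i):
        # The array buffer holds only (offset, size); the bytes are in the arena.
        return self.descriptor.arena.load(*self.slots[i])


a = ToyStringArray(["numpy", "string"])
b = ToyStringArray(["dtype"])
v = a.view()

print(a.descriptor is b.descriptor)  # separate arrays, separate descriptors
print(v.descriptor is a.descriptor)  # a view reuses its base's descriptor
```

The point of the toy is only the sharing pattern: because the array buffer stores references into the arena, a view must use the same arena as its base, while an unrelated array must not.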
        
       | hnman10 wrote:
       | Would people not use Arrow for this now? Arrow has all these
       | types and has also replaced NumPy in Pandas.
        
         | ngoldbaum wrote:
         | This was a case of convergent evolution: both projects ended up
         | working simultaneously on similar ideas.
         | 
         | One issue with using Arrow directly in NumPy is PyArrow exposes
         | an immutable 1D array, while NumPy exposes a mutable ND array.
         | 
         | See also
         | https://numpy.org/neps/nep-0055-string_dtype.html#related-wo...
        
           | hopfenspergerj wrote:
           | Are the pandas people considering this as the default string
           | type? Seems like it would be a slam dunk.
        
       ___________________________________________________________________
       (page generated 2024-10-23 23:00 UTC)