[HN Gopher] My NumPy year: Creating a DType for the next generat...
___________________________________________________________________
My NumPy year: Creating a DType for the next generation of
scientific computing
Author : elashri
Score : 68 points
Date : 2024-10-23 17:44 UTC (5 hours ago)
(HTM) web link (quansight.com)
(TXT) w3m dump (quansight.com)
| jofer wrote:
| First off, major kudos and this is very very cool work. It's also
| a great article and really, everyone should go read it.
|
| Beyond "just" better string arrays, my favorite side effect of
| this is efficient NaN support in string arrays. The article talks
| about this a lot, but I had already started this comment before
| fully reading the article :p
|
| I mean, sure, the old approach was object arrays, and you can do
| it there because each element is an independent object, but
| they're super inefficient. This both makes things efficient _and_
| has a really cool side effect of supporting something that had
| become common partly as an accident of the old object array
| approach - NaNs in arrays of strings.
|
| This is really really really useful work and it's _super_ cool!!
| ngoldbaum wrote:
| Thank you, this really means a lot.
| davidinosauro wrote:
| The NEP (Numpy Enhancement Proposal) link for those more curious
| about the details than the story:
| https://numpy.org/neps/nep-0055-string_dtype.html
| davidinosauro wrote:
| In particular from the article I was confused by this:
|
| > NumPy doesn't offer a way to store data outside of the array
| buffer--there's no concept of "sidecar storage" in NumPy.
|
| But then it goes on to say the strings are stored on the
| heap (which clearly is also possible with dtype=object) with an
| arena allocator. Reading the NEP now.
| ngoldbaum wrote:
| Well, there _was_ no concept of sidecar storage. Now we have
| the hack we came up with for StringDType to store data on the
| DType instance and also make it so StringDType arrays don't
| share StringDType instances, unless the array is a view.
|
| EDIT: looking back at the NEP, I'm not sure it does a great
| job explaining exactly how the per-array descriptor works.
| Ultimately it's powered by a hook in the DType API:
| https://github.com/numpy/numpy/pull/24988. There is only one
| spot in NumPy where array buffers are allocated, so we hooked
| there and made sure any arrays with newly allocated buffers
| get a new DType instance.
| hnman10 wrote:
| Would people not use Arrow for this now? Arrow has all these
| types and has also replaced NumPy in Pandas.
| ngoldbaum wrote:
| This was a case of convergent evolution: both projects ended up
| working simultaneously on similar ideas.
|
| One issue with using Arrow directly in NumPy is PyArrow exposes
| an immutable 1D array, while NumPy exposes a mutable ND array.
|
| See also
| https://numpy.org/neps/nep-0055-string_dtype.html#related-wo...
| hopfenspergerj wrote:
| Are the pandas people considering this as the default string
| type? Seems like it would be a slam dunk.
___________________________________________________________________
(page generated 2024-10-23 23:00 UTC)