[HN Gopher] Show HN: Array - A Better Python List
___________________________________________________________________
Show HN: Array - A Better Python List
Author : lauriat
Score : 71 points
Date : 2021-01-10 12:10 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| topper-123 wrote:
| I'd like to have a chaining operator in Python, like R is
| getting. Then the example could be:
|
| > a |> zip(b) |> map(func1) |> filter(func2) |> forall(func3)
|
| The advantages would be that this would work with all
| lists/iterables, so no need to make a special types.
| brundolf wrote:
| The hard part with this is it sort of requires currying once
| you have >1 arguments, or something equivalent. I suppose
| Python could carve out an implicit behavior where the first or
| last argument is what gets fed into, but that feels potentially
| confusing as the calling syntax is now "lying" to you. In
| JavaScript doing a proper currying style isn't too hard because
| of arrow-syntax, but using python's function definition syntax
| to make a curried function would be hideous (not to mention,
| the standard library isn't done that way). Maybe you could have
| a "curryify" higher-order function. Or, the final option would
| be to have an explicit "insert previous value here" syntax as a
| part of the pipeline syntax, which is something the JS proposal
| has played with. Makes things more verbose (|> double(#)
| instead of |> double), but is maximally flexible and minimally
| confusing.
|
| In short: it's a lot more complicated than it seems, but I
| agree that this style makes this type of thing 1000x more
| readable.
| topper-123 wrote:
| Just letting it implicitly be the first parameter would be
| good enough IMO, and a nice symmetry to self` in methods.
| That'd be very simple, which would be a plus in my book.
|
| Pandas allows the first param in a pipe to be a
| tuple[callable, str], where the second argument would signify
| the parameter location, e.g. `val |> (func, "param_name")`
| which gives some flexibility.
|
| But yeah, if you open up to piping, there are a lot of
| possible choices to be made and easy to go overboard also
| IMO.
| jamespwilliams wrote:
| Looks cool. bool (__bool__) Returns whether all
| elements evaluate to True.
|
| I'd be worried that this will trip people up who use the
| if l: print l[0] # or whatever
|
| pattern
| fantod wrote:
| To be fair, using "if something" in Python is pretty much
| always a good way to trip yourself up.
| nemetroid wrote:
| I've yet to see a (popular) style guide recommend against "if
| something:".
| fantod wrote:
| I don't really see how this is a reason not to carefully
| consider the type of an object when using truthiness.
| pansa2 wrote:
| PEP8 recommends using `if seq:` instead of more verbose
| alternatives like `if len(seq):`.
| orf wrote:
| For good reason - len(something) or alternatives might be
| expensive to compute, bool(something) is actually what you
| are trying to do and can be optimised depending on the
| container.
| fantod wrote:
| For the basic sequence types (list, tuple, and range),
| len is definitely not expensive to compute. For custom
| types, it will depend on your implementation of __len__
| (but then, computation of bool(...) will also depend on
| your implementation of __bool__).
| orf wrote:
| Yes, len() with the basic types are not expensive, and it
| can vary based on the container implementation, but
| that's not really the point.
|
| The reason you should use "if x" is the same reason you
| should use "if x not in y" rather than "if not x in y".
| It better expresses the semantics of your operation with
| the side effect that it may be faster.
| fantod wrote:
| Oh yeah, totally agree that it's idiomatic and (in my
| opinion) cleaner.
| fantod wrote:
| If you know that "something" is a sequence, then is the
| idiomatic thing to do. The point is that any time you rely
| on truthiness of a value, you need to think (sometimes
| quite carefully) about what type of value you're dealing
| with.
| lauriat wrote:
| Thanks!
|
| Good point. However setting def __bool__(self):
| return self.nonEmpty
|
| would mess up certain methods e.g. .index for nested Arrays as
| __eq__ is computed elementwise and bool(Array(False, False))
| would evaluate to True.
|
| Maybe a warning would be appropriate? (as is the case with
| ndarrays)
| pansa2 wrote:
| > _bool(Array(False, False)) would evaluate to True_
|
| Isn't that consistent with the built-in `list`, though,
| because `bool([False, False])` is True?
| lauriat wrote:
| My explanation was pretty poor, let me rephrase
|
| For example, when calling Array((x, y), (z,
| w)).index((z, w))
|
| the following piece of code is executed
| bool(Array((x, y)).__eq__((z, w))) =
| bool(Array(False, False))
|
| If __bool__ returned whether the Array is nonempty,
| bool(Array(False, False)) would evaluate to True and the
| method would wrongly return 0.
|
| You're right that it would be more clear if __bool__ would
| behave similarly, but since Array computes operations
| element-wise, it isn't possible.
| brian_herman wrote:
| Can you do the same thing with dicts and make it so
| d['non_existant_key'] does not create an exception?
| basdftrewq wrote:
| from collections import defaultdict d =
| defaultdict(int) d['non_existant_key']
| st0le wrote:
| Not quite the same what OP asked for. This will create the
| key and assign value 0 to it.
| lauriat wrote:
| You can already do that with
| d.get("non_existant_key", default)
| njharman wrote:
| > all(map(func3, filter(func2, map(func1, zip(a, b))))
|
| That is super readable to me. Working left to right or inside
| out. There is one, clear, balanced, familiar, consistently used
| punctuation to guide you, parens, if you need it but adds little
| noise if you dont.
|
| The "bunch of functions taking and returning an iterator" is a
| great paradigm. So clean and flexiable, and powerfull. ESP
| combined with Python's "many things are iterable" and is trivial
| to write your own iterator
| faitswulff wrote:
| I come from Ruby, but it's pretty unreadable to me. Not that I
| _couldn't_ , I just don't want to. So I doubt that it's
| objectively super easy to read.
|
| Any sort of reading inside out, right to left is a barrier to
| easy reading. This is why people like pipes in functional
| languages, right? You just read it in one direction.
| lifthrasiir wrote:
| I have used Python for decades (not so much nowadays, but
| still) and it is very unreadable for me. It's clear that it is
| a data pipeline but the input and filters are all in a wrong
| order, thus backtracking is required for reading. I have the
| same complaint about str.join.
| orf wrote:
| > all(map(func3, filter(func2, map(func1, zip(a, b)))))
|
| You definitely wouldn't do this in "traditional Python". You'd
| use a comprehension of some kind, or even the walrus operator,
| which is quite possibly faster and more readable than several
| chained lambdas.
| [deleted]
| lauriat wrote:
| Fair enough, the example is a bit exaggerated. You could
| implement it with comprehensions all(func3(y)
| for y in (func1(x) for x in zip(a, b)) if func2(y))
|
| It most likely is a bit faster, but I wouldn't say it's more
| readable.
| nerdponx wrote:
| I feel obligated to point out the existence of the "array"
| package in the Python standard library:
| https://docs.python.org/3/library/array.html
|
| I'm sure the author is aware of it, but readers might not be.
| lauriat wrote:
| That's why the A is capitalised ;)
| Immortal333 wrote:
| Chaining has its own benefits. But I think this doesn't fit the
| definition of "Pythonic". Again, "Pythonic" is highly debatable.
| But, You can always break down big chain of operations, into
| smaller chain using good variable naming in-between.
|
| Many operations are implemented as iterator in python on list,
| like filter, groupby. Looking at your code, its looks like you're
| not doing lazy computation. (Correct me if I wrong). This could
| be huge performance impact, depending upon use case of list.
| nerdponx wrote:
| I am with you on this. Personally, I would rather continue
| using Toolz (https://github.com/pytoolz/toolz), and contribute
| additional helper/utility methods to that library.
|
| The whole point of some things being functions versus methods
| is that they are generic rather than specialized. The generic
| iterator protocol is probably the _best_ feature about the
| Python language, and it 's both a damn shame and bad design to
| not use it.
|
| If you really wanted to make an improvement over built in
| lists, the thing to do would be to implement some kind of fully
| lazy "query planning" engine, like what Apache Spark has. Every
| method call registers a new method to be applied with the query
| planner, but does not execute it. Execution only occurs when
| you explicitly request it. That way you can effectively compile
| in efficient but readable code that takes multiple passes over
| the data into efficient operations internally that only make
| one pass, or at least fewer passes. This also naturally lends
| itself to parallelization/concurrency.
| jmuhlich wrote:
| Dask does the lazy evaluation and query planning thing on
| numpy arrays and pandas dataframes, and can execute in
| parallel. It mimics most of their native interfaces which
| makes it a pretty easy drop-in.
|
| https://docs.dask.org/en/latest/
| lauriat wrote:
| I understand the unpythonic nature of Arrays may startle some
| hardcore pythonistas, but ability to chain functions was one of
| the main reasons why I wrote the package as I find nested
| function calls ugly and sometimes rather hard to decipher.
|
| Regarding the perfomance, Arrays aren't meant to be super high
| performing but rather a simple way to manipulate sequences. For
| the best performance you should go with generic python, toolz
| or other.
| feanaro wrote:
| > But, You can always break down big chain of operations, into
| smaller chain using good variable naming in-between.
|
| I don't think so. Very frequently the intermediate values
| represent nothing in particular and naming them simply results
| in visual noise.
|
| I think this is comparable to SQL or LINQ statements. Consider
| what those would look like if you had to name every
| intermediate values instead of being able to filter and group
| on-the-fly.
|
| Of course you can make a mess out of those too, by building
| huge unreadable expressions, but that's also an extreme,
| similar to naming every intermediate step.
| notretarded wrote:
| Why would I use this over numpy?
| lauriat wrote:
| If you're doing matrix multiplication or other math operations
| on fixed size sequences, you shouldn't.
|
| If, however, you need the dynamic nature of the built-in list
| or functional methods with a touch of numpyness, you should
| give Array a spin.
| RocketSyntax wrote:
| Thank you for improving things and sharing.
|
| I use numpy & pandas, lists & dicts every day. I read your
| docs/github page, but can you help me see the value?
|
| However, I do think there are lots of common tasks that need to
| be done with lists that should be methods rather than fancy
| footwork =)
|
| For example: https://stackoverflow.com/questions/3462143/get-
| difference-b...
|
| As you allude to w your zip loop:
| https://stackoverflow.com/questions/1919044/is-there-a-bette...
| lauriat wrote:
| Thank you for taking the time to check it out!
|
| Naturally if you're dealing with big arrays/tensors, numpy is
| the best choice for operating on sequences.
|
| However, ndarrays have downsides for certain use cases - as
| ndarrays are fixed size, adding elements is very slow, also
| they don't support functional methods (or rather you have to
| create a new array every time you apply e.g. a map), and
| ndarrays of any other type than numbers doesn't really make
| sense.
|
| Many of the methods are wrappers for built-ins, but I find the
| syntax of Arrays cleaner than the weirdness of the builtins.
|
| For example, while applying an async "starmap" to an Array is
| just a method call, with built-in lists you would have go
| through the whole hassle of importing both ThreadPoolExecutor
| and starmap, creating an executor, scheduling the function, and
| finally converting the result back to a list.
| RocketSyntax wrote:
| ndarrays "create a new array every time you apply"
|
| That resonates with me now that you explain that I can't do
| it.
|
| I do like chaining things in pandas like
| `df.select_types("float").head(100).plot.hist()`
| nerdponx wrote:
| Be careful though. Numpy and Pandas go through some trouble
| to make sure that the data _inside_ the array is not
| actually copied. For instance, reshaping and slicing just
| return memory views. Pandas emits a somewhat-infamous
| warning about it that often confuses newbies.
| goodside wrote:
| I know this is an early/experimental project, but the README
| could use more motivation before diving into basic usage. Asking
| someone to change their general-purpose containers is a big ask.
|
| It looks like Array mostly consolidates functional features
| already available in standard libraries, and the main innovation
| is a redesigned swiss-army-knife API.
|
| Good APIs are important, but my instinct is they aren't _this_
| important. Using enhanced versions of built-in container types
| sounds nice, but do you really want to be keeping track of
| whether something is a normal list or an Array? Do you want to
| force people who read your code to learn this library to work
| with something as fundamental as lists? It's not an impossible
| bar to clear (e.g. NumPy, Pandas, Dask, xarray) but it's a high
| one.
| lauriat wrote:
| Thanks for the feedback! Redesigned swiss-army-knife is well
| put.
|
| I'm sure Array's not for everyone, but for some, including me
| it's a nifty tool. I don't expect people to memorise all the
| features of the library - the aim was to name and document each
| feature clearly such that finding the right method would be
| easy with the help of an IDE.
| asimjalis wrote:
| I find this really useful and plan to use it. Thanks for writing
| and sharing.
|
| One use case for the chaining/FP style that I find particularly
| powerful is building out logic on the REPL. The chaining style
| allows me to incrementally grow my chain like a unix pipeline,
| see the results, use that to tweak the chain, until I finally
| have what I want.
|
| This type of instantaneous feedback loop is both highly
| productive and also extremely fun.
| lauriat wrote:
| cheers!
| pedrovhb wrote:
| I think this is neat but I'm not sure it's the best way to go
| about things.
|
| > all(map(func3, filter(func2, map(func1, zip(a, b)))))
|
| > a.zip(b).map(func1).filter(func2).forall(func3)
|
| The original is indeed terrible and the second version is a bit
| better. A lot better than either one, though, is splitting your
| logic into multiple lines and assigning a descriptive identifier
| to each step. Maybe even throw in some inline comments if you're
| particularly respectful of others' time.
|
| As tempting as it is to do something super clever and cram a ton
| of functionality into a small number of lines or characters (it
| does feel good), it's just better to be a bit more verbose and
| write simple, obvious code. I feel like code should be read like
| a book, not a puzzle.
| bko wrote:
| > a.zip(b).map(func1).filter(func2).forall(func3)
|
| Lets make this a somewhat concrete example.
|
| ---
|
| heights = [1,2,3]
|
| widths = [4,5,6]
|
| # printing area greater than 10
|
| # functional
|
| heights.zip(widths).map(to_area).filter(lambda area: area >
| 10).forall(lambda a: print("Area " + a)
|
| #Verbose way
|
| hw_zipped = zip(a,b)
|
| areas = hw_zipped.map(to_inches)
|
| big_areas = areas.filter(a: a > 10)
|
| for a in big_areas: print("Area " + a)
|
| ---
|
| Which do you prefer? I would argue the right level of
| abstraction is the functional way in this example, and its
| often the case in my experience, especially in python where you
| don't often use a namespace to store these intermediary
| variables and you have can't rely on typing
| claytonjy wrote:
| As another point of comparison, as of python 3.8 you can do
| this in one list comp without nesting or double-computing
| areas with the walrus: result = [area for
| x,y in zip(heights,widths) if (area := to_area(x,y)) > 10]
|
| I don't think that's very easy to read; I'd opt for two list
| comps like areas = [to_area(x,y) for x,y in
| zip(heights,widths)] result = [area for area in areas
| if area > 10]
|
| But I agree with OP that map+filter is easier to read.
| bko wrote:
| I agree. My main problem is I don't want intermediary
| variables floating around. Especially something like
| "areas". If python localized variables to a blocked
| namespace, I wouldn't mind
|
| In scala:
|
| ---
|
| val widths = Seq(1,2,3)
|
| val heights = Seq(4,5,6)
|
| widths.zip(heights).foreach { case (w, h) => {
| val area = w * h if (area > 10) {
| println(s"Area: ${area}") }
|
| }}
|
| println(area) // error: not found: value area
| syrrim wrote:
| for x, y in zip(a,b): area = to_area(x, y)
| if area > 10: print(f"Area {area}")
|
| >in python where you don't often use a namespace to store
| these intermediary variables
|
| Hm? Most python code is within a function, in my experience.
| bko wrote:
| You can abstract it out to a function but I think its
| overkill, even if you generalize to something like
| print_area_filter(heights, widths, value, cmp) or whatever
|
| If its not in a function, your example may (or may not
| depending on length if either a or b have a length of zero)
| create a floating variable called area out there.
| derwiki wrote:
| Code is read more often than it's written; optimize for
| reading.
| 6gvONxR4sf7o wrote:
| You can split the a.b.c.d onto different lines and comment
| each, which is a decent middle ground sometimes
| (a\n.b\n.c\n.d). A problem, still, is exceptions and debugging.
| You get paged and see that something went wrong in that
| expression that does so many different things, and it's much
| more frustrating to track down the bug. It makes step debugging
| trickier too. I'd love better error message/debugger support
| for that kind of programming.
| Phemist wrote:
| This feels luke a strawman example. I feel like list
| comprehension results in a much more readable example here. I
| think, at least.
|
| > all(func3(a) for h,w in zip(a,b) for a in func1(h,w) if
| func2(a))
| lauriat wrote:
| Fair enough. Readability is subjective but I understand the
| sentiment. Constructing list comprehensions of such long
| chained expressions can be rather tedious and error prone,
| though (as your example shows).
| snicker7 wrote:
| > assigning a descriptive identifier to each step
|
| Working with data scientists, in practice, these identifiers
| are usually "arr1", "arr2", &c. I'd rather have method
| chaining. Often the intermediates are not meaningful.
| disgruntledphd2 wrote:
| I agree with you in general, people (especially data
| scientists) are bad at naming things.
|
| It's probably the core skill of good programmers though, so
| it should be taught more. I don't think anyone sets out to
| use misleading names, but it's easy for name and code to
| diverge, and it's crippling to readability.
|
| However, often when refactoring/updating such data scientist
| code (or even understanding), I need to break apart the long
| method chains, and this is much, much more annoying than
| dealing with crummy names.
|
| At least I can print the values associated with the names,
| which is not easily possible in the really long method chain.
| brundolf wrote:
| What I like about "cramming a ton of functionality into [a
| single expression]" is that it doesn't leak any intermediates
| to the rest of the block, and it doesn't allow for mutation.
| There's a single output exposed; you can't accidentally use the
| wrong value downstream. You could wrap it all in an inner
| function, I guess, but that seems like overkill unless you plan
| to reuse it.
|
| Though to be fair, having explicit intermediate variables is
| idiomatic in Python, from what I've seen. It's one of my
| biggest pet-peeves about the language, but it's not without
| precedent.
| lauriat wrote:
| I agree, and yes, the line may be a bit excessive. The idea of
| Arrays is not just to cram a heap of functions to a single
| line. The readability (at least to me) is improved even with
| e.g. a single map arr.map(func)
|
| vs. list(map(func, arr))
| ElevenPhonons wrote:
| Are these really the same?
|
| The idiomatic Python 3 version uses generators to compose the
| computation and to avoid unnecessary memory allocations. Does
| funct.Array also do this?
|
| - https://docs.python.org/3/library/functions.html#map -
| https://docs.python.org/3/library/functions.html#filter
___________________________________________________________________
(page generated 2021-01-10 23:03 UTC)