[HN Gopher] Show HN: Array - A Better Python List
       ___________________________________________________________________
        
       Show HN: Array - A Better Python List
        
       Author : lauriat
       Score  : 71 points
       Date   : 2021-01-10 12:10 UTC (10 hours ago)
        
        
       | topper-123 wrote:
       | I'd like to have a chaining operator in Python, like R is
       | getting. Then the example could be:
       | 
       | > a |> zip(b) |> map(func1) |> filter(func2) |> forall(func3)
       | 
       | The advantages would be that this would work with all
       | lists/iterables, so no need to make special types.
        
         | brundolf wrote:
         | The hard part with this is it sort of requires currying once
         | you have >1 arguments, or something equivalent. I suppose
         | Python could carve out an implicit behavior where the first or
         | last argument is what gets fed into, but that feels potentially
         | confusing as the calling syntax is now "lying" to you. In
         | JavaScript doing a proper currying style isn't too hard because
         | of arrow-syntax, but using python's function definition syntax
         | to make a curried function would be hideous (not to mention,
         | the standard library isn't done that way). Maybe you could have
         | a "curryify" higher-order function. Or, the final option would
         | be to have an explicit "insert previous value here" syntax as a
         | part of the pipeline syntax, which is something the JS proposal
         | has played with. Makes things more verbose (|> double(#)
         | instead of |> double), but is maximally flexible and minimally
         | confusing.
         | 
         | In short: it's a lot more complicated than it seems, but I
         | agree that this style makes this type of thing 1000x more
         | readable.
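
A minimal sketch of two of the options described above, in plain Python. The `pipe` helper and the `double`/`add` functions are hypothetical names made up for illustration; `functools.partial` stands in for currying, and a lambda stands in for an explicit "insert previous value here" placeholder.

```python
from functools import partial, reduce


def pipe(value, *steps):
    """Feed value through each one-argument callable, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def double(x):
    return x * 2


def add(x, y):
    return x + y


# partial() fixes one argument up front, so the piped value fills the other.
result = pipe(5, double, partial(add, 3))          # add(3, 10)

# A lambda names the previous value explicitly, like a "#" placeholder would.
same = pipe(5, double, lambda prev: add(prev, 3))  # add(10, 3)
```

Both forms give 13 here; the difference is only in which argument slot receives the piped value.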
        
           | topper-123 wrote:
           | Just letting it implicitly be the first parameter would be
           | good enough IMO, and a nice symmetry to `self` in methods.
           | That'd be very simple, which would be a plus in my book.
           | 
           | Pandas allows the first param in a pipe to be a
           | tuple[callable, str], where the second argument would signify
           | the parameter location, e.g. `val |> (func, "param_name")`
           | which gives some flexibility.
           | 
           | But yeah, if you open up to piping, there are a lot of
           | possible choices to be made and easy to go overboard also
           | IMO.
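
For reference, pandas' `DataFrame.pipe` accepts a `(callable, data_keyword)` tuple. The sketch below imitates that convention in plain Python; the `pipe` helper and the `scale` function are hypothetical and are not part of pandas or of any proposed syntax.

```python
def pipe(value, *steps):
    """Pass value as the first argument of each step, or as the named
    keyword when a step is written as a (callable, keyword) tuple."""
    for step in steps:
        if isinstance(step, tuple):
            func, keyword = step
            value = func(**{keyword: value})
        else:
            value = step(value)
    return value


def scale(factor=2, data=None):
    # The piped value is not the first parameter here, hence the tuple form.
    return [item * factor for item in data]


result = pipe([3, 1, 2], sorted, (scale, "data"))
```

With the default `factor` of 2, `result` is `[2, 4, 6]`.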
        
       | jamespwilliams wrote:
       | Looks cool.
       | 
       |     bool (__bool__) Returns whether all elements evaluate to True.
       | 
       | I'd be worried that this will trip up people who use the
       | 
       |     if l:
       |         print(l[0])  # or whatever
       | 
       | pattern.
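
A small sketch of how that pattern could go wrong. `AllTruthy` is a hypothetical list subclass written for this example, assuming only the documented behaviour quoted above (truthiness means "all elements are truthy"); it is not the library's implementation.

```python
class AllTruthy(list):
    """Hypothetical list whose truthiness is all(elements)."""

    def __bool__(self):
        return all(self)


def first_or_none(seq):
    # The common "is it non-empty?" guard.
    if seq:
        return seq[0]
    return None


plain = first_or_none([0, 1])              # guard passes, returns 0
skipped = first_or_none(AllTruthy([0, 1])) # guard fails because 0 is falsy

# all() of an empty sequence is True, so the guard passes and indexing fails.
try:
    first_or_none(AllTruthy([]))
    empty_raised = False
except IndexError:
    empty_raised = True
```

So the guard can fail in both directions: it skips a non-empty container, and it lets an empty one through.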
        
         | fantod wrote:
         | To be fair, using "if something" in Python is pretty much
         | always a good way to trip yourself up.
        
           | nemetroid wrote:
           | I've yet to see a (popular) style guide recommend against "if
           | something:".
        
             | fantod wrote:
             | I don't really see how this is a reason not to carefully
             | consider the type of an object when using truthiness.
        
           | pansa2 wrote:
           | PEP8 recommends using `if seq:` instead of more verbose
           | alternatives like `if len(seq):`.
        
             | orf wrote:
             | For good reason - len(something) or alternatives might be
             | expensive to compute, bool(something) is actually what you
             | are trying to do and can be optimised depending on the
             | container.
        
               | fantod wrote:
               | For the basic sequence types (list, tuple, and range),
               | len is definitely not expensive to compute. For custom
               | types, it will depend on your implementation of __len__
               | (but then, computation of bool(...) will also depend on
               | your implementation of __bool__).
        
               | orf wrote:
               | Yes, len() with the basic types is not expensive, and it
               | can vary based on the container implementation, but
               | that's not really the point.
               | 
               | The reason you should use "if x" is the same reason you
               | should use "if x not in y" rather than "if not x in y".
               | It better expresses the semantics of your operation with
               | the side effect that it may be faster.
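
To illustrate the "may be faster" part, here is a sketch of a container where the two checks differ in cost. `LinkedStack` is a hypothetical class invented for this example: its `__len__` walks every node, while `__bool__` only looks at the head.

```python
class LinkedStack:
    """Hypothetical singly linked stack stored as nested (value, next) tuples."""

    def __init__(self):
        self._head = None

    def push(self, value):
        self._head = (value, self._head)

    def __len__(self):
        # O(n): has to walk the whole chain.
        count = 0
        node = self._head
        while node is not None:
            count += 1
            node = node[1]
        return count

    def __bool__(self):
        # O(1): emptiness only depends on the head.
        return self._head is not None


stack = LinkedStack()
empty_truthiness = bool(stack)
stack.push("a")
stack.push("b")
```

With a type like this, `if stack:` answers the emptiness question without counting, whereas `if len(stack):` counts every element first.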
        
               | fantod wrote:
               | Oh yeah, totally agree that it's idiomatic and (in my
               | opinion) cleaner.
        
             | fantod wrote:
             | If you know that "something" is a sequence, then it is the
             | idiomatic thing to do. The point is that any time you rely
             | on truthiness of a value, you need to think (sometimes
             | quite carefully) about what type of value you're dealing
             | with.
        
         | lauriat wrote:
         | Thanks!
         | 
         | Good point. However setting
         | 
         |     def __bool__(self):
         |         return self.nonEmpty
         | 
         | would mess up certain methods e.g. .index for nested Arrays as
         | __eq__ is computed elementwise and bool(Array(False, False))
         | would evaluate to True.
         | 
         | Maybe a warning would be appropriate? (as is the case with
         | ndarrays)
        
           | pansa2 wrote:
           | > _bool(Array(False, False)) would evaluate to True_
           | 
           | Isn't that consistent with the built-in `list`, though,
           | because `bool([False, False])` is True?
        
             | lauriat wrote:
             | My explanation was pretty poor, let me rephrase
             | 
             | For example, when calling
             | 
             |     Array((x, y), (z, w)).index((z, w))
             | 
             | the following piece of code is executed
             | 
             |     bool(Array((x, y)).__eq__((z, w))) = bool(Array(False, False))
             | 
             | If __bool__ returned whether the Array is nonempty,
             | bool(Array(False, False)) would evaluate to True and the
             | method would wrongly return 0.
             | 
             | You're right that it would be more clear if __bool__ would
             | behave similarly, but since Array computes operations
             | element-wise, it isn't possible.
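
The interaction described above can be reproduced with a toy class. This is a sketch under assumptions: `Elementwise` and `NonEmptyTruthy` are hypothetical list subclasses, not the library's code, and they rely on `list.index` taking the truthiness of whatever `__eq__` returns.

```python
class Elementwise(list):
    """Hypothetical list with element-wise == and all()-style truthiness."""

    def __eq__(self, other):
        return type(self)(a == b for a, b in zip(self, other))

    def __bool__(self):
        return all(self)


class NonEmptyTruthy(Elementwise):
    """Same element-wise ==, but truthiness means 'non-empty' like list."""

    def __bool__(self):
        return len(self) > 0


pairs = Elementwise([Elementwise([1, 2]), Elementwise([3, 4])])
correct = pairs.index(Elementwise([3, 4]))

broken_pairs = NonEmptyTruthy([NonEmptyTruthy([1, 2]), NonEmptyTruthy([3, 4])])
wrong = broken_pairs.index(NonEmptyTruthy([3, 4]))
```

Here `correct` is 1, while `wrong` is 0 because the first comparison yields `[False, False]`, which is non-empty and therefore truthy.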
        
       | brian_herman wrote:
       | Can you do the same thing with dicts and make it so
       | d['non_existant_key'] does not create an exception?
        
         | basdftrewq wrote:
         |     from collections import defaultdict
         |     d = defaultdict(int)
         |     d['non_existant_key']
        
           | st0le wrote:
           | Not quite the same as what OP asked for. This will create the
           | key and assign value 0 to it.
        
         | lauriat wrote:
         | You can already do that with
         | d.get("non_existant_key", default)
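
A short side-by-side of the two suggestions, using only the standard library. The key name and the default of 0 are arbitrary choices for the example.

```python
from collections import defaultdict

counts = defaultdict(int)
looked_up = counts["missing"]        # no exception, but the key is now stored
inserted = "missing" in counts

plain = {}
fallback = plain.get("missing", 0)   # no exception, dict left untouched
untouched = "missing" not in plain
```

Both lookups return 0, but only the `defaultdict` lookup adds the key as a side effect.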
        
       | njharman wrote:
       | > all(map(func3, filter(func2, map(func1, zip(a, b)))))
       | 
       | That is super readable to me. Working left to right or inside
       | out. There is one clear, balanced, familiar, consistently used
       | punctuation to guide you, parens, if you need it, but it adds
       | little noise if you don't.
       | 
       | The "bunch of functions taking and returning an iterator" is a
       | great paradigm. So clean, flexible, and powerful, especially
       | combined with Python's "many things are iterable" and how trivial
       | it is to write your own iterator.
        
         | faitswulff wrote:
         | I come from Ruby, but it's pretty unreadable to me. Not that I
         | _couldn't_, I just don't want to. So I doubt that it's
         | objectively super easy to read.
         | 
         | Any sort of reading inside out, right to left is a barrier to
         | easy reading. This is why people like pipes in functional
         | languages, right? You just read it in one direction.
        
         | lifthrasiir wrote:
         | I have used Python for decades (not so much nowadays, but
         | still) and it is very unreadable for me. It's clear that it is
         | a data pipeline but the input and filters are all in a wrong
         | order, thus backtracking is required for reading. I have the
         | same complaint about str.join.
        
       | orf wrote:
       | > all(map(func3, filter(func2, map(func1, zip(a, b)))))
       | 
       | You definitely wouldn't do this in "traditional Python". You'd
       | use a comprehension of some kind, or even the walrus operator,
       | which is quite possibly faster and more readable than several
       | chained lambdas.
        
         | [deleted]
        
         | lauriat wrote:
         | Fair enough, the example is a bit exaggerated. You could
         | implement it with comprehensions
         | 
         |     all(func3(y) for y in (func1(x) for x in zip(a, b)) if func2(y))
         | 
         | It most likely is a bit faster, but I wouldn't say it's more
         | readable.
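
To compare the variants concretely, here is a sketch with made-up `func1`, `func2`, and `func3` (the thread never defines them) and made-up inputs. The third form uses the walrus operator mentioned above and needs Python 3.8 or later.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]


def func1(pair):
    # zip() yields tuples, so func1 receives one (x, y) pair.
    x, y = pair
    return x + y


def func2(total):
    return total % 2 == 0


def func3(total):
    return total > 20


chained = all(map(func3, filter(func2, map(func1, zip(a, b)))))
nested = all(func3(y) for y in (func1(x) for x in zip(a, b)) if func2(y))
walrus = all(func3(y) for x in zip(a, b) if func2(y := func1(x)))
```

All three evaluate to the same result for these inputs; which one reads best is the point under debate.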
        
       | nerdponx wrote:
       | I feel obligated to point out the existence of the "array"
       | package in the Python standard library:
       | https://docs.python.org/3/library/array.html
       | 
       | I'm sure the author is aware of it, but readers might not be.
        
         | lauriat wrote:
         | That's why the A is capitalised ;)
        
       | Immortal333 wrote:
       | Chaining has its own benefits. But I think this doesn't fit the
       | definition of "Pythonic". Again, "Pythonic" is highly debatable.
       | But, You can always break down big chain of operations, into
       | smaller chain using good variable naming in-between.
       | 
       | Many operations on lists are implemented as iterators in Python,
       | like filter and groupby. Looking at your code, it looks like
       | you're not doing lazy computation (correct me if I'm wrong). This
       | could have a huge performance impact, depending on the use case.
        
         | nerdponx wrote:
         | I am with you on this. Personally, I would rather continue
         | using Toolz (https://github.com/pytoolz/toolz), and contribute
         | additional helper/utility methods to that library.
         | 
         | The whole point of some things being functions versus methods
         | is that they are generic rather than specialized. The generic
         | iterator protocol is probably the _best_ feature about the
         | Python language, and it's both a damn shame and bad design to
         | not use it.
         | 
         | If you really wanted to make an improvement over built in
         | lists, the thing to do would be to implement some kind of fully
         | lazy "query planning" engine, like what Apache Spark has. Every
         | method call registers a new method to be applied with the query
         | planner, but does not execute it. Execution only occurs when
         | you explicitly request it. That way you can effectively compile
         | inefficient but readable code that takes multiple passes over
         | the data into efficient operations internally that only make
         | one pass, or at least fewer passes. This also naturally lends
         | itself to parallelization/concurrency.
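
A very reduced sketch of the "record now, execute on request" idea. `LazySeq` is a hypothetical class written for this example; it only fuses recorded `map` and `filter` steps into a single pass and does no real query planning or optimization, unlike what Spark does.

```python
class LazySeq:
    """Hypothetical wrapper that records steps and runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        return LazySeq(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazySeq(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # One pass over the data: every item flows through all steps in turn.
        results = []
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def traced_square(x):
    calls.append(x)
    return x * x


plan = LazySeq(range(5)).map(traced_square).filter(lambda sq: sq % 2 == 0)
calls_before = list(calls)   # nothing has run yet
evens = plan.collect()
```

Building `plan` calls nothing; `collect()` then produces `[0, 4, 16]` in one traversal of the source.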
        
           | jmuhlich wrote:
           | Dask does the lazy evaluation and query planning thing on
           | numpy arrays and pandas dataframes, and can execute in
           | parallel. It mimics most of their native interfaces which
           | makes it a pretty easy drop-in.
           | 
           | https://docs.dask.org/en/latest/
        
         | lauriat wrote:
         | I understand the unpythonic nature of Arrays may startle some
         | hardcore pythonistas, but the ability to chain functions was one of
         | the main reasons why I wrote the package as I find nested
         | function calls ugly and sometimes rather hard to decipher.
         | 
         | Regarding the performance, Arrays aren't meant to be super high
         | performing but rather a simple way to manipulate sequences. For
         | the best performance you should go with generic python, toolz
         | or other.
        
         | feanaro wrote:
         | > But, You can always break down big chain of operations, into
         | smaller chain using good variable naming in-between.
         | 
         | I don't think so. Very frequently the intermediate values
         | represent nothing in particular and naming them simply results
         | in visual noise.
         | 
         | I think this is comparable to SQL or LINQ statements. Consider
         | what those would look like if you had to name every
         | intermediate value instead of being able to filter and group
         | on-the-fly.
         | 
         | Of course you can make a mess out of those too, by building
         | huge unreadable expressions, but that's also an extreme,
         | similar to naming every intermediate step.
        
       | notretarded wrote:
       | Why would I use this over numpy?
        
         | lauriat wrote:
         | If you're doing matrix multiplication or other math operations
         | on fixed size sequences, you shouldn't.
         | 
         | If, however, you need the dynamic nature of the built-in list
         | or functional methods with a touch of numpyness, you should
         | give Array a spin.
        
       | RocketSyntax wrote:
       | Thank you for improving things and sharing.
       | 
       | I use numpy & pandas, lists & dicts every day. I read your
       | docs/github page, but can you help me see the value?
       | 
       | However, I do think there are lots of common tasks that need to
       | be done with lists that should be methods rather than fancy
       | footwork =)
       | 
       | For example: https://stackoverflow.com/questions/3462143/get-
       | difference-b...
       | 
       | As you allude to w your zip loop:
       | https://stackoverflow.com/questions/1919044/is-there-a-bette...
        
         | lauriat wrote:
         | Thank you for taking the time to check it out!
         | 
         | Naturally if you're dealing with big arrays/tensors, numpy is
         | the best choice for operating on sequences.
         | 
         | However, ndarrays have downsides for certain use cases - as
         | ndarrays are fixed size, adding elements is very slow, also
         | they don't support functional methods (or rather you have to
         | create a new array every time you apply e.g. a map), and
         | ndarrays of any other type than numbers don't really make
         | sense.
         | 
         | Many of the methods are wrappers for built-ins, but I find the
         | syntax of Arrays cleaner than the weirdness of the builtins.
         | 
         | For example, while applying an async "starmap" to an Array is
         | just a method call, with built-in lists you would have to go
         | through the whole hassle of importing both ThreadPoolExecutor
         | and starmap, creating an executor, scheduling the function, and
         | finally converting the result back to a list.
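
For comparison, this is roughly what the standard-library route could look like. It is one possible reading of the steps listed above, not the library's internals; the `area` function and the input pairs are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import starmap


def area(width, height):
    return width * height


pairs = [(1, 4), (2, 5), (3, 6)]

with ThreadPoolExecutor(max_workers=1) as executor:
    # starmap() is lazy, so list() is called inside the worker thread
    # to make sure the work actually happens there.
    future = executor.submit(lambda: list(starmap(area, pairs)))
    areas = future.result()
```

`areas` ends up as `[4, 10, 18]`; the point is the amount of ceremony, not the result.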
        
           | RocketSyntax wrote:
           | ndarrays "create a new array every time you apply"
           | 
           | That resonates with me now that you explain that I can't do
           | it.
           | 
           | I do like chaining things in pandas like
           | `df.select_dtypes("float").head(100).plot.hist()`
        
             | nerdponx wrote:
             | Be careful though. Numpy and Pandas go through some trouble
             | to make sure that the data _inside_ the array is not
             | actually copied. For instance, reshaping and slicing just
             | return memory views. Pandas emits a somewhat-infamous
             | warning about it that often confuses newbies.
        
       | goodside wrote:
       | I know this is an early/experimental project, but the README
       | could use more motivation before diving into basic usage. Asking
       | someone to change their general-purpose containers is a big ask.
       | 
       | It looks like Array mostly consolidates functional features
       | already available in standard libraries, and the main innovation
       | is a redesigned swiss-army-knife API.
       | 
       | Good APIs are important, but my instinct is they aren't _this_
       | important. Using enhanced versions of built-in container types
       | sounds nice, but do you really want to be keeping track of
       | whether something is a normal list or an Array? Do you want to
       | force people who read your code to learn this library to work
       | with something as fundamental as lists? It's not an impossible
       | bar to clear (e.g. NumPy, Pandas, Dask, xarray) but it's a high
       | one.
        
         | lauriat wrote:
         | Thanks for the feedback! Redesigned swiss-army-knife is well
         | put.
         | 
         | I'm sure Array's not for everyone, but for some, including me,
         | it's a nifty tool. I don't expect people to memorise all the
         | features of the library - the aim was to name and document each
         | feature clearly such that finding the right method would be
         | easy with the help of an IDE.
        
       | asimjalis wrote:
       | I find this really useful and plan to use it. Thanks for writing
       | and sharing.
       | 
       | One use case for the chaining/FP style that I find particularly
       | powerful is building out logic on the REPL. The chaining style
       | allows me to incrementally grow my chain like a unix pipeline,
       | see the results, use that to tweak the chain, until I finally
       | have what I want.
       | 
       | This type of instantaneous feedback loop is both highly
       | productive and also extremely fun.
        
         | lauriat wrote:
         | cheers!
        
       | pedrovhb wrote:
       | I think this is neat but I'm not sure it's the best way to go
       | about things.
       | 
       | > all(map(func3, filter(func2, map(func1, zip(a, b)))))
       | 
       | > a.zip(b).map(func1).filter(func2).forall(func3)
       | 
       | The original is indeed terrible and the second version is a bit
       | better. A lot better than either one, though, is splitting your
       | logic into multiple lines and assigning a descriptive identifier
       | to each step. Maybe even throw in some inline comments if you're
       | particularly respectful of others' time.
       | 
       | As tempting as it is to do something super clever and cram a ton
       | of functionality into a small number of lines or characters (it
       | does feel good), it's just better to be a bit more verbose and
       | write simple, obvious code. I feel like code should be read like
       | a book, not a puzzle.
        
         | bko wrote:
         | > a.zip(b).map(func1).filter(func2).forall(func3)
         | 
         | Let's make this a somewhat concrete example.
         | 
         | ---
         | 
         | heights = [1,2,3]
         | 
         | widths = [4,5,6]
         | 
         | # printing area greater than 10
         | 
         | # functional
         | 
         | heights.zip(widths).map(to_area).filter(lambda area: area >
         | 10).forall(lambda a: print("Area " + str(a)))
         | 
         | # verbose way
         | 
         | hw_zipped = heights.zip(widths)
         | 
         | areas = hw_zipped.map(to_area)
         | 
         | big_areas = areas.filter(lambda a: a > 10)
         | 
         | for a in big_areas: print("Area " + str(a))
         | 
         | ---
         | 
         | Which do you prefer? I would argue the right level of
         | abstraction is the functional way in this example, and it's
         | often the case in my experience, especially in python where you
         | don't often use a namespace to store these intermediary
         | variables and you can't rely on typing.
        
           | claytonjy wrote:
           | As another point of comparison, as of python 3.8 you can do
           | this in one list comp without nesting or double-computing
           | areas with the walrus:
           | 
           |     result = [area for x,y in zip(heights,widths)
           |               if (area := to_area(x,y)) > 10]
           | 
           | I don't think that's very easy to read; I'd opt for two list
           | comps like
           | 
           |     areas = [to_area(x,y) for x,y in zip(heights,widths)]
           |     result = [area for area in areas if area > 10]
           | 
           | But I agree with OP that map+filter is easier to read.
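
A runnable version of the variants being compared, assuming a `to_area` that simply multiplies its two arguments (the thread never defines it). The map+filter form uses `itertools.starmap` so that each zipped pair is unpacked into two arguments.

```python
from itertools import starmap

heights = [1, 2, 3]
widths = [4, 5, 6]


def to_area(height, width):
    return height * width


# One comprehension; the walrus keeps the computed area for reuse.
walrus = [area for x, y in zip(heights, widths) if (area := to_area(x, y)) > 10]

# Two comprehensions with a named intermediate list.
areas = [to_area(x, y) for x, y in zip(heights, widths)]
two_step = [area for area in areas if area > 10]

# Built-in map/filter style.
map_filter = list(filter(lambda area: area > 10, starmap(to_area, zip(heights, widths))))
```

One side note: a name bound with `:=` inside a comprehension is bound in the enclosing scope, so the walrus form also leaves an `area` variable behind.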
        
             | bko wrote:
             | I agree. My main problem is I don't want intermediary
             | variables floating around. Especially something like
             | "areas". If python localized variables to a blocked
             | namespace, I wouldn't mind
             | 
             | In scala:
             | 
             | ---
             | 
             | val widths = Seq(1,2,3)
             | 
             | val heights = Seq(4,5,6)
             | 
             | widths.zip(heights).foreach { case (w, h) => {
             |     val area = w * h
             |     if (area > 10) {
             |         println(s"Area: ${area}")
             |     }
             | }}
             | 
             | println(area) // error: not found: value area
        
           | syrrim wrote:
           |     for x, y in zip(a,b):
           |         area = to_area(x, y)
           |         if area > 10:
           |             print(f"Area {area}")
           | 
           | >in python where you don't often use a namespace to store
           | these intermediary variables
           | 
           | Hm? Most python code is within a function, in my experience.
        
             | bko wrote:
             | You can abstract it out to a function, but I think it's
             | overkill, even if you generalize to something like
             | print_area_filter(heights, widths, value, cmp) or whatever.
             | 
             | If it's not in a function, your example may create a floating
             | variable called area out there (or may not, if either a or b
             | has a length of zero).
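
The scoping behaviour being discussed can be checked directly. In this sketch, `loop_scope` is a hypothetical helper that runs the same kind of loop inside a function and reports which local names exist afterwards.

```python
def loop_scope(heights, widths):
    for h, w in zip(heights, widths):
        area = h * w
    # Names bound by the loop survive it, but only inside this function.
    return sorted(locals())


after_loop = loop_scope([1, 2, 3], [4, 5, 6])
after_empty_loop = loop_scope([], [])
```

`after_loop` is `['area', 'h', 'heights', 'w', 'widths']`, while `after_empty_loop` is just `['heights', 'widths']`: a `for` loop has no scope of its own, and its body never binds anything when the input is empty. The same loop at module level would leave those names as globals.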
        
         | derwiki wrote:
         | Code is read more often than it's written; optimize for
         | reading.
        
         | 6gvONxR4sf7o wrote:
         | You can split the a.b.c.d onto different lines and comment
         | each, which is a decent middle ground sometimes
         | (a\n.b\n.c\n.d). A problem, still, is exceptions and debugging.
         | You get paged and see that something went wrong in that
         | expression that does so many different things, and it's much
         | more frustrating to track down the bug. It makes step debugging
         | trickier too. I'd love better error message/debugger support
         | for that kind of programming.
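
One way to get some visibility back is a pass-through step that records intermediate values. The sketch below uses a hypothetical `Chain` wrapper with a `tap` method; neither is claimed to exist in the library under discussion.

```python
class Chain:
    """Hypothetical minimal wrapper with chainable map/filter/tap."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, func):
        return Chain(map(func, self._items))

    def filter(self, predicate):
        return Chain(filter(predicate, self._items))

    def tap(self, label, log):
        # Record a snapshot of the current values, then pass self along.
        log.append((label, list(self._items)))
        return self

    def to_list(self):
        return list(self._items)


log = []
result = (
    Chain(zip([1, 2, 3], [4, 5, 6]))
    .map(lambda hw: hw[0] * hw[1])    # height * width for each pair
    .tap("areas", log)
    .filter(lambda area: area > 10)   # keep only the large areas
    .tap("big", log)
    .to_list()
)
```

Writing one call per line also means a traceback or a debugger breakpoint can point at a single step rather than at the whole expression.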
        
         | Phemist wrote:
         | This feels like a strawman example. I feel like list
         | comprehension results in a much more readable example here. I
         | think, at least.
         | 
         | > all(func3(a) for h,w in zip(a,b) for a in func1(h,w) if
         | func2(a))
        
           | lauriat wrote:
           | Fair enough. Readability is subjective but I understand the
           | sentiment. Constructing list comprehensions of such long
           | chained expressions can be rather tedious and error prone,
           | though (as your example shows).
        
         | snicker7 wrote:
         | > assigning a descriptive identifier to each step
         | 
         | Working with data scientists, in practice, these identifiers
         | are usually "arr1", "arr2", &c. I'd rather have method
         | chaining. Often the intermediates are not meaningful.
        
           | disgruntledphd2 wrote:
           | I agree with you in general, people (especially data
           | scientists) are bad at naming things.
           | 
           | It's probably the core skill of good programmers though, so
           | it should be taught more. I don't think anyone sets out to
           | use misleading names, but it's easy for name and code to
           | diverge, and it's crippling to readability.
           | 
           | However, often when refactoring/updating such data scientist
           | code (or even understanding), I need to break apart the long
           | method chains, and this is much, much more annoying than
           | dealing with crummy names.
           | 
           | At least I can print the values associated with the names,
           | which is not easily possible in the really long method chain.
        
         | brundolf wrote:
         | What I like about "cramming a ton of functionality into [a
         | single expression]" is that it doesn't leak any intermediates
         | to the rest of the block, and it doesn't allow for mutation.
         | There's a single output exposed; you can't accidentally use the
         | wrong value downstream. You could wrap it all in an inner
         | function, I guess, but that seems like overkill unless you plan
         | to reuse it.
         | 
         | Though to be fair, having explicit intermediate variables is
         | idiomatic in Python, from what I've seen. It's one of my
         | biggest pet-peeves about the language, but it's not without
         | precedent.
        
         | lauriat wrote:
         | I agree, and yes, the line may be a bit excessive. The idea of
         | Arrays is not just to cram a heap of functions into a single
         | line. The readability (at least to me) is improved even with
         | e.g. a single map
         | 
         |     arr.map(func)
         | 
         | vs.
         | 
         |     list(map(func, arr))
        
         | ElevenPhonons wrote:
         | Are these really the same?
         | 
         | The idiomatic Python 3 version uses generators to compose the
         | computation and to avoid unnecessary memory allocations. Does
         | funct.Array also do this?
         | 
         | - https://docs.python.org/3/library/functions.html#map
         | - https://docs.python.org/3/library/functions.html#filter
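
The laziness of the built-in `map` can be observed with a function that records its calls. This sketch only demonstrates the built-in; it makes no claim about whether funct.Array evaluates lazily or eagerly. `traced_square` is a made-up helper.

```python
calls = []


def traced_square(x):
    calls.append(x)
    return x * x


lazy = map(traced_square, [1, 2, 3])
calls_before = list(calls)        # map() has not called the function yet

first = next(lazy)                # computes exactly one element
calls_after_first = list(calls)

rest = list(lazy)                 # consuming the iterator runs the remainder
```

Wrapping the same call in `list(...)` right away, as in `list(map(func, arr))`, forces all of the work and allocates the full result up front.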
        
       ___________________________________________________________________
       (page generated 2021-01-10 23:03 UTC)