[HN Gopher] Query Your Python Lists
       ___________________________________________________________________
        
       Query Your Python Lists
        
       Author : mkalioby
       Score  : 78 points
       Date   : 2024-11-15 06:18 UTC (5 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sevensor wrote:
       | Having seen a lot of work come to grief because of the decision
       | to use pandas, anything that's not pandas has my vote. Pandas: if
       | you're not using it interactively, don't use it at all. This
       | advice goes double if your use case is "read a csv." Standard
       | library in Python has you covered there.
        
         | ttyprintk wrote:
         | Since DuckDB can read and write Pandas from memory, a team with
         | varying Pandas fluency can benefit from learning DuckDB.
        
           | adolph wrote:
           | Since Pandas 2, Apache Arrow replaced NumPy as the backend
           | for Pandas. Arrow is also used by Polars, DuckDB, Ibis, the
           | list goes on.
           | 
           | https://arrow.apache.org/overview/
           | 
           |  _Apache Arrow solves most discussed problems, such as
           | improving speed, interoperability, and data types, especially
           | for strings. For example, the new string[pyarrow] column type
           | is around 3.5 times more efficient. [...] The significant
           | achievement here is zero-copy data access, mapping complex
           | tables to memory to make accessing one terabyte of data on
           | disk as fast and easy as one megabyte._
           | 
           | https://airbyte.com/blog/pandas-2-0-ecosystem-arrow-
           | polars-d...
        
         | c0balt wrote:
         | Both duckdb and especially polars should also be mentioned
         | here. Polars in particular is quite good Ime if you want a
         | pandas-alike interface (it additionally also has a more sane
         | interface).
        
       | abdullahkhalids wrote:
       | I don't understand why numeric filters are included. The library
       | is written in python, so shouldn't a lambda function based filter
       | be roughly as fast but much easier/clearer to write.
        
         | MathMonkeyMan wrote:
         | I'm not the author, but this implementation has the benefit of
         | being a JSON compatible DSL that you can serialize. Maybe
         | that's intentional, maybe not.
         | 
         | It does look like Python's comprehensions would be a better
         | choice if you're writing them by hand anyway.
        
           | wis wrote:
           | Yea, In my opinion using Python's list comprehension is more
           | readable and code checkable.
           | 
           | Here's the usage example from the README:
           | from leopards import Q       l = [{"name":"John","age":"16"},
           | {"name":"Mike","age":"19"},{"name":"Sarah","age":"21"}]
           | filtered= Q(l,{'name__contains':"k", "age__lt":20})
           | print(list(filtered))
           | 
           | Versus:                 [x for x in l if ('k' in x['name']
           | and int(x['age']) < 20)]
           | 
           | Outputs:                 [{'name': 'Mike', 'age': '19'}]
           | 
           | Also from the readme:                 > Even though, age was
           | str in the dict, as the value of in the query dict was int,
           | Leopards converted the value in dict automatically to match
           | the query data type. This behaviour can be stopped by passing
           | False to convert_types parameter.
           | 
           | I don't like this default behavior.
        
             | WesleyJohnson wrote:
             | That's a bit biased, no? The actual comparison should be:
             | filtered = Q(l,{'name__contains':"k", "age__lt":20})
             | 
             | Verus:                 filtered = [x for x in l if ('k' in
             | x['name'] and int(x['age']) < 20)]
        
           | mkalioby wrote:
           | Sure, we wrote this to filter Json data based on user
           | provided values in a search form.
        
         | yunohn wrote:
         | AFAICT this should filter in one pass, so it would be faster
         | than multiple lambdas, or this plus a lambda for numeric.
        
       | tempcommenttt wrote:
       | It's nice it's fast at 10k dictionary entries, but how does it
       | scale?
        
         | fatih-erikli-cg wrote:
         | I think 10000 is a lot enough for a queryable dataset. More of
         | them is like computer generated things like logs etc.
        
         | HumblyTossed wrote:
         | > but how does it scale
         | 
         | Usually sideways, but if you stack them, you might get some
         | vertical.
        
       | glial wrote:
       | Interesting work. I'd be curious to know the timing relative to
       | list comprehensions for similar queries, since that's the common
       | standard library alternative for many of these examples.
        
         | mkalioby wrote:
         | Good point, but the libray allows you to generate custom
         | queries based on user input which will be tough by list
         | comprehension
        
       | dsp_person wrote:
       | Interesting... I've been playing with the idea of embedding more
       | python in my C, no cython or anything just using <Python.h> and
       | <numpy/arrayobject.h>. From one perspective it's just "free"
       | C-bindings to a lot of optimized packages. Trying some different
       | C-libraries, the python code is often faster. Python almost
       | becomes C's package manager
       | 
       | E.g. sorting 2^23 random 64-bit integers: qsort: 850ms, custom
       | radix sort: 250ms, ksort.h: 582ms, np.sort: 107ms (including
       | PyArray_SimpleNewFromData, PyArray_Sort). Where numpy uses
       | intel's x86-simd-sort I believe.
       | 
       | E.g. inserting 8M entries into a hash table (random 64-bit keys
       | and values): MSI-style hash table: ~100ns avg insert/lookup,
       | cc_map: ~95ns avg insert/lookup, Python.h: 91ns insert, 60ns
       | lookup
       | 
       | I'm curious if OPs tool might fit in similarly. I've found lmdb
       | to be quite slow even in tmpfs with no sync, etc.
        
         | sitkack wrote:
         | You should look at embedding Wasmtime into your C.
         | 
         | https://github.com/bytecodealliance/wasmtime/tree/main/examp...
        
       | anentropic wrote:
       | Django ORM for plain lists is interesting I guess... but being
       | faster than pandas at that is quite a surprise, bravo!
        
         | mkalioby wrote:
         | Thanks alot.
        
       | maweki wrote:
       | Embedding functionality into strings prevents any kind of static
       | analysis. The same issue as embedding plain SQL, plain regexes,
       | etc..
       | 
       | I am always in favor of declarative approaches where applicable.
       | But whenever they are embedded in this way, you get this static
       | analysis barrier and a possible mismatch between the imperative
       | and declarative code, where you change a return type or field
       | declaratively and it doesn't come up as an error in the
       | surrounding code.
       | 
       | A positive example is VerbalExpressions in Java, which only allow
       | expressing valid regular expressions and every invalid regular
       | expression is inexpressible in valid java code. Jooq is another
       | example, which makes incorrect (even incorrectly typed) SQL code
       | inexpressible in Java.
       | 
       | I know python is a bit different, as there is no extensive static
       | analysis in the compiler, but we do indeed have a lot of static
       | analysis tools for python that could be valuable. A statically
       | type-safe query is a wonderful thing for safety and
       | maintainability and we do have good type-checkers for python.
        
         | gpderetta wrote:
         | If your schema is dynamic, in most languages there isn't much
         | you can do, but at least in python
         | Q(name=contains('k'))
         | 
         | it is not particularly more complex to write and certainly more
         | composable, extensible and checkable.
         | 
         | Alternatively go full eval and do                  Q("'k' in
         | name")
        
         | notpushkin wrote:
         | I love how PonyORM does this for SQL: it's just Puthon `x for x
         | in ... if ...`.
         | 
         | Of course, if you use the same syntax for Python lists of
         | dicts, you don't need any library at all.
        
         | eddd-ddde wrote:
         | I disagree. You'll be surprised to hear this, but source
         | code... is just a very big string...
         | 
         | If you can run static analysis on that you can run static
         | analysis on string literals. Much like how C will give you
         | warnings for mismatched printf arguments.
        
           | maweki wrote:
           | You might be surprised to hear that most compilers and static
           | analysis tools in general do not inspect (string and other)
           | literals, while they do indeed inspect all the other parts
           | and structure of the abstract syntax tree.
        
             | eddd-ddde wrote:
             | I know, but that's the point, if you can get a string into
             | an AST you can just do the same thing with the string
             | literals. It's not magic.
        
               | saghm wrote:
               | You can't get an arbitrary string into an AST, only ones
               | that can be at parsed correctly. Rejecting the invalid
               | strings that wouldn't make sense to do analysis on is
               | pretty much the same thing that the parent comment is
               | saying to do with regexes, SQL, etc., just as part of the
               | existing compilation that's happening via the type system
               | rather than at runtime.
        
               | skeledrew wrote:
               | Everything can be abstracted away using specialized
               | objects, which can allow for better checking. The Python
               | AST itself is just specialized objects, and it can be
               | extended (but of course with much more work, esp in the
               | analysis tools). There's also this very ingenious - IMO -
               | monstrosity:
               | https://pydong.org/posts/PythonsPreprocessor/. Pick your
               | poison.
        
               | scott_w wrote:
               | Not in the standard language functions. If you wanted to
               | achieve this, you have to write your own parser. That
               | parser is, by definition, not the language parser, adding
               | a level of difficulty to proving any correctness of your
               | program.
               | 
               | There's a reason the term "stringly-typed" is used as a
               | criticism of a language.
        
               | jerf wrote:
               | This is one of those ideas that I've seen kicking around
               | for at least a decade now, but manifesting it in real
               | code is easier said than done. And that isn't even the
               | real challenge, the real challeng is _keeping it working
               | over time_.
               | 
               | I've seen some stuff based on treesitter that seems to be
               | prompting a revival of the idea, but it still has
               | fundamental issues, e.g., if I'm embedding in python:
               | sql = "SELECT * FROM table "         if
               | arbitrarilyComplicatedCondition:             sql +=
               | "INNER JOIN a AS joined ON table.thing = a.id "
               | else:             sql += "INNER JOIN b AS joined ON
               | table.thing = b.id "         sql += "WHERE joined.
               | 
               | and if you imagine trying to write something to
               | autocomplete at the point I leave off, you're
               | fundamentally stuck on not knowing which table to
               | autocomplete with. It doesn't matter what tech you swing
               | at the problem, since trying to analyze
               | "arbitrarilyComplicatedCondition" is basically Turing
               | Complete (which I will prove by vigorous handwave here
               | because turning that into a really solid statement would
               | be much larger than this entire post, but, it can be
               | done). And that's just a simple and quick example, it's
               | not just "autocomplete", it's _any_ analysis you may want
               | to do on the embedded content.
               | 
               | This is just a simple example; they get arbitrarily
               | complicated, quickly. This is one of those things that
               | when you think of the simple case it seems so easy but
               | when you try to bring it into the real world it
               | _immediately_ explodes with all the complexity your mind
               | 's eye was ignoring.
        
         | WesleyJohnson wrote:
         | Could this be mitigated by using `dict()` instead of the `{}`
         | literal, and then running an analysis to ensure the provided
         | dictionary keys all end with valid operations? E.g, __contains,
         | __lt, etc?
         | 
         | I don't have a strong background in static analysis.
        
       | graup wrote:
       | {'name__contains':"k", "age__lt":20}
       | 
       | Kind of tangential to this package, but I've always loved this
       | filter query syntax. Does it have a name?
       | 
       | I first encountered it in Django ORM, and then in DRF, which has
       | them as URL query params. I have recently built a parser for this
       | in Javascript to use it on the frontend. Does anyone know any JS
       | libraries that make working with this easy? I'm thinking parsing
       | and offering some kind of database-agnostic marshaling API. (If
       | not, I might have to open-source my own code!)
        
         | Daishiman wrote:
         | I think its first usage really was in Django. I've always
         | referred to it as the Django ORM query notation.
        
       | badmintonbaseba wrote:
       | I would prefer something like `{"name": contains("k")}`, where
       | contains("k") returns an object with a custom __eq__ that
       | compares equal to any string (or iterable) that contains "k".
       | Then you can just filter by equality.
       | 
       | I recently started using this pattern for pytest equality
       | assertions, as pytest helpfully produces a detailed diff on
       | mismatch. It's not perfect, as pytest doesn't always produce a
       | correct diff with this pattern, but it's better than some
       | alternatives.
        
         | gpderetta wrote:
         | Instead of returning an __eq__ object, I have had 'contains'
         | just return a predicate function (that you can pass to filter
         | for example. I guess in practice it doesn't really change much,
         | except that calling it __eq__ is a bit misleading.
         | 
         | A significant advantage is that you can just pass an inline
         | lambda.
        
       | mark_l_watson wrote:
       | Apologies for being off topic, but after reading the
       | implementation code, I was amazed at how short it is!
       | 
       | I have never been a huge fan of Python (Lisp person) but I really
       | appreciate how concise Python can be, and the dynamic nature of
       | Python allows the nice query syntax.
        
         | gabrielsroka wrote:
         | > Python can be seen as a dialect of Lisp
         | 
         | - Peter Norvig
         | 
         | https://www.norvig.com/python-lisp.html
        
           | mark_l_watson wrote:
           | Well, Peter has moved on to Python. I had lunch with him in
           | 2001, expecting to talk about Common Lisp, but he already
           | seemed more interested in Python.
           | 
           | It has been a while since I studied his Lisp code, but I
           | watch for new Python studies he releases.
        
         | pama wrote:
         | Agreed on conciseness of the implementation. It is short and
         | clear despite having a Max and Min that share all except one
         | character in 30 lines of code.
        
       | James_K wrote:
       | Maybe this is just me, but embedding the language in strings like
       | this seems like it's just asking for trouble.
        
         | HumblyTossed wrote:
         | Yes, it looks very fragile.
        
       | pphysch wrote:
       | I feel like the scale where a library like this is meaningful for
       | performance, and therefore worth the dependency+DSL complexity,
       | is also the scale where you should use a proper database (even
       | just SQLite).
        
       | bityard wrote:
       | Title should be prefixed with Show HN and the name of the project
       | in order to not mislead readers about the content of the link.
        
       | dmezzetti wrote:
       | Interesting project and approach, thanks for sharing!
       | 
       | If you're interested in a simple solution to query a list with
       | SQL including vector similarity, check this out:
       | https://gist.github.com/davidmezzetti/f0a0b92f5281924597c9d1...
        
       | Kalanos wrote:
       | You can create a dataframe from a list of dictionaries in pandas
       | 
       | `df = pd.DataFrame([{},{},{}])`
        
         | ciupicri wrote:
         | This is supposedly a bit faster
         | (https://github.com/mkalioby/leopards?tab=readme-ov-
         | file#comp...).
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:01 UTC)