[HN Gopher] Query Your Python Lists
___________________________________________________________________
Query Your Python Lists
Author : mkalioby
Score : 78 points
Date : 2024-11-15 06:18 UTC (5 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sevensor wrote:
| Having seen a lot of work come to grief because of the decision
| to use pandas, anything that's not pandas has my vote. Pandas: if
| you're not using it interactively, don't use it at all. This
| advice goes double if your use case is "read a csv." Standard
| library in Python has you covered there.
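A minimal sketch of the "read a csv" case using only the standard library. The column names and the inline sample data are made up for illustration; a real script would open a file instead.

```python
import csv
import io

# Hypothetical sample data; in practice use open("people.csv", newline="").
sample = io.StringIO("name,age\nJohn,16\nMike,19\nSarah,21\n")

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(sample))

# Every value comes back as a string, so convert explicitly before comparing.
under_20 = [r for r in rows if int(r["age"]) < 20]
print(under_20)
```

The result is already a list of dicts, which is the shape the library under discussion operates on.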
| ttyprintk wrote:
| Since DuckDB can read and write Pandas from memory, a team with
| varying Pandas fluency can benefit from learning DuckDB.
| adolph wrote:
| Since Pandas 2, Apache Arrow replaced NumPy as the backend
| for Pandas. Arrow is also used by Polars, DuckDB, Ibis, the
| list goes on.
|
| https://arrow.apache.org/overview/
|
| _Apache Arrow solves most discussed problems, such as
| improving speed, interoperability, and data types, especially
| for strings. For example, the new string[pyarrow] column type
| is around 3.5 times more efficient. [...] The significant
| achievement here is zero-copy data access, mapping complex
| tables to memory to make accessing one terabyte of data on
| disk as fast and easy as one megabyte._
|
| https://airbyte.com/blog/pandas-2-0-ecosystem-arrow-
| polars-d...
| c0balt wrote:
| Both duckdb and especially polars should also be mentioned
| here. Polars in particular is quite good IME if you want a
| pandas-like interface (its API is also saner).
| abdullahkhalids wrote:
| I don't understand why numeric filters are included. The library
| is written in Python, so shouldn't a lambda-based filter be
| roughly as fast but much easier/clearer to write?
| MathMonkeyMan wrote:
| I'm not the author, but this implementation has the benefit of
| being a JSON compatible DSL that you can serialize. Maybe
| that's intentional, maybe not.
|
| It does look like Python's comprehensions would be a better
| choice if you're writing them by hand anyway.
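To make the serialization point concrete, here is a rough sketch of a query that survives a JSON round trip. The `OPS` table and `matches` helper are hypothetical stand-ins written for this example, not the library's actual implementation.

```python
import json

# Hypothetical mini-evaluator for Django-style lookups; not leopards' code.
OPS = {
    "contains": lambda value, arg: arg in value,
    "lt": lambda value, arg: value < arg,
    "gt": lambda value, arg: value > arg,
}

def matches(row, query):
    """Return True if row satisfies every 'field__op' condition in query."""
    for key, arg in query.items():
        field, _, op = key.partition("__")
        # A bare field name with no suffix means equality.
        test = OPS[op] if op else (lambda value, arg: value == arg)
        if not test(row[field], arg):
            return False
    return True

# The query is plain data, so it can be sent over the wire and restored.
wire = json.dumps({"name__contains": "k", "age__lt": 20})
query = json.loads(wire)

rows = [{"name": "John", "age": 16}, {"name": "Mike", "age": 19}]
result = [r for r in rows if matches(r, query)]
print(result)
```

A lambda or a list comprehension cannot be shipped as JSON this way, which is the trade-off being described.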
| wis wrote:
| Yeah, in my opinion Python's list comprehensions are more
| readable and easier for tooling to check.
|
| Here's the usage example from the README:
|
|     from leopards import Q
|     l = [{"name":"John","age":"16"},
|          {"name":"Mike","age":"19"},{"name":"Sarah","age":"21"}]
|     filtered= Q(l,{'name__contains':"k", "age__lt":20})
|     print(list(filtered))
|
| Versus:
|
|     [x for x in l if ('k' in x['name']
|                       and int(x['age']) < 20)]
|
| Outputs:
|
|     [{'name': 'Mike', 'age': '19'}]
|
| Also from the readme:
|
| > Even though, age was str in the dict, as the value of in the
| > query dict was int, Leopards converted the value in dict
| > automatically to match the query data type. This behaviour can
| > be stopped by passing False to convert_types parameter.
|
| I don't like this default behavior.
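For context on why the conversion matters either way, this plain-Python sketch (no leopards involved, sample rows invented) shows how the same "less than 20" filter gives different answers on string and integer values.

```python
rows = [{"name": "Ann", "age": "9"}, {"name": "Mike", "age": "19"}]

# Lexicographic comparison on the raw strings: "9" sorts after "20".
as_strings = [r["name"] for r in rows if r["age"] < "20"]

# Numeric comparison after an explicit conversion.
as_ints = [r["name"] for r in rows if int(r["age"]) < 20]

print(as_strings)
print(as_ints)
```

With the explicit `int(...)` in a comprehension the conversion is visible at the call site, whereas an automatic default hides it.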
| WesleyJohnson wrote:
| That's a bit biased, no? The actual comparison should be:
|
|     filtered = Q(l,{'name__contains':"k", "age__lt":20})
|
| Versus:
|
|     filtered = [x for x in l if ('k' in x['name']
|                                  and int(x['age']) < 20)]
| mkalioby wrote:
| Sure, we wrote this to filter JSON data based on user-provided
| values in a search form.
| yunohn wrote:
| AFAICT this should filter in one pass, so it would be faster
| than multiple lambdas, or this plus a lambda for numeric.
| tempcommenttt wrote:
| It's nice it's fast at 10k dictionary entries, but how does it
| scale?
| fatih-erikli-cg wrote:
| I think 10,000 entries is plenty for a queryable dataset. Anything
| larger tends to be computer-generated data such as logs.
| HumblyTossed wrote:
| > but how does it scale
|
| Usually sideways, but if you stack them, you might get some
| vertical.
| glial wrote:
| Interesting work. I'd be curious to know the timing relative to
| list comprehensions for similar queries, since that's the common
| standard library alternative for many of these examples.
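A possible starting point for such a timing, using only `timeit` and two standard-library baselines. The dataset is synthetic and the function names are made up here; the library itself is deliberately left out, so this only establishes the baseline numbers to compare against.

```python
import timeit

# Hypothetical dataset of 10,000 dicts, the size mentioned in the thread.
rows = [{"name": f"user{i}", "age": i % 90} for i in range(10_000)]

def with_comprehension():
    return [r for r in rows if "7" in r["name"] and r["age"] < 20]

def with_filter_lambda():
    return list(filter(lambda r: "7" in r["name"] and r["age"] < 20, rows))

# Both must return the same rows for the timing to be a fair comparison.
assert with_comprehension() == with_filter_lambda()

for fn in (with_comprehension, with_filter_lambda):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e3:.3f} ms per run")
```

Absolute numbers depend on the machine and Python version, so only the ratio between runs on the same setup is meaningful.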
| mkalioby wrote:
| Good point, but the library allows you to generate custom
| queries based on user input, which would be tough with list
| comprehensions.
| dsp_person wrote:
| Interesting... I've been playing with the idea of embedding more
| python in my C, no cython or anything just using <Python.h> and
| <numpy/arrayobject.h>. From one perspective it's just "free"
| C-bindings to a lot of optimized packages. Trying some different
| C-libraries, the python code is often faster. Python almost
| becomes C's package manager
|
| E.g. sorting 2^23 random 64-bit integers: qsort: 850ms, custom
| radix sort: 250ms, ksort.h: 582ms, np.sort: 107ms (including
| PyArray_SimpleNewFromData, PyArray_Sort). Where numpy uses
| intel's x86-simd-sort I believe.
|
| E.g. inserting 8M entries into a hash table (random 64-bit keys
| and values): MSI-style hash table: ~100ns avg insert/lookup,
| cc_map: ~95ns avg insert/lookup, Python.h: 91ns insert, 60ns
| lookup
|
| I'm curious if OP's tool might fit in similarly. I've found lmdb
| to be quite slow even in tmpfs with no sync, etc.
| sitkack wrote:
| You should look at embedding Wasmtime into your C.
|
| https://github.com/bytecodealliance/wasmtime/tree/main/examp...
| anentropic wrote:
| Django ORM for plain lists is interesting I guess... but being
| faster than pandas at that is quite a surprise, bravo!
| mkalioby wrote:
| Thanks a lot.
| maweki wrote:
| Embedding functionality into strings prevents any kind of static
| analysis. The same issue as embedding plain SQL, plain regexes,
| etc.
|
| I am always in favor of declarative approaches where applicable.
| But whenever they are embedded in this way, you get this static
| analysis barrier and a possible mismatch between the imperative
| and declarative code, where you change a return type or field
| declaratively and it doesn't come up as an error in the
| surrounding code.
|
| A positive example is VerbalExpressions in Java, which only allows
| expressing valid regular expressions and every invalid regular
| expression is inexpressible in valid Java code. jOOQ is another
| example, which makes incorrect (even incorrectly typed) SQL code
| inexpressible in Java.
|
| I know python is a bit different, as there is no extensive static
| analysis in the compiler, but we do indeed have a lot of static
| analysis tools for python that could be valuable. A statically
| type-safe query is a wonderful thing for safety and
| maintainability and we do have good type-checkers for python.
| gpderetta wrote:
| If your schema is dynamic, in most languages there isn't much
| you can do, but at least in Python
|
|     Q(name=contains('k'))
|
| is not particularly more complex to write and certainly more
| composable, extensible and checkable.
|
| Alternatively go full eval and do
|
|     Q("'k' in name")
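One way the `Q(name=contains('k'))` form could be implemented, as a sketch only. `contains`, `lt` and this `Q` are illustrative names defined for the example and are not the leopards API.

```python
# Hypothetical helpers: each returns a predicate for a single field value.
def contains(needle):
    return lambda value: needle in value

def lt(limit):
    return lambda value: value < limit

def Q(rows, **conditions):
    """Yield rows where every named field satisfies its predicate."""
    for row in rows:
        if all(pred(row[field]) for field, pred in conditions.items()):
            yield row

rows = [{"name": "John", "age": 16}, {"name": "Mike", "age": 19},
        {"name": "Sarah", "age": 21}]

print(list(Q(rows, name=contains("k"), age=lt(20))))
# Any callable works as a predicate, including an inline lambda.
print(list(Q(rows, age=lambda a: a % 2 == 0)))
```

Because the operations are ordinary functions rather than string suffixes, a misspelled `contians(...)` is a `NameError` that linters can flag before the code runs.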
| notpushkin wrote:
| I love how PonyORM does this for SQL: it's just Python `x for x
| in ... if ...`.
|
| Of course, if you use the same syntax for Python lists of
| dicts, you don't need any library at all.
| eddd-ddde wrote:
| I disagree. You'll be surprised to hear this, but source
| code... is just a very big string...
|
| If you can run static analysis on that you can run static
| analysis on string literals. Much like how C will give you
| warnings for mismatched printf arguments.
| maweki wrote:
| You might be surprised to hear that most compilers and static
| analysis tools in general do not inspect (string and other)
| literals, while they do indeed inspect all the other parts
| and structure of the abstract syntax tree.
| eddd-ddde wrote:
| I know, but that's the point, if you can get a string into
| an AST you can just do the same thing with the string
| literals. It's not magic.
| saghm wrote:
| You can't get an arbitrary string into an AST, only ones
| that can be parsed correctly. Rejecting the invalid
| strings that wouldn't make sense to do analysis on is
| pretty much the same thing that the parent comment is
| saying to do with regexes, SQL, etc., just as part of the
| existing compilation that's happening via the type system
| rather than at runtime.
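Both sides of this exchange can be seen with the standard `ast` module: a well-formed query string parses into a tree that can be inspected, while a malformed one is rejected, though only when the parse is actually run. The `names_used` helper is a hypothetical name for this sketch.

```python
import ast

def names_used(expr):
    """Parse an expression string and return the names it refers to."""
    tree = ast.parse(expr, mode="eval")  # raises SyntaxError if malformed
    return sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})

print(names_used("'k' in name and int(age) < 20"))

try:
    names_used("'k' in name and")
except SyntaxError as exc:
    print("rejected:", exc.msg)
```

Note that the check happens when `names_used` is called, not when the surrounding module is compiled, which is the distinction being argued over.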
| skeledrew wrote:
| Everything can be abstracted away using specialized
| objects, which can allow for better checking. The Python
| AST itself is just specialized objects, and it can be
| extended (but of course with much more work, esp in the
| analysis tools). There's also this very ingenious - IMO -
| monstrosity:
| https://pydong.org/posts/PythonsPreprocessor/. Pick your
| poison.
| scott_w wrote:
| Not in the standard language functions. If you wanted to
| achieve this, you have to write your own parser. That
| parser is, by definition, not the language parser, adding
| a level of difficulty to proving any correctness of your
| program.
|
| There's a reason the term "stringly-typed" is used as a
| criticism of a language.
| jerf wrote:
| This is one of those ideas that I've seen kicking around
| for at least a decade now, but manifesting it in real
| code is easier said than done. And that isn't even the
| real challenge, the real challenge is _keeping it working
| over time_.
|
| I've seen some stuff based on treesitter that seems to be
| prompting a revival of the idea, but it still has
| fundamental issues, e.g., if I'm embedding in python:
|
|     sql = "SELECT * FROM table "
|     if arbitrarilyComplicatedCondition:
|         sql += "INNER JOIN a AS joined ON table.thing = a.id "
|     else:
|         sql += "INNER JOIN b AS joined ON table.thing = b.id "
|     sql += "WHERE joined.
|
| and if you imagine trying to write something to
| autocomplete at the point I leave off, you're
| fundamentally stuck on not knowing which table to
| autocomplete with. It doesn't matter what tech you swing
| at the problem, since trying to analyze
| "arbitrarilyComplicatedCondition" is basically Turing
| Complete (which I will prove by vigorous handwave here
| because turning that into a really solid statement would
| be much larger than this entire post, but, it can be
| done). And that's just a simple and quick example, it's
| not just "autocomplete", it's _any_ analysis you may want
| to do on the embedded content.
|
| This is just a simple example; they get arbitrarily
| complicated, quickly. This is one of those things that
| when you think of the simple case it seems so easy but
| when you try to bring it into the real world it
| _immediately_ explodes with all the complexity your
| mind's eye was ignoring.
| WesleyJohnson wrote:
| Could this be mitigated by using `dict()` instead of the `{}`
| literal, and then running an analysis to ensure the provided
| dictionary keys all end with valid operations? E.g., __contains,
| __lt, etc.?
|
| I don't have a strong background in static analysis.
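A sketch of what such a key check might look like. The `VALID_OPS` set is an assumption made for illustration and may not match the operations the library really supports, and this version runs at call time rather than as true static analysis.

```python
# Assumed set of lookup suffixes; the real library may support others.
VALID_OPS = {"contains", "lt", "gt", "lte", "gte", "in"}

def invalid_lookups(query):
    """Return the keys whose '__op' suffix is not a known operation."""
    bad = []
    for key in query:
        field, sep, op = key.rpartition("__")
        # No '__' at all means a plain equality lookup, which is fine.
        if sep and op not in VALID_OPS:
            bad.append(key)
    return bad

print(invalid_lookups({"name__contains": "k", "age__lt": 20}))
print(invalid_lookups({"name__contians": "k", "age": 20}))
```

A linter plugin could apply the same rule to dict literals or `dict(...)` keyword names found in the source, but it would still miss keys that are built dynamically.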
| graup wrote:
| {'name__contains':"k", "age__lt":20}
|
| Kind of tangential to this package, but I've always loved this
| filter query syntax. Does it have a name?
|
| I first encountered it in Django ORM, and then in DRF, which has
| them as URL query params. I have recently built a parser for this
| in Javascript to use it on the frontend. Does anyone know any JS
| libraries that make working with this easy? I'm thinking parsing
| and offering some kind of database-agnostic marshaling API. (If
| not, I might have to open-source my own code!)
| Daishiman wrote:
| I think its first usage really was in Django. I've always
| referred to it as the Django ORM query notation.
| badmintonbaseba wrote:
| I would prefer something like `{"name": contains("k")}`, where
| contains("k") returns an object with a custom __eq__ that
| compares equal to any string (or iterable) that contains "k".
| Then you can just filter by equality.
|
| I recently started using this pattern for pytest equality
| assertions, as pytest helpfully produces a detailed diff on
| mismatch. It's not perfect, as pytest doesn't always produce a
| correct diff with this pattern, but it's better than some
| alternatives.
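A rough sketch of the custom `__eq__` pattern described here. The `contains` class and `matches` helper are hypothetical names for the example, not part of pytest or any library.

```python
class contains:
    """Hypothetical matcher: equal to any container that holds `needle`."""

    def __init__(self, needle):
        self.needle = needle

    def __eq__(self, other):
        try:
            return self.needle in other
        except TypeError:
            # `other` does not support membership tests.
            return NotImplemented

    def __repr__(self):
        return f"contains({self.needle!r})"

def matches(row, pattern):
    # Filtering reduces to equality on each field named in the pattern.
    return all(expected == row.get(field) for field, expected in pattern.items())

rows = [{"name": "John"}, {"name": "Mike"}, {"name": "Sarah"}]
print([r for r in rows if matches(r, {"name": contains("k")})])

# Whole-dict equality also works because dict comparison uses == on values.
assert {"name": "Mike", "age": 19} == {"name": contains("k"), "age": 19}
```

The `__repr__` is there so that an assertion failure shows `contains('k')` rather than an opaque object address.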
| gpderetta wrote:
| Instead of returning an __eq__ object, I have had 'contains'
| just return a predicate function (that you can pass to filter,
| for example). I guess in practice it doesn't really change much,
| except that calling it __eq__ is a bit misleading.
|
| A significant advantage is that you can just pass an inline
| lambda.
| mark_l_watson wrote:
| Apologies for being off topic, but after reading the
| implementation code, I was amazed at how short it is!
|
| I have never been a huge fan of Python (Lisp person) but I really
| appreciate how concise Python can be, and the dynamic nature of
| Python allows the nice query syntax.
| gabrielsroka wrote:
| > Python can be seen as a dialect of Lisp
|
| - Peter Norvig
|
| https://www.norvig.com/python-lisp.html
| mark_l_watson wrote:
| Well, Peter has moved on to Python. I had lunch with him in
| 2001, expecting to talk about Common Lisp, but he already
| seemed more interested in Python.
|
| It has been a while since I studied his Lisp code, but I
| watch for new Python studies he releases.
| pama wrote:
| Agreed on conciseness of the implementation. It is short and
| clear despite having a Max and Min that share all except one
| character in 30 lines of code.
| James_K wrote:
| Maybe this is just me, but embedding the language in strings like
| this seems like it's just asking for trouble.
| HumblyTossed wrote:
| Yes, it looks very fragile.
| pphysch wrote:
| I feel like the scale where a library like this is meaningful for
| performance, and therefore worth the dependency+DSL complexity,
| is also the scale where you should use a proper database (even
| just SQLite).
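For comparison, the same filter against an in-memory SQLite database using the standard `sqlite3` module. The table name and sample rows are invented for this sketch.

```python
import sqlite3

rows = [{"name": "John", "age": 16}, {"name": "Mike", "age": 19},
        {"name": "Sarah", "age": 21}]

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be converted to dicts
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
# Named placeholders let executemany consume the dicts directly.
conn.executemany("INSERT INTO people VALUES (:name, :age)", rows)

# LIKE is case-insensitive for ASCII by default, unlike Python's `in`.
cursor = conn.execute(
    "SELECT name, age FROM people WHERE name LIKE ? AND age < ?",
    ("%k%", 20),
)
result = [dict(r) for r in cursor]
print(result)
conn.close()
```

The user-supplied values go in as bound parameters, so a search form can feed this query without string concatenation.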
| bityard wrote:
| Title should be prefixed with Show HN and the name of the project
| in order to not mislead readers about the content of the link.
| dmezzetti wrote:
| Interesting project and approach, thanks for sharing!
|
| If you're interested in a simple solution to query a list with
| SQL including vector similarity, check this out:
| https://gist.github.com/davidmezzetti/f0a0b92f5281924597c9d1...
| Kalanos wrote:
| You can create a dataframe from a list of dictionaries in pandas
|
| `df = pd.DataFrame([{},{},{}])`
| ciupicri wrote:
| This is supposedly a bit faster
| (https://github.com/mkalioby/leopards?tab=readme-ov-
| file#comp...).
___________________________________________________________________
(page generated 2024-11-20 23:01 UTC)