[HN Gopher] Qsv: Efficient CSV CLI Toolkit
___________________________________________________________________
Qsv: Efficient CSV CLI Toolkit
Author : s1291
Score : 56 points
Date : 2023-12-22 12:50 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| foehrenwald wrote:
| related: https://github.com/johnkerl/miller
|
| I am wondering who really uses these tools, and for what, since
| there are R and Python data science tools available?
| snidane wrote:
| Out of core computations. While your Python and R scripts will
| choke after reading a few hundred megs, my compiled binary cli
| will keep streaming through many such files with memory usage
| sitting somewhere near zero.
| mbreese wrote:
| That's just the effect of streaming IO vs reading in the file
| into memory all at once. That has nothing to do with the
| language you use, but how you process the data.
|
| I keep multiple little Python scripts around to do things
| like sum lists of numbers (think extracting a column with
| awk, then calculating a sum). Compiled vs an interpreted
| script really doesn't matter. What matters is using the right
| algorithm for the job. R and Python data science libraries
| like to read in all of the data at once into one single data
| structure. That's the anti-pattern to avoid if at all
| possible.
|
| (But they are very handy for small datasets of complex
| calculations that require the entire dataset in memory.)
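|
| A minimal Python sketch of the streaming approach described above:
| it sums one column while holding a single row in memory at a time.
| The function name `sum_column` and the sample data are illustrative
| assumptions, not something taken from the thread.
|

```python
import csv
import io
from typing import IO


def sum_column(stream: IO[str], column: str) -> float:
    """Sum one named column while holding only a single row in memory."""
    reader = csv.DictReader(stream)
    total = 0.0
    for row in reader:  # rows are parsed lazily, one at a time
        value = row.get(column) or ""
        if value.strip():  # skip empty or missing cells
            total += float(value)
    return total


if __name__ == "__main__":
    # Hypothetical sample input; pass sys.stdin instead to use it in a pipeline.
    sample = io.StringIO("name,amount\na,1.5\nb,2.5\nc,\n")
    print(sum_column(sample, "amount"))
```

| Memory use stays flat regardless of file size because nothing but
| the running total survives from one row to the next.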
| hermitcrab wrote:
| Also: https://github.com/BurntSushi/xsv
| https://csvkit.readthedocs.io/en/latest/
| dima55 wrote:
| For simple analyses (i.e. what most people do most of the time)
| doing this on the commandline gets you there faster. I use
| vnlog (https://github.com/dkogan/vnlog/). By the time you fired
| up your editor to write your Python code, I already have
| analyses and plots ready.
| fbdab103 wrote:
| I write Python every day, but still use miller here and there.
| If I am doing a "simple" operation (eye of the beholder), being
| able to pipe it on the command line is great.
|
| To do a comparable amount of manipulation in Python takes a lot
| more boilerplate (imports, command line arguments,
| deity-can-we-default-to-Int64 already?, etc.), plus you have to
| ensure you have a virtual environment with the correct
| dependencies. That is more or less standard numpy+pandas, but a
| single executable tool to do some data workup is always
| appreciated.
|
| I am never performance constrained. I have been told that
| miller is one of the slower tools in this space, but I still
| reach for it due to its wide format support.
| dima55 wrote:
| An incomplete list of other similar tools:
| https://github.com/dkogan/vnlog/#description
| alchemist1e9 wrote:
| Here is a related but more obscure tool that can be
| surprisingly useful.
|
| http://hopper.si.edu/wiki/mmti/Starbase
|
| Their tbl format is so trivially close to standard csv that I
| just convert on the fly back and forth with tiny helper perl
| scripts.
| alchemist1e9 wrote:
| Wow! This looks like a really complete set of operations, and
| extremely useful.
| snidane wrote:
| This looks great!
|
| Please consider removing any implicit network calls like the
| initial "Checking GitHub for updates...". This alone will keep
| people from adopting it or even trying it any further. This is
| similar to GNU parallel's --citation, which, albeit a small
| thing, will scare many people off.
|
| Consider adding pivot and unpivot operations. Mlr gets it quite
| right with syntax, but is unusable since it doesn't work in
| streaming mode and tries to load everything into memory, despite
| claiming otherwise.
|
| Consider adding a basic summing command. Sum is the most common
| data operation, which could warrant its own special optimized
| command, instead of offloading this to an external math processor
| like lua or python. Even better if this had a group by (-by) and
| window by (-over) capability. E.g. 'qsv sum col1,col2 -by
| col3,col4'. Brimdata's zq utility is the only one I know that
| does this quite right, but it is quite clunky to use.
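|
| A minimal Python sketch of what a streaming grouped sum could look
| like. It is not qsv code and makes no claim about how qsv would
| implement it; `grouped_sum` and the sample columns are hypothetical.
| Memory grows with the number of distinct groups, not with the
| number of rows.
|

```python
import csv
import io
from collections import defaultdict
from typing import IO, Dict, List, Sequence, Tuple


def grouped_sum(
    stream: IO[str], sum_cols: Sequence[str], by_cols: Sequence[str]
) -> Dict[Tuple[str, ...], List[float]]:
    """Stream rows once, keeping one list of running totals per group key."""
    totals: Dict[Tuple[str, ...], List[float]] = defaultdict(
        lambda: [0.0] * len(sum_cols)
    )
    for row in csv.DictReader(stream):
        key = tuple(row[col] for col in by_cols)
        running = totals[key]
        for i, col in enumerate(sum_cols):
            running[i] += float(row[col] or 0)  # treat empty cells as 0
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample data: sum qty grouped by (region, product).
    data = io.StringIO("region,product,qty\neu,a,1\neu,a,2\nus,b,5\n")
    for key, sums in grouped_sum(data, ["qty"], ["region", "product"]).items():
        print(",".join(key), *sums)
```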
|
| Consider adding a laminate command. Essentially adding a new
| column with a constant. This probably could be achieved by a join
| with a file with a single row, but why not make this common
| operation easier to use.
|
| Consider the option to concatenate csv files with mismatched
| headers. cat rows or cat columns complains about the mismatch.
| One of the most common problems with handling csvs is schema
| evolution. I and many others would appreciate if we could merge
| similar csvs together easily.
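|
| One way such a merge could work, sketched in Python under stated
| assumptions: a first pass reads only the header row of each file to
| build the union of column names, and a second pass streams the rows,
| leaving missing columns empty. `concat_csvs` is a hypothetical
| helper, and the sketch assumes no row is longer than its own header.
|

```python
import csv
import io
import os
import tempfile
from typing import IO, List


def concat_csvs(paths: List[str], out: IO[str]) -> List[str]:
    """Concatenate CSV files whose headers differ; return the merged header."""
    # Pass 1: read just the header row of each file, union in first-seen order.
    header: List[str] = []
    for path in paths:
        with open(path, newline="") as f:
            for name in next(csv.reader(f), []):
                if name not in header:
                    header.append(name)

    # Pass 2: stream rows; columns a file lacks are written as empty cells.
    writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                writer.writerow(row)
    return header


if __name__ == "__main__":
    # Hypothetical files illustrating schema evolution: column "a" dropped, "c" added.
    tmp = tempfile.mkdtemp()
    old, new = os.path.join(tmp, "old.csv"), os.path.join(tmp, "new.csv")
    with open(old, "w", newline="") as f:
        f.write("a,b\n1,2\n")
    with open(new, "w", newline="") as f:
        f.write("b,c\n3,4\n")
    buf = io.StringIO()
    concat_csvs([old, new], buf)
    print(buf.getvalue(), end="")
```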
|
| Conversions to and from other standard formats would be
| appreciated (parquet, ion, fixed width lengths, avro, etc.). Other
| compression formats as well - especially zstd.
|
| It would be nice if the tool enabled embedding outputs of
| external commands easily. Lua and python builtin support is nice,
| but probably not sufficient. I'd like to be able to run a jq
| command on a single column and merge it back as another column,
| for example.
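|
| A rough Python sketch of that idea, not a qsv feature: one column
| is piped through an external filter, one value per line, and the
| filter's output is attached as a new column. `map_column` is a
| hypothetical helper. It assumes values contain no embedded newlines
| and that the filter emits exactly one line per input line, and it
| buffers the column in memory rather than streaming. The demo uses
| a small Python one-liner as a stand-in for a jq invocation.
|

```python
import subprocess
import sys
from typing import Dict, List, Sequence


def map_column(
    rows: List[Dict[str, str]],
    column: str,
    new_column: str,
    command: Sequence[str],
) -> List[Dict[str, str]]:
    """Run `command` once over a whole column and merge its output back."""
    values = [row[column] for row in rows]
    result = subprocess.run(
        list(command),
        input="".join(value + "\n" for value in values),
        capture_output=True,
        text=True,
        check=True,  # raise if the filter exits with a non-zero status
    )
    outputs = result.stdout.splitlines()
    if len(outputs) != len(values):
        raise ValueError("filter must emit exactly one line per input line")
    return [{**row, new_column: out} for row, out in zip(rows, outputs)]


if __name__ == "__main__":
    # Stand-in filter that upper-cases each line; any line-oriented tool would do.
    upper = [
        sys.executable,
        "-c",
        "import sys\nfor line in sys.stdin: print(line.rstrip().upper())",
    ]
    rows = [{"id": "1", "name": "ada"}, {"id": "2", "name": "bob"}]
    for row in map_column(rows, "name", "name_upper", upper):
        print(row)
```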
|
| Inspiration:
| - csvquote: https://news.ycombinator.com/item?id=31351393
| - teip: https://github.com/greymd/teip
| quasarj wrote:
| Wait, who is scared off by parallel's --citation?
| fbdab103 wrote:
| I refuse to use parallel due to that obnoxiousness.
|
| At minimum, it is not installed by default, so it is already
| a negative to just using xargs. That it then puts that
| barrier in my way makes it an easy tool to skip.
| quasarj wrote:
| I just don't understand what barrier you are talking about.
| I just checked, it doesn't even whine at you when you use
| it, the help just notes that you should cite it if you
| publish a paper where you used it. And... anyone publishing
| papers knows about citation requirements lol. Anyone else
| can ignore it. What is this barrier?
| dima55 wrote:
| In addition to being annoying, it raises questions about
| whether it is free software or not. Some people care a
| whole lot about that. And some people have higher
| standards about being nagged. And lots and lots of time
| was spent discussing solutions, for instance:
| https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915541
| quasarj wrote:
| Ah, I see, they have changed it (or possibly the version
| on my system has had the --will-cite patched out, as
| discussed in this bug).
|
| Okay, I accept your argument about Free Software.
| However, I find it interesting that it's a GNU project...
| they are generally the most hardline Free Software
| people.
| fbdab103 wrote:
| To slippery slope this, what happens if more tools start
| adopting this behavior? Curl now asks you to buy Daniel
| Stenberg a coffee on each use. Wget asks you to support
| Ukraine. Caddy wants you to invest in their startup. Each
| of which may come with their own `--ignore-annoyance-
| flag` I need to learn. The best I can do is vote with my
| feet.
|
| I also do not care for the citation requirement. I
| utilize tons of tools in my work which go unstated. I do
| not feel the need to cite Linux, DNS, htop, Make, Diet
| Coke, my Kinesis keyboard, etc. Sadly, reliable plumbing
| gets no respect. Especially for a tool which is more or
| less interchangeable with some shell scripting. Unless I
| am trying to shore up the references list, I am going to
| cite directly relevant work.
|
| At some point, you no longer need to note that your work
| was powered by electricity.
| jasonjayr wrote:
| Vim has solicited donations for Uganda since forever.
| dima55 wrote:
| You can get quite far by piping to other tools and/or using
| DSLs. Pivoting can almost certainly be done by the luau support
| in qsv (or `vnl-filter`, for instance). Summing and grouping is
| something that `datamash` does well (or qsv luau probably, or
| `vnl-filter --eval`). Adding a column once again can be done
| with luau or `vnl-filter`.
|
| Would you be more likely to use this tool if it had even more
| stuff in it requiring reading even more documentation? That's a
| genuine question.
___________________________________________________________________
(page generated 2023-12-23 23:00 UTC)