hngopher.com

       [HN Gopher] Qsv: Efficient CSV CLI Toolkit
       ___________________________________________________________________
        
       Qsv: Efficient CSV CLI Toolkit
        
       Author : s1291
       Score  : 56 points
       Date   : 2023-12-22 12:50 UTC (1 days ago)
        
 (HTM) web link (github.com)
        
       | foehrenwald wrote:
       | related: https://github.com/johnkerl/miller
       | 
       | I am wondering who really uses these tools and for what since
       | there are R and python data science tools available?
        
         | snidane wrote:
         | Out-of-core computations. While your Python and R scripts will
         | choke after reading a few hundred megs, my compiled binary CLI
         | will keep streaming through many such files with memory usage
         | sitting somewhere near zero.
        
           | mbreese wrote:
           | That's just the effect of streaming IO vs reading the file
           | into memory all at once. That has nothing to do with the
           | language you use, but with how you process the data.
           | 
           | I keep multiple little Python scripts around to do things
           | like sum lists of numbers (think extracting a column with
           | awk, then calculating a sum). Compiled vs an interpreted
           | script really doesn't matter. What matters is using the right
           | algorithm for the job. R and Python data science libraries
           | like to read in all of the data at once into one single data
           | structure. That's the anti-pattern to avoid if at all
           | possible.
           | 
           | (But they are very handy for small datasets or complex
           | calculations that require the entire dataset in memory.)
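
The streaming approach described in the comment above can be sketched in Python as follows. This is a minimal illustration, not anyone's actual script: the function name `stream_sum` and the sample column name `amount` are made up for the example, and it assumes the input has a header row and that non-empty cells parse as decimal numbers.

```python
import csv
from decimal import Decimal


def stream_sum(lines, column):
    """Sum one named column while holding only a single row in memory.

    `lines` is any iterable of text lines (an open file, sys.stdin, a list).
    """
    total = Decimal(0)
    for row in csv.DictReader(lines):   # rows are parsed lazily, one at a time
        value = (row[column] or "").strip()
        if value:                       # skip empty cells
            total += Decimal(value)
    return total


# Tiny in-memory demo; a real run would pass an open file or sys.stdin.
demo = ["id,amount\n", "1,2.5\n", "2,\n", "3,4\n"]
print(stream_sum(demo, "amount"))  # prints 6.5
```

Because the reader is consumed row by row, memory use stays roughly constant regardless of file size, which is the point being made about algorithm choice versus language choice.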
        
         | hermitcrab wrote:
         | Also: https://github.com/BurntSushi/xsv
         | https://csvkit.readthedocs.io/en/latest/
        
         | dima55 wrote:
         | For simple analyses (i.e. what most people do most of the time)
         | doing this on the command line gets you there faster. I use
         | vnlog (https://github.com/dkogan/vnlog/). By the time you have
         | fired up your editor to write your Python code, I already have
         | analyses and plots ready.
        
         | fbdab103 wrote:
         | I write Python every day, but still use miller here and there.
         | If I am doing a "simple" operation (eye of the beholder), being
         | able to pipe it on the command line is great.
         | 
         | To do a comparable amount of manipulation in Python takes a lot
         | more boilerplate (imports, command line arguments,
         | deity-can-we-default-to-Int64 already?, etc.), plus you have to
         | ensure you have a virtual environment with the correct
         | dependencies. That is more or less standard numpy+pandas, but a
         | single executable tool to do some data workup is always
         | appreciated.
         | 
         | I am never performance constrained. I have been told that
         | miller is one of the slower tools in this space, but I still
         | reach for it due to its wide format support.
        
       | dima55 wrote:
       | An incomplete list of other similar tools:
       | https://github.com/dkogan/vnlog/#description
        
         | alchemist1e9 wrote:
         | Here is a related but more obscure tool that can be
         | surprisingly useful.
         | 
         | http://hopper.si.edu/wiki/mmti/Starbase
         | 
         | Their tbl format is so trivially close to standard csv that I
         | just convert on the fly back and forth with tiny helper perl
         | scripts.
        
       | alchemist1e9 wrote:
       | Wow! This looks like a really complete and extremely useful set
       | of operations.
        
       | snidane wrote:
       | This looks great!
       | 
       | Please consider removing any implicit network calls like the
       | initial "Checking GitHub for updates...". This by itself will
       | keep people from adopting it or even trying it any further. This
       | is similar to GNU parallel's --citation, which, albeit a small
       | thing, will scare many people off.
       | 
       | Consider adding pivot and unpivot operations. Mlr gets it quite
       | right with syntax, but is unusable since it doesn't work in
       | streaming mode and tries to load everything into memory, despite
       | claiming otherwise.
       | 
       | Consider adding a basic summing command. Sum is the most common
       | data operation, which could warrant its own special optimized
       | command instead of offloading this to an external math processor
       | like Lua or Python. Even better if this had a group by (-by) and
       | window by (-over) capability, e.g. 'qsv sum col1,col2 -by
       | col3,col4'. Brimdata's zq utility is the only one I know that
       | does this quite right, but it is quite clunky to use.
       | 
       | Consider adding a laminate command. Essentially adding a new
       | column with a constant. This probably could be achieved by a join
       | with a file with a single row, but why not make this common
       | operation easier to use?
       | 
       | Consider the option to concatenate csv files with mismatched
       | headers. cat rows or cat columns complains about the mismatch.
       | One of the most common problems with handling csvs is schema
       | evolution. I and many others would appreciate if we could merge
       | similar csvs together easily.
       | 
       | Conversions to and from other standard formats would be
       | appreciated (parquet, ion, fixed width lengths, avro, etc.). Other
       | compression formats as well, especially zstd.
       | 
       | It would be nice if the tool enabled embedding outputs of
       | external commands easily. Lua and Python builtin support is nice,
       | but probably not sufficient. I'd like to be able to run a jq
       | command on a single column and merge it back as another column,
       | for example.
       | 
       | Inspiration:
       | - csvquote: https://news.ycombinator.com/item?id=31351393
       | - teip: https://github.com/greymd/teip
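
The request above for concatenating CSV files with mismatched headers (the schema evolution case) could look roughly like the sketch below. It is not how qsv or any other tool mentioned here works; `concat_union` is a hypothetical helper, and it assumes each input can be opened twice (once to read its header, once to stream its rows) and that every input has a header row.

```python
import csv
import io


def concat_union(sources, out):
    """Concatenate CSV inputs whose headers differ, padding missing cells.

    `sources` is a list of zero-argument callables, each returning a fresh
    iterable of text lines, so every input can be read twice.
    """
    # Pass 1: collect the union of column names in first-seen order.
    columns = []
    for open_source in sources:
        header = next(csv.reader(open_source()), [])
        for name in header:
            if name not in columns:
                columns.append(name)

    # Pass 2: stream every row out under the merged header.
    # restval fills columns a file lacks; extrasaction="ignore" drops
    # overflow cells from rows that are longer than their own header.
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    for open_source in sources:
        for row in csv.DictReader(open_source()):
            writer.writerow(row)


# In-memory demo: the second "file" gained an email column.
old = "id,name\n1,ann\n"
new = "id,email,name\n2,bob@example.com,bob\n"
merged = io.StringIO()
concat_union([lambda: io.StringIO(old), lambda: io.StringIO(new)], merged)
print(merged.getvalue(), end="")
```

The two-pass design keeps memory use flat: only the header of each file is held while the union is built, and rows are written as they are read. With real files, each callable would be something like `lambda: open(path, newline="")`.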
        
         | quasarj wrote:
         | Wait, who is scared off by parallel's --citation?
        
           | fbdab103 wrote:
           | I refuse to use parallel due to that obnoxiousness.
           | 
           | At minimum, it is not installed by default, so it is already
           | at a disadvantage compared to just using xargs. That it then
           | puts that barrier in my way makes it an easy tool to skip.
        
             | quasarj wrote:
             | I just don't understand what barrier you are talking about.
             | I just checked, it doesn't even whine at you when you use
             | it, the help just notes that you should cite it if you
             | publish a paper where you used it. And... anyone publishing
             | papers knows about citation requirements lol. Anyone else
             | can ignore it. What is this barrier?
        
               | dima55 wrote:
               | In addition to being annoying, it raises questions about
               | whether it is free software or not. Some people care a
               | whole lot about that. And some people have higher
               | standards about being nagged. And lots and lots of time
               | was spent discussing solutions, for instance:
               | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915541
        
               | quasarj wrote:
               | Ah, I see, they have changed it (or possibly the version
               | on my system has had the --will-cite patched out, as
               | discussed in this bug).
               | 
               | Okay, I accept your argument about Free Software.
               | However, I find it interesting that it's a GNU project...
               | they are generally the most hardline Free Software
               | people.
        
               | fbdab103 wrote:
               | To slippery slope this, what happens if more tools start
               | adopting this behavior? Curl now asks you to buy Daniel
               | Stenberg a coffee on each use. Wget asks you to support
               | Ukraine. Caddy wants you to invest in their startup. Each
               | of which may come with their own
               | `--ignore-annoyance-flag` I need to learn. The best I can
               | do is vote with my feet.
               | 
               | I also do not care for the citation requirement. I
               | utilize tons of tools in my work which go unstated. I do
               | not feel the need to cite Linux, DNS, htop, Make, Diet
               | Coke, my Kinesis keyboard, etc. Sadly, reliable plumbing
               | gets no respect. Especially for a tool which is more or
               | less interchangeable with some shell scripting. Unless I
               | am trying to shore up the references list, I am going to
               | cite directly relevant work.
               | 
               | At some point, you no longer need to note that your work
               | was powered by electricity.
        
               | jasonjayr wrote:
               | Vim has solicited donations for Uganda since forever.
        
         | dima55 wrote:
         | You can get quite far by piping to other tools and/or using
         | DSLs. Pivoting can almost certainly be done by the luau support
         | in qsv (or `vnl-filter`, for instance). Summing and grouping is
         | something that `datamash` does well (or qsv luau probably, or
         | `vnl-filter --eval`). Adding a column once again can be done
         | with luau or `vnl-filter`.
         | 
         | Would you be more likely to use this tool if it had even more
         | stuff in it requiring reading even more documentation? That's a
         | genuine question.
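
For readers unfamiliar with the grouped sum discussed in this subthread (the proposed 'qsv sum col1,col2 -by col3,col4', which is a suggested syntax and not an existing command), here is a rough Python sketch of the operation itself. The function name `grouped_sum` and the sample column names are invented for illustration; it assumes a header row and numeric cells, and treats empty cells as zero.

```python
import csv
from collections import defaultdict
from decimal import Decimal


def grouped_sum(lines, sum_cols, by_cols):
    """Sum `sum_cols` for each distinct combination of `by_cols` values.

    Rows are streamed; memory grows with the number of groups, not rows.
    """
    totals = defaultdict(lambda: [Decimal(0)] * len(sum_cols))
    for row in csv.DictReader(lines):
        key = tuple(row[c] for c in by_cols)
        acc = totals[key]
        for i, col in enumerate(sum_cols):
            acc[i] += Decimal(row[col] or "0")  # empty cell counts as zero
    return dict(totals)


# In-memory demo with hypothetical columns.
demo = [
    "region,year,sales,units\n",
    "eu,2022,10.5,1\n",
    "us,2022,3,2\n",
    "eu,2022,1.5,4\n",
]
for key, sums in grouped_sum(demo, ["sales", "units"], ["region", "year"]).items():
    print(key, [str(s) for s in sums])
```

This is the same computation that `datamash` performs with its group-by and sum operations; the sketch only shows why it fits a streaming model.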
        
       ___________________________________________________________________
       (page generated 2023-12-23 23:00 UTC)