[HN Gopher] Miller CLI - Like Awk, sed, cut, join, and sort for ...
       ___________________________________________________________________
        
       Miller CLI - Like Awk, sed, cut, join, and sort for CSV, TSV and
       JSON
        
       Author : jabo
       Score  : 247 points
       Date   : 2021-08-25 06:58 UTC (16 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | enriquto wrote:
       | I'm a bit surprised by the need of verbose "--icsv --ocsv"...
       | Shouldn't it be trivial to see that the input is csv? (and in
       | that case, the output could be csv by default).
        
         | zimpenfish wrote:
         | I like how `csvkit` does it; assumes CSV input by default, `-t`
         | for TSV input, `-T` for TSV output. Given I run `csvformat -T`
         | many times a week, I appreciate the brevity.
        
         | heinrichhartman wrote:
         | > Shouldn't it be trivial to see that the input is csv?
         | 
         | There is no reliable way to infer csv files: CSV files do not
         | have a magic number. There are all kinds of separators used and
         | quoting rules differ widely.
         | 
         | It's super annoying if a tool works on a 1K line CSV file, but
         | breaks down if I have a 3 line file because it can't infer the
         | type.
         | 
         | I much prefer my tools not to be "90%-smart", but predictable.
        
           | enriquto wrote:
           | > I much prefer my tools not to be "90%-smart", but
           | predictable.
           | 
           | I understand your preference, but please recognize that it is
           | not universal. Some of us much prefer a super-simple tool
           | that fails for some particular cases, while requiring special
           | options to be completely general. Thus you can use the
           | heuristic defaults interactively (where you'll notice the
           | errors easily), and write scripts with the more explicit
           | form.
        
             | chasil wrote:
             | Wouldn't it be wonderful if we actually used ASCII as it
             | was designed?                   Oct   Dec   Hex   Char
             | ----------------------------------------         034   28
             | 1C    FS  (file separator)         035   29    1D    GS
             | (group separator)         036   30    1E    RS  (record
             | separator)         037   31    1F    US  (unit separator)
             | 
             | https://ronaldduncan.wordpress.com/2009/10/31/text-file-
             | form...
        
               | svieira wrote:
               | The problem with this is the same problem that CSV has to
               | solve though - there's no escape character specified in
               | ASCII so you can't have a unit that contains any of these
               | 4 characters or else you'll break the parser.
        
             | memetomancer wrote:
             | In turn I get what you are saying, but in this case it is
             | not a trivial problem. CSV files seem simple on the surface
             | but there are all sorts of gotchas.
             | 
             | For example, there's plenty of variation between
             | platforms/applications when it comes to just terminating a
             | line. Are we using CR, LF, CR+LF, LF+CR, NL, RS, EOL? What
             | do we do when the source file is produced by an app that
             | uses one approach but doesn't care about the others (allows
             | their occurrence)?
             | 
             | If those others should appear in the data would our
             | "90%-smart" tool make the wrong determination on line
             | termination for the whole file? would everything just break
             | or would this tool churn along and wreck all the data? how
             | long until you noticed?
             | 
             | By my estimation, the "90%-smart" tool would be about 30%
             | dependable unless used only with a known source and format,
             | meaning it wouldn't need to be smart in the first place.
        
               | enriquto wrote:
               | > CSV files seem simple on the surface but there are all
               | sorts of gotchas.
               | 
               | My point is that supporting "general CSV files" is
               | useless. Restricting your tooling to "simple CSV files"
               | is good. But my opinions are not very representative. I
               | also think that it is perfectly acceptable for a shell
               | script to fail badly when it encounters filenames with
               | spaces.
        
           | Y_Y wrote:
           | I assumed GP was referring to the filename extension. Of
           | course people put all sorts of nonsense into files and then
           | call them ".csv", but it's not a bad heuristic to use that to
           | guess the format as a default, and even guess that the output
           | format should be the same, in the absence of other flags.
        
           | enriquto wrote:
           | > It's super annoying if a tool works on a 1K line CSV file,
           | but breaks down if I have a 3 line file because it can't
           | infer the type.
           | 
           | How can that be? Is it possible to have such an ambiguous
           | file? I mean, if a file contains a single number on a single
           | name it can be anything, but the interpretation is the same.
           | Can you create a file that has different contents depending
           | on whether it is interpreted as csv or tsv?
        
             | actually_a_dog wrote:
             | Not only is it possible, as a couple of commenters have
             | already shown, but, due to the many variants of CSV out
             | there, it's possible to construct a CSV that has different
             | contents depending on which dialect you tell your CSV
             | reader to expect. I'll leave the actual construction as an
             | exercise to the reader, but, it would work along the same
             | lines as the ambiguous TSV/CSV files you've seen here
             | already.
        
             | faho wrote:
             | >Can you create a file that has different contents
             | depending on whether it is interpreted as csv or tsv?
             | 
             | Easily!                   a\tb,c
             | 
             | Is "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!
             | 
             | In practice, if the first line contains only a tab or a
             | comma it might be enough to infer that as the separator,
             | but:
             | 
             | 1. that would fail on single-column files (by
             | misinterpreting them as multi-column if the unused
             | separator appears) 2. that couldn't infer anything on files
             | where both separators appear
             | 
             | So it would only be a 90% (or maybe 99%) solution.
        
               | darkwater wrote:
               | If it's a single column file, you - the user - should
               | know it and act accordingly. Yeah scripting usage on
               | random input, II know, but in that case then you would
               | specify the input type. But heuristic defaulting to the
               | most common usage in case of doubt (that would be CSV in
               | the CSV/TSV doubt) for the interactive use is the way to
               | go. Or at least, it's what I would expect personally as a
               | user.
        
               | kzrdude wrote:
               | Heuristics are annoying and data-dependent algo changes
               | are dangerous.
               | 
               | Completely different example, but limits in Splunk
               | aggregations - it means you can run your report on small
               | data, but when you scale it up (to real production data
               | sizes, maybe), then suddenly you get wrong numbers, and
               | maybe results like "0 errors of type X" when the real
               | answer is that there are >=1 errors. Because one of the
               | aggregations used has a window size limit that it is
               | silently applying. This stuff is dangerous.
               | 
               | What Splunk was doing for me would be the equivalent of
               | an SQL join giving approximate answers when the data is
               | too big.
        
               | faho wrote:
               | I can see that argument, but I kinda don't agree?
               | 
               | The issue with the heuristic is that it can fail
               | depending on the input, and the input can _easily_ change
               | in a way that kills the heuristic.
               | 
               | Say you run the tool on a file, and it detects csv input
               | and all is well. Then you update the file, and now it
               | includes a tab character in the first line and the
               | heuristic detects it as a tsv and now it fails - or the
               | heuristic now gives up, or whatever.
               | 
               | Sure you can "improve the heuristic", but you can still,
               | always, have the _data_ change in a way that it defeats
               | it. You now need to either be careful with the data or
               | _know_ that you should specify the format, without the
               | tool telling you. Everything seems to work immediately,
               | and then later it blows up. That 's a problem akin to
               | e.g. bash and filenames with spaces. Everything works,
               | until someone has a space in a filename, and then you get
               | told that you should have known to quote everything all
               | along (the solution there, would be to abolish word
               | splitting).
               | 
               | To coin a pithy phrase: When a tool is easy to misuse and
               | a user misuses it, blame the tool, not the user.
               | 
               | Now, if I were writing this thing I would make the logic
               | much simpler: Make it default to csv (or whichever format
               | is more common). Now the way to break the "heuristic" is
               | to give data in the wrong format. But if you use csv, you
               | don't have to explicitly give the format and your data
               | can't break the heuristic (unless it switches format,
               | which you would know about).
        
             | mellavora wrote:
             | Sure. It is typical to represent currency amounts as
             | 
             | 12,000
             | 
             | If it isn't quoted, then a csv will read the comma as a
             | separator, while a TSV won't
             | 
             | amazing number of similar examples. I went for a real use-
             | case rather than a theoretical possibility
        
           | qwerty456127 wrote:
           | If a text file has the same number of commas, tabs or
           | semicolons on every line it most probably is (but, obviously,
           | is not guaranteed to be) a CSV/TSV/SSV.
           | 
           | Defining every flavour of these is hardly possible with a
           | simple command line so I would rather let the user to specify
           | an entire configuration file for this. We probably need an
           | entire CSV schema language.
        
             | carlmr wrote:
             | That's already not true for Quoted entries that contain the
             | separator. Which I think is a common CSV use case.
             | 
             | But I do agree you could have a heuristic. E.g. ends in
             | .csv and contains a lot more commas/semicolons/tabs than
             | you would expect in normal text in the first 1-5 lines.
             | 
             | You could still have the flag as a fallback when you need
             | something that's completely reliable.
        
       | citilife wrote:
       | Not exactly the same, but we wrote a library to easily load any
       | delimited type of file and finds header (even if not first row).
       | It also works to load JSON, Parquet, AVRO and loads it into a
       | dataframe. Not CLI exactly, but pretty easy:
       | 
       | https://github.com/capitalone/dataprofiler
       | 
       | Can Install: _pip install dataprofiler[ml] --user_
       | 
       | How it works:
       | 
       | csv_data = Data('your_file.csv') # Load: delimited, JSON,
       | Parquet, Avro
       | 
       | csv_data.data.head(10) # Get head
       | 
       | csv_data.data.sort_values(by='name', inplace=True) # Sort
       | 
       | Anyway, pretty interesting Miller CLI. I'm not sure how the
       | header detection works, especially if the header isn't the first
       | row (which is often the case)
        
         | snthpy wrote:
         | Thanks. This looks very cool.
        
       | nwmcsween wrote:
       | A bit OT but there really need to be a new paradigm shift for a
       | new OS interface that has less coupling and more flexibility
       | compared to piping.
        
       | otabdeveloper4 wrote:
       | See also my own foray into this space:
       | https://tkatchev.bitbucket.io/
       | 
       | You might find it useful.
        
         | natrys wrote:
         | Thanks for tab. I have used it a bit in the pipeline of some of
         | my projects. Like jq, I think tab is another good example of a
         | DSL that allows tersely and quickly getting things done. Same
         | as Perl I suppose, but cleaner and more functional. Raku is
         | also great, but it adds big dependencies whereas tab is single
         | binary which is always nice.
         | 
         | If you are taking feature request, string interpolation or a
         | sprintf function would make report generation kind of things
         | easier for me.
        
       | calvinmorrison wrote:
       | Just yesterday I was opining in the IRC about what it would be
       | like to have a operating system that was build from the ground up
       | that supports smarter data formats than just text.
       | 
       | The concept of everything as a file, has awesome benefits I
       | believe, but that doesn't meant files need to be unstructured
       | text.
       | 
       | There's a core problem, that with text languages (like AWK or
       | structured regex (sed, etc) you end up both having to parse AND
       | manipulate data. which is no fun and prone to errors.
       | 
       | Abstracting away all of that into codecs that all of coreutils
       | could speak would be very cool to me.
       | 
       | The second issue is structured data vs unstructured text. CSVs,
       | or other table based formats make sense sometimes, and sometimes
       | you want to be able to query things more easily. JSON provides
       | that. What's the synthesis of both I don't know... but maybe this
       | is closer to heaven.
       | 
       | I'd like to _minimize_ the amount of parsing or text manipulation
       | done in any program and be able to focus on extracting and
       | inserting the data I need, like a rosetta stone that just handles
       | everything for me while I work. I want to be able to do:
       | 
       | awk "{print $schoolName}" or awk "{print $4}" and it just works -
       | json, text, CSV.
        
         | zokier wrote:
         | Maybe the most notable system that had "files as structured
         | data" is the IBM S/360 lineage of mainframes.
         | 
         | > In the context of IBM mainframe computers in the S/360 line,
         | a data set (IBM preferred) or dataset is a computer file having
         | a record organization.
         | 
         | > Data sets are not unstructured streams of bytes, but rather
         | are organized in various logical record[3] and block structures
         | determined by the DSORG (data set organization), RECFM (record
         | format), and other parameters
         | 
         | https://en.wikipedia.org/wiki/Data_set_(IBM_mainframe)
         | 
         | Remarkably S/360 predates UNIX by several years, so its
         | definitely not a novel concept. Albeit apparently IBM systems
         | cost order of magnitude more than PDPs of the time.
         | 
         | Of course there are also classic LISP machines where afaik
         | pretty much everything was a SEXPR, but I have even less
         | knowledge about the details of those.
        
         | brushfoot wrote:
         | You should check out PowerShell. It's based on exactly that
         | concept. Everything is an object.
         | 
         | `Import-Csv` and `ConvertFrom-Json` return objects that you can
         | manipulate using LINQ or PowerShell's `*-Object` cmdlets
         | (`Select-Object`, `Where-Object`, `Group-Object`, etc.). `Get-
         | Process` returns objects representing your system's processes,
         | `Get-Service` returns objects for your system's services, and
         | so on.
         | 
         | Being able to chain that in your pipeline is incredibly
         | powerful. You can send data into and out of CSVs and various
         | other providers in a single line. It's a huge productivity
         | boost.
        
         | platz wrote:
         | so, powershell?
        
           | kybernetikos wrote:
           | Sounds like it. Another option might be nushell.
           | https://www.nushell.sh/
        
         | oauea wrote:
         | You should take a look at the late TempleOS. It's not a serious
         | contender but has a lot of interesting ideas in that direction
        
           | pisipisipisi wrote:
           | He is dead now. RIP
        
       | throw0101a wrote:
       | > _Like Awk, sed, cut, join, and sort_ [...]
       | 
       | So basically Perl? :)
        
       | traceroute66 wrote:
       | Is this (supposedly) better than Burntsushi toolset (e.g. XSV[1])
       | or just different ?
       | 
       | [1] https://github.com/BurntSushi/xsv
        
         | chaosfox wrote:
         | I love xsv, but its been 3 years since last release and the PRs
         | are accumulating.. miller seems much more active.
        
       | mmahemoff wrote:
       | I don't think Sed and Awk should be on this list. It's not really
       | able to transform text into something else, is it?
       | 
       | Even the tutorial leans on sed for cleansing ahead of Miller.
       | https://www.ict4g.net/adolfo/notes/data-analysis/miller-quic...
        
         | majkinetor wrote:
         | Yes it can.
         | 
         | https://miller.readthedocs.io/en/latest/features.html
        
       | lenkite wrote:
       | I just prefer using R data-frames - the convenience of a full
       | data-science language shell for getting statistical data out of
       | CSV. One can also make gorgeous graphs using ggplot2.
        
       | awild wrote:
       | I make a lot of use of Miller at work and it has been a godsend.
       | People tend to underestimate tools like this, but I'm able to
       | answer a lot of questions about our data that other people would
       | take longer for or don't have the capacity to check. It's great
       | as a first step and if anything more complex is needed we usually
       | spin up a python script. So it's very much in the philosophy of
       | small composable tools but for records.
        
       | lazyeye wrote:
       | For windows users NimbleText is a great little utility for
       | working with tabular data:-
       | 
       | https://nimbletext.com/
        
       | muad_kyrlach wrote:
       | Got a csv or json you need to eff-with from the command line?
       | It's Miller time, baby!
        
       | dima55 wrote:
       | Similar, but using the ACTUAL awk, sed, join, sort tools you
       | already have and know about: https://github.com/dkogan/vnlog/
        
       | sam_goody wrote:
       | How does this compare to JQ?
       | 
       | Aside for that it appears to support more formats (CSV [which has
       | partial JQ support], and TSV), of course.
       | 
       | I find it very helpful when people compare their new toy or
       | service with what is already pretty well known.
        
         | awild wrote:
         | Miller is not new (at least 6 years old is what I can gleam
         | from the Github page).
         | 
         | It is somewhat similar to JQ, but specifically for record based
         | formats and not documents.
         | 
         | The example page [0] has quite a few nice examples.
         | 
         | A recent snippet from my own usage, looking for duplicate ids
         | in a large-ish csv.gz:                   mlr uniq -c -g id then
         | filter '$count > 1'         mlr --ijson --ocsv filter
         | '$success' then stats1 -a mean,min,max,stddev -f duration
         | 
         | [0]: https://miller.readthedocs.io/en/latest/data-examples.html
        
       | flakiness wrote:
       | The comparison between their ongoing C version [1] and the
       | upcoming go version [2] is nice to see. The README talks a bit
       | more the performance comparison and the C version's strength, but
       | still.
       | 
       | [1] https://github.com/johnkerl/miller/tree/main/c
       | 
       | [2] https://github.com/johnkerl/miller/tree/main/go
        
       | nyolfen wrote:
       | i recently came across `jq`, which i use for this purpose
        
       | nyolfen wrote:
       | i recently came across jq, which i use for json parsing:
       | https://stedolan.github.io/jq/
        
       | xvilka wrote:
       | There's also rq (record query)[1] that also supports CSV and JSON
       | but not TSV though. It's written in Rust.
       | 
       | [1] https://github.com/dflemstr/rq
        
       | yung_steezy wrote:
       | I always use sqlite3 for working with large CSV files:
       | $ sqlite3         $ .mode csv         $ .import my.csv foo
       | $ SELECT * FROM foo WHERE name = 'bar';
       | 
       | It reads the header in automatically for the field names and then
       | stores all the values as strings.
        
         | psanford wrote:
         | I do the same. I wrote[0] a little wrapper around this workflow
         | that does a few additional niceties: 1) normalizes column names
         | and removes whitespace, 2) supports importing multiple csv
         | files without importing the header line as a record.
         | 
         | [0]: https://github.com/psanford/csv2sqlite
        
         | noisy_boy wrote:
         | Great tip. A simple utility script can be used for repeated
         | use/customization: https://pastebin.com/qEkimk7W
         | 
         | Another advantage is that this can be executed to import as
         | part of initialization and get to sqlite prompt with the table
         | ready for sql execution:                   $ csv_to_table.sh
         | sample-test.csv
         | 
         | Or it can be executed using heredoc for non-interactive
         | execution as well:                   $ csv_to_table.sh sample-
         | test.csv <<-CMD         select * from sample_test limit 10;
         | CMD         ... output ...
        
         | sdmac99 wrote:
         | duckdb is another good option as well.
        
         | inanutshellus wrote:
         | holy moly, that's rad. definitely adding this to my toolbox,
         | thanks!
        
         | tyingq wrote:
         | This stack exchange answer makes that into a fairly simple bash
         | alias: https://unix.stackexchange.com/questions/424555/how-to-
         | inser...
        
         | greazy wrote:
         | This is extremely ugly, I have no idea why you'd do this. If
         | you're going to call a tool in interactive mode on the CLI,
         | then use R for proper data handling.
        
           | [deleted]
        
           | halayli wrote:
           | If there is anything ugly here, it's your communication.
           | 
           | Aside from the fact that OP's example can run in a non-
           | interactive mode, it makes absolutely no difference to the
           | discussion.
        
           | skrebbel wrote:
           | You can't understand that there are people who know SQL and
           | not R?
        
             | momirlan wrote:
             | I don't think the comment refers to people, just to better
             | ways of doing things. People learn.
        
               | [deleted]
        
               | lugged wrote:
               | In my experience people rarely learn anything after you
               | insult them and call their work ugly.
               | 
               | If AP had a "better way of doing things" maybe they
               | should have actually given us an example of what it looks
               | like.
               | 
               | Right now, all I think about R is that snobby,
               | gatekeeping assholes use it to do things I can already do
               | easily in SQL.
        
         | majkinetor wrote:
         | Not bad. Here is the Windows equivalent with Excel separator:
         | PS> gsudo choco install sqlite3              PS > sqlite3
         | sqlite> .mode csv         sqlite> .separator ; \n
         | sqlite> .import my.csv foo         sqlite> SELECT * FROM foo
         | WHERE name = 'bar';
         | 
         | For CSV that contain up to several 100k records native
         | PowerShell solution is way faster and more practical:
         | Get-Content my.csv | ConvertFrom-Csv | ? name -eq bar
         | 
         | FYI, I maintain miller choco package:
         | https://community.chocolatey.org/packages/miller
        
       | simonw wrote:
       | I added a new feature to my sqlite-utils CLI tool in June which
       | covers similar ground: it now lets you load CSV, TSV and JSON
       | data into an in-memory SQLite database and execute SQL queries
       | against it: https://simonwillison.net/2021/Jun/19/sqlite-utils-
       | memory/
       | 
       | Example usage:                   sqlite-utils memory example.csv
       | "select * from t order by color, shape"         # Defaults to
       | outputting JSON, you can add         # --csv or --tsv for those
       | formats or         # --table to output as a rendered table
       | 
       | More docs here: https://sqlite-
       | utils.datasette.io/en/stable/cli.html#queryin...
        
       | chrisweekly wrote:
       | See LNAV (https://lnav.org) for a "mini-ETL" CLI powertool with
       | embedded SQLite; it's fantastic.
        
       | smashah wrote:
       | Nice. Any reason why installation section is not prominent and on
       | the top of the readme?
       | 
       | It's even in an non-intuitive place in the documentation.
        
       | brushfoot wrote:
       | If you have Windows, you have a tool that can do this already
       | installed: PowerShell. (If you don't have Windows, PowerShell is
       | free and open source [MIT license]. One of the first things I
       | install on a new Linux or macOS box lately.)
       | 
       | The first example in PowerShell:
       | 
       | > Import-Csv example.csv | Sort-Object Color, Shape
       | 
       | To filter, use `Where-Object`:
       | 
       | > Import-Csv example.csv | Where-Object { $_.Color -eq 'purple' }
       | | Sort-Object Color, Shape
       | 
       | Or using aliases:
       | 
       | > ipcsv example.csv | ? { $_.Color -eq 'purple' } | sort Color,
       | Shape
       | 
       | JSON uses the same syntax -- just replace `Import-Csv` with `Get-
       | Content` and `ConvertFrom-Json`:
       | 
       | > Get-Content example.json | ConvertFrom-Json | Select-Object
       | Color -Unique
       | 
       | > purple
       | 
       | > red
       | 
       | > yellow
       | 
       | There's also `Group-Object` for aggregation.
        
       | csmarshall wrote:
       | I don't know about MillerCLI's portability, but RecordStream
       | (https://github.com/benbernard/RecordStream) is my go to swiss
       | army knife.
        
       | msluyter wrote:
       | Some other useful tools for csv files, in particular:
       | 
       | csvkit, a set of command line tools for manipulating csvs:
       | https://csvkit.readthedocs.io/en/latest/
       | 
       | visdata, a quite terminal based csv explorer:
       | https://www.visidata.org/
        
         | mxuribe wrote:
         | visdata is pretty mind-blowing...at least, that's how i felt
         | when i saw the demo the first time:
         | https://www.youtube.com/watch?v=N1CBDTgGtOU
        
       ___________________________________________________________________
       (page generated 2021-08-25 23:02 UTC)