[HN Gopher] Miller CLI - Like Awk, sed, cut, join, and sort for ...
___________________________________________________________________
Miller CLI - Like Awk, sed, cut, join, and sort for CSV, TSV and
JSON
Author : jabo
Score : 247 points
Date : 2021-08-25 06:58 UTC (16 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| enriquto wrote:
| I'm a bit surprised by the need of verbose "--icsv --ocsv"...
| Shouldn't it be trivial to see that the input is csv? (and in
| that case, the output could be csv by default).
| zimpenfish wrote:
| I like how `csvkit` does it; assumes CSV input by default, `-t`
| for TSV input, `-T` for TSV output. Given I run `csvformat -T`
| many times a week, I appreciate the brevity.
| heinrichhartman wrote:
| > Shouldn't it be trivial to see that the input is csv?
|
| There is no reliable way to infer csv files: CSV files do not
| have a magic number. There are all kinds of separators used and
| quoting rules differ widely.
|
| It's super annoying if a tool works on a 1K line CSV file, but
| breaks down if I have a 3 line file because it can't infer the
| type.
|
| I much prefer my tools not to be "90%-smart", but predictable.
| enriquto wrote:
| > I much prefer my tools not to be "90%-smart", but
| predictable.
|
| I understand your preference, but please recognize that it is
| not universal. Some of us much prefer a super-simple tool
| that fails for some particular cases, while requiring special
| options to be completely general. Thus you can use the
| heuristic defaults interactively (where you'll notice the
| errors easily), and write scripts with the more explicit
| form.
| chasil wrote:
| Wouldn't it be wonderful if we actually used ASCII as it
| was designed? Oct Dec Hex Char
| ---------------------------------------- 034 28
| 1C FS (file separator) 035 29 1D GS
| (group separator) 036 30 1E RS (record
| separator) 037 31 1F US (unit separator)
|
| https://ronaldduncan.wordpress.com/2009/10/31/text-file-
| form...
| svieira wrote:
| The problem with this is the same problem that CSV has to
| solve though - there's no escape character specified in
| ASCII so you can't have a unit that contains any of these
| 4 characters or else you'll break the parser.
| memetomancer wrote:
| In turn I get what you are saying, but in this case it is
| not a trivial problem. CSV files seem simple on the surface
| but there are all sorts of gotchas.
|
| For example, there's plenty of variation between
| platforms/applications when it comes to just terminating a
| line. Are we using CR, LF, CR+LF, LF+CR, NL, RS, EOL? What
| do we do when the source file is produced by an app that
| uses one approach but doesn't care about the others (allows
| their occurrence)?
|
| If those others should appear in the data would our
| "90%-smart" tool make the wrong determination on line
| termination for the whole file? would everything just break
| or would this tool churn along and wreck all the data? how
| long until you noticed?
|
| By my estimation, the "90%-smart" tool would be about 30%
| dependable unless used only with a known source and format,
| meaning it wouldn't need to be smart in the first place.
| enriquto wrote:
| > CSV files seem simple on the surface but there are all
| sorts of gotchas.
|
| My point is that supporting "general CSV files" is
| useless. Restricting your tooling to "simple CSV files"
| is good. But my opinions are not very representative. I
| also think that it is perfectly acceptable for a shell
| script to fail badly when it encounters filenames with
| spaces.
| Y_Y wrote:
| I assumed GP was referring to the filename extension. Of
| course people put all sorts of nonsense into files and then
| call them ".csv", but it's not a bad heuristic to use that to
| guess the format as a default, and even guess that the output
| format should be the same, in the absence of other flags.
| enriquto wrote:
| > It's super annoying if a tool works on a 1K line CSV file,
| but breaks down if I have a 3 line file because it can't
| infer the type.
|
| How can that be? Is it possible to have such an ambiguous
| file? I mean, if a file contains a single number on a single
| name it can be anything, but the interpretation is the same.
| Can you create a file that has different contents depending
| on whether it is interpreted as csv or tsv?
| actually_a_dog wrote:
| Not only is it possible, as a couple of commenters have
| already shown, but, due to the many variants of CSV out
| there, it's possible to construct a CSV that has different
| contents depending on which dialect you tell your CSV
| reader to expect. I'll leave the actual construction as an
| exercise to the reader, but, it would work along the same
| lines as the ambiguous TSV/CSV files you've seen here
| already.
| faho wrote:
| >Can you create a file that has different contents
| depending on whether it is interpreted as csv or tsv?
|
| Easily! a\tb,c
|
| Is "a\tb" and "c" as a csv, but "a" and "b,c" as a tsv!
|
| In practice, if the first line contains only a tab or a
| comma it might be enough to infer that as the separator,
| but:
|
| 1. that would fail on single-column files (by
| misinterpreting them as multi-column if the unused
| separator appears) 2. that couldn't infer anything on files
| where both separators appear
|
| So it would only be a 90% (or maybe 99%) solution.
| darkwater wrote:
| If it's a single column file, you - the user - should
| know it and act accordingly. Yeah scripting usage on
| random input, II know, but in that case then you would
| specify the input type. But heuristic defaulting to the
| most common usage in case of doubt (that would be CSV in
| the CSV/TSV doubt) for the interactive use is the way to
| go. Or at least, it's what I would expect personally as a
| user.
| kzrdude wrote:
| Heuristics are annoying and data-dependent algo changes
| are dangerous.
|
| Completely different example, but limits in Splunk
| aggregations - it means you can run your report on small
| data, but when you scale it up (to real production data
| sizes, maybe), then suddenly you get wrong numbers, and
| maybe results like "0 errors of type X" when the real
| answer is that there are >=1 errors. Because one of the
| aggregations used has a window size limit that it is
| silently applying. This stuff is dangerous.
|
| What Splunk was doing for me would be the equivalent of
| an SQL join giving approximate answers when the data is
| too big.
| faho wrote:
| I can see that argument, but I kinda don't agree?
|
| The issue with the heuristic is that it can fail
| depending on the input, and the input can _easily_ change
| in a way that kills the heuristic.
|
| Say you run the tool on a file, and it detects csv input
| and all is well. Then you update the file, and now it
| includes a tab character in the first line and the
| heuristic detects it as a tsv and now it fails - or the
| heuristic now gives up, or whatever.
|
| Sure you can "improve the heuristic", but you can still,
| always, have the _data_ change in a way that it defeats
| it. You now need to either be careful with the data or
| _know_ that you should specify the format, without the
| tool telling you. Everything seems to work immediately,
| and then later it blows up. That 's a problem akin to
| e.g. bash and filenames with spaces. Everything works,
| until someone has a space in a filename, and then you get
| told that you should have known to quote everything all
| along (the solution there, would be to abolish word
| splitting).
|
| To coin a pithy phrase: When a tool is easy to misuse and
| a user misuses it, blame the tool, not the user.
|
| Now, if I were writing this thing I would make the logic
| much simpler: Make it default to csv (or whichever format
| is more common). Now the way to break the "heuristic" is
| to give data in the wrong format. But if you use csv, you
| don't have to explicitly give the format and your data
| can't break the heuristic (unless it switches format,
| which you would know about).
| mellavora wrote:
| Sure. It is typical to represent currency amounts as
|
| 12,000
|
| If it isn't quoted, then a csv will read the comma as a
| separator, while a TSV won't
|
| amazing number of similar examples. I went for a real use-
| case rather than a theoretical possibility
| qwerty456127 wrote:
| If a text file has the same number of commas, tabs or
| semicolons on every line it most probably is (but, obviously,
| is not guaranteed to be) a CSV/TSV/SSV.
|
| Defining every flavour of these is hardly possible with a
| simple command line so I would rather let the user to specify
| an entire configuration file for this. We probably need an
| entire CSV schema language.
| carlmr wrote:
| That's already not true for Quoted entries that contain the
| separator. Which I think is a common CSV use case.
|
| But I do agree you could have a heuristic. E.g. ends in
| .csv and contains a lot more commas/semicolons/tabs than
| you would expect in normal text in the first 1-5 lines.
|
| You could still have the flag as a fallback when you need
| something that's completely reliable.
| citilife wrote:
| Not exactly the same, but we wrote a library to easily load any
| delimited type of file and finds header (even if not first row).
| It also works to load JSON, Parquet, AVRO and loads it into a
| dataframe. Not CLI exactly, but pretty easy:
|
| https://github.com/capitalone/dataprofiler
|
| Can Install: _pip install dataprofiler[ml] --user_
|
| How it works:
|
| csv_data = Data('your_file.csv') # Load: delimited, JSON,
| Parquet, Avro
|
| csv_data.data.head(10) # Get head
|
| csv_data.data.sort_values(by='name', inplace=True) # Sort
|
| Anyway, pretty interesting Miller CLI. I'm not sure how the
| header detection works, especially if the header isn't the first
| row (which is often the case)
| snthpy wrote:
| Thanks. This looks very cool.
| nwmcsween wrote:
| A bit OT but there really need to be a new paradigm shift for a
| new OS interface that has less coupling and more flexibility
| compared to piping.
| otabdeveloper4 wrote:
| See also my own foray into this space:
| https://tkatchev.bitbucket.io/
|
| You might find it useful.
| natrys wrote:
| Thanks for tab. I have used it a bit in the pipeline of some of
| my projects. Like jq, I think tab is another good example of a
| DSL that allows tersely and quickly getting things done. Same
| as Perl I suppose, but cleaner and more functional. Raku is
| also great, but it adds big dependencies whereas tab is single
| binary which is always nice.
|
| If you are taking feature request, string interpolation or a
| sprintf function would make report generation kind of things
| easier for me.
| calvinmorrison wrote:
| Just yesterday I was opining in the IRC about what it would be
| like to have a operating system that was build from the ground up
| that supports smarter data formats than just text.
|
| The concept of everything as a file, has awesome benefits I
| believe, but that doesn't meant files need to be unstructured
| text.
|
| There's a core problem, that with text languages (like AWK or
| structured regex (sed, etc) you end up both having to parse AND
| manipulate data. which is no fun and prone to errors.
|
| Abstracting away all of that into codecs that all of coreutils
| could speak would be very cool to me.
|
| The second issue is structured data vs unstructured text. CSVs,
| or other table based formats make sense sometimes, and sometimes
| you want to be able to query things more easily. JSON provides
| that. What's the synthesis of both I don't know... but maybe this
| is closer to heaven.
|
| I'd like to _minimize_ the amount of parsing or text manipulation
| done in any program and be able to focus on extracting and
| inserting the data I need, like a rosetta stone that just handles
| everything for me while I work. I want to be able to do:
|
| awk "{print $schoolName}" or awk "{print $4}" and it just works -
| json, text, CSV.
| zokier wrote:
| Maybe the most notable system that had "files as structured
| data" is the IBM S/360 lineage of mainframes.
|
| > In the context of IBM mainframe computers in the S/360 line,
| a data set (IBM preferred) or dataset is a computer file having
| a record organization.
|
| > Data sets are not unstructured streams of bytes, but rather
| are organized in various logical record[3] and block structures
| determined by the DSORG (data set organization), RECFM (record
| format), and other parameters
|
| https://en.wikipedia.org/wiki/Data_set_(IBM_mainframe)
|
| Remarkably S/360 predates UNIX by several years, so its
| definitely not a novel concept. Albeit apparently IBM systems
| cost order of magnitude more than PDPs of the time.
|
| Of course there are also classic LISP machines where afaik
| pretty much everything was a SEXPR, but I have even less
| knowledge about the details of those.
| brushfoot wrote:
| You should check out PowerShell. It's based on exactly that
| concept. Everything is an object.
|
| `Import-Csv` and `ConvertFrom-Json` return objects that you can
| manipulate using LINQ or PowerShell's `*-Object` cmdlets
| (`Select-Object`, `Where-Object`, `Group-Object`, etc.). `Get-
| Process` returns objects representing your system's processes,
| `Get-Service` returns objects for your system's services, and
| so on.
|
| Being able to chain that in your pipeline is incredibly
| powerful. You can send data into and out of CSVs and various
| other providers in a single line. It's a huge productivity
| boost.
| platz wrote:
| so, powershell?
| kybernetikos wrote:
| Sounds like it. Another option might be nushell.
| https://www.nushell.sh/
| oauea wrote:
| You should take a look at the late TempleOS. It's not a serious
| contender but has a lot of interesting ideas in that direction
| pisipisipisi wrote:
| He is dead now. RIP
| throw0101a wrote:
| > _Like Awk, sed, cut, join, and sort_ [...]
|
| So basically Perl? :)
| traceroute66 wrote:
| Is this (supposedly) better than Burntsushi toolset (e.g. XSV[1])
| or just different ?
|
| [1] https://github.com/BurntSushi/xsv
| chaosfox wrote:
| I love xsv, but its been 3 years since last release and the PRs
| are accumulating.. miller seems much more active.
| mmahemoff wrote:
| I don't think Sed and Awk should be on this list. It's not really
| able to transform text into something else, is it?
|
| Even the tutorial leans on sed for cleansing ahead of Miller.
| https://www.ict4g.net/adolfo/notes/data-analysis/miller-quic...
| majkinetor wrote:
| Yes it can.
|
| https://miller.readthedocs.io/en/latest/features.html
| lenkite wrote:
| I just prefer using R data-frames - the convenience of a full
| data-science language shell for getting statistical data out of
| CSV. One can also make gorgeous graphs using ggplot2.
| awild wrote:
| I make a lot of use of Miller at work and it has been a godsend.
| People tend to underestimate tools like this, but I'm able to
| answer a lot of questions about our data that other people would
| take longer for or don't have the capacity to check. It's great
| as a first step and if anything more complex is needed we usually
| spin up a python script. So it's very much in the philosophy of
| small composable tools but for records.
| lazyeye wrote:
| For windows users NimbleText is a great little utility for
| working with tabular data:-
|
| https://nimbletext.com/
| muad_kyrlach wrote:
| Got a csv or json you need to eff-with from the command line?
| It's Miller time, baby!
| dima55 wrote:
| Similar, but using the ACTUAL awk, sed, join, sort tools you
| already have and know about: https://github.com/dkogan/vnlog/
| sam_goody wrote:
| How does this compare to JQ?
|
| Aside for that it appears to support more formats (CSV [which has
| partial JQ support], and TSV), of course.
|
| I find it very helpful when people compare their new toy or
| service with what is already pretty well known.
| awild wrote:
| Miller is not new (at least 6 years old is what I can gleam
| from the Github page).
|
| It is somewhat similar to JQ, but specifically for record based
| formats and not documents.
|
| The example page [0] has quite a few nice examples.
|
| A recent snippet from my own usage, looking for duplicate ids
| in a large-ish csv.gz: mlr uniq -c -g id then
| filter '$count > 1' mlr --ijson --ocsv filter
| '$success' then stats1 -a mean,min,max,stddev -f duration
|
| [0]: https://miller.readthedocs.io/en/latest/data-examples.html
| flakiness wrote:
| The comparison between their ongoing C version [1] and the
| upcoming go version [2] is nice to see. The README talks a bit
| more the performance comparison and the C version's strength, but
| still.
|
| [1] https://github.com/johnkerl/miller/tree/main/c
|
| [2] https://github.com/johnkerl/miller/tree/main/go
| nyolfen wrote:
| i recently came across `jq`, which i use for this purpose
| nyolfen wrote:
| i recently came across jq, which i use for json parsing:
| https://stedolan.github.io/jq/
| xvilka wrote:
| There's also rq (record query)[1] that also supports CSV and JSON
| but not TSV though. It's written in Rust.
|
| [1] https://github.com/dflemstr/rq
| yung_steezy wrote:
| I always use sqlite3 for working with large CSV files:
| $ sqlite3 $ .mode csv $ .import my.csv foo
| $ SELECT * FROM foo WHERE name = 'bar';
|
| It reads the header in automatically for the field names and then
| stores all the values as strings.
| psanford wrote:
| I do the same. I wrote[0] a little wrapper around this workflow
| that does a few additional niceties: 1) normalizes column names
| and removes whitespace, 2) supports importing multiple csv
| files without importing the header line as a record.
|
| [0]: https://github.com/psanford/csv2sqlite
| noisy_boy wrote:
| Great tip. A simple utility script can be used for repeated
| use/customization: https://pastebin.com/qEkimk7W
|
| Another advantage is that this can be executed to import as
| part of initialization and get to sqlite prompt with the table
| ready for sql execution: $ csv_to_table.sh
| sample-test.csv
|
| Or it can be executed using heredoc for non-interactive
| execution as well: $ csv_to_table.sh sample-
| test.csv <<-CMD select * from sample_test limit 10;
| CMD ... output ...
| sdmac99 wrote:
| duckdb is another good option as well.
| inanutshellus wrote:
| holy moly, that's rad. definitely adding this to my toolbox,
| thanks!
| tyingq wrote:
| This stack exchange answer makes that into a fairly simple bash
| alias: https://unix.stackexchange.com/questions/424555/how-to-
| inser...
| greazy wrote:
| This is extremely ugly, I have no idea why you'd do this. If
| you're going to call a tool in interactive mode on the CLI,
| then use R for proper data handling.
| [deleted]
| halayli wrote:
| If there is anything ugly here, it's your communication.
|
| Aside from the fact that OP's example can run in a non-
| interactive mode, it makes absolutely no difference to the
| discussion.
| skrebbel wrote:
| You can't understand that there are people who know SQL and
| not R?
| momirlan wrote:
| I don't think the comment refers to people, just to better
| ways of doing things. People learn.
| [deleted]
| lugged wrote:
| In my experience people rarely learn anything after you
| insult them and call their work ugly.
|
| If AP had a "better way of doing things" maybe they
| should have actually given us an example of what it looks
| like.
|
| Right now, all I think about R is that snobby,
| gatekeeping assholes use it to do things I can already do
| easily in SQL.
| majkinetor wrote:
| Not bad. Here is the Windows equivalent with Excel separator:
| PS> gsudo choco install sqlite3 PS > sqlite3
| sqlite> .mode csv sqlite> .separator ; \n
| sqlite> .import my.csv foo sqlite> SELECT * FROM foo
| WHERE name = 'bar';
|
| For CSV that contain up to several 100k records native
| PowerShell solution is way faster and more practical:
| Get-Content my.csv | ConvertFrom-Csv | ? name -eq bar
|
| FYI, I maintain miller choco package:
| https://community.chocolatey.org/packages/miller
| simonw wrote:
| I added a new feature to my sqlite-utils CLI tool in June which
| covers similar ground: it now lets you load CSV, TSV and JSON
| data into an in-memory SQLite database and execute SQL queries
| against it: https://simonwillison.net/2021/Jun/19/sqlite-utils-
| memory/
|
| Example usage: sqlite-utils memory example.csv
| "select * from t order by color, shape" # Defaults to
| outputting JSON, you can add # --csv or --tsv for those
| formats or # --table to output as a rendered table
|
| More docs here: https://sqlite-
| utils.datasette.io/en/stable/cli.html#queryin...
| chrisweekly wrote:
| See LNAV (https://lnav.org) for a "mini-ETL" CLI powertool with
| embedded SQLite; it's fantastic.
| smashah wrote:
| Nice. Any reason why installation section is not prominent and on
| the top of the readme?
|
| It's even in an non-intuitive place in the documentation.
| brushfoot wrote:
| If you have Windows, you have a tool that can do this already
| installed: PowerShell. (If you don't have Windows, PowerShell is
| free and open source [MIT license]. One of the first things I
| install on a new Linux or macOS box lately.)
|
| The first example in PowerShell:
|
| > Import-Csv example.csv | Sort-Object Color, Shape
|
| To filter, use `Where-Object`:
|
| > Import-Csv example.csv | Where-Object { $_.Color -eq 'purple' }
| | Sort-Object Color, Shape
|
| Or using aliases:
|
| > ipcsv example.csv | ? { $_.Color -eq 'purple' } | sort Color,
| Shape
|
| JSON uses the same syntax -- just replace `Import-Csv` with `Get-
| Content` and `ConvertFrom-Json`:
|
| > Get-Content example.json | ConvertFrom-Json | Select-Object
| Color -Unique
|
| > purple
|
| > red
|
| > yellow
|
| There's also `Group-Object` for aggregation.
| csmarshall wrote:
| I don't know about MillerCLI's portability, but RecordStream
| (https://github.com/benbernard/RecordStream) is my go to swiss
| army knife.
| msluyter wrote:
| Some other useful tools for csv files, in particular:
|
| csvkit, a set of command line tools for manipulating csvs:
| https://csvkit.readthedocs.io/en/latest/
|
| visdata, a quite terminal based csv explorer:
| https://www.visidata.org/
| mxuribe wrote:
| visdata is pretty mind-blowing...at least, that's how i felt
| when i saw the demo the first time:
| https://www.youtube.com/watch?v=N1CBDTgGtOU
___________________________________________________________________
(page generated 2021-08-25 23:02 UTC)