[HN Gopher] Miller: Like Awk, sed, cut, join, and sort for CSV, ...
___________________________________________________________________
Miller: Like Awk, sed, cut, join, and sort for CSV, TSV, and
tabular JSON
Author : ingve
Score : 258 points
Date : 2023-03-16 09:21 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jbverschoor wrote:
| Seems cool. Maybe a rainbow / alternating color per column would
| be nice
| 0xbadcafebee wrote:
| Csvkit is the usual way to handle working with CSVs (or Excel, or
| JSON, or.... Postgres):
| https://csvkit.readthedocs.io/en/latest/index.html#why-csvki...
|
| _brew install csvkit_ and enjoy
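  |
  | e.g., to pull a couple of columns out of a CSV and pretty-print
  | them, a rough sketch (untested; the file and column names are
  | just placeholders) with csvkit's csvcut and csvlook:
  |
  |     $ csvcut -c color,shape example.csv | csvlook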
| carb wrote:
  | This is nice. I use `column` for pretty printing CSV/TSV, but
  | this fixes two tiny gaps in `sort` (skipping header lines) and
  | `jq` (parsing CSV input; `jq` supports `@csv` for output
  | conversion but not input).
  |
  |     $ cat example.csv
  |     color,shape,flag,index
  |     yellow,triangle,1,11
  |     red,square,1,15
  |     red,circle,1,16
  |     red,square,0,48
  |     purple,triangle,0,51
  |     red,square,0,77
  |
  |     # pretty printing
  |     $ column -ts',' example.csv
  |     color   shape     flag  index
  |     yellow  triangle  1     11
  |     red     square    1     15
  |     red     circle    1     16
  |     red     square    0     48
  |     purple  triangle  0     51
  |     red     square    0     77
  |
  |     # sorting with skipped headers is a mess.
  |     $ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
  |     color   shape     flag  index
  |     red     square    0     77
  |     purple  triangle  0     51
  |     red     square    0     48
  |     red     circle    1     16
  |     red     square    1     15
  |     yellow  triangle  1     11
| asicsp wrote:
| Another option (based on
| https://unix.stackexchange.com/q/11856/109046):
  |
  |     (sed -u '1q' ; sort -r -k4 -t',') <example.csv | column -ts','
| 2h wrote:
| > sorting with skipped headers is a mess
|
  | I like these command line tools, but I think they can cripple
  | someone actually learning a programming language. For example,
| here is a short program that does your last example:
|
| https://go.dev/play/p/9bASZ97lLWv
| cosmojg wrote:
| It's a matter of perspective.
|
| I like programming languages, but I think they can cripple
| someone actually learning Unix!
|
| At the end of the day, you should just use whatever tools
| make you the most productive most quickly.
| gabrielsroka wrote:
  | I thought the Go code looked way too complex and Python would
  | be simpler. Yes and no.
  |
  |     import csv
  |     filename = 'example.csv'
  |     sort_by = 'index'
  |     reverse = True
  |     with open(filename) as f:
  |         lines = [d for d in csv.DictReader(f)]
  |     for line in lines:
  |         line['index'] = int(line['index'])
  |     lines.sort(key=lambda line: line[sort_by], reverse=reverse)
  |     print(','.join(lines[0].keys()))
  |     for line in lines:
  |         print(','.join(str(v) for v in line.values()))
| cmdlineluser wrote:
  | Perhaps a `DictWriter` would simplify things:
  |
  |     import csv
  |     import sys
  |     filename = "example.csv"
  |     sort_by = "index"
  |     reverse = True
  |     with open(filename, newline="") as f:
  |         reader = csv.DictReader(f)
  |         writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
  |         writer.writeheader()
  |         writer.writerows(sorted(reader, key=lambda row: int(row[sort_by]), reverse=reverse))
| gabrielsroka wrote:
| I thought about that but 1) it seemed like cheating to
| write to standard out, 2) you're assuming that the column
| to sort by is an integer whereas I broke that code up a
| little bit.
|
| But yours has the advantage of being able to support more
| complex CSVs.
| [deleted]
| coldtea wrote:
| The whole point of UNIX userland is to not have to write a
| custom program for every simple case that just needs
| recombining some existing basic programs in a pipeline...
| 2h wrote:
| > simple case
|
  | That's just it, though: the last example is not a simple
  | case, which is why it's awkward by the commenter's own
  | admission. Command line tools are fine, but
| you need to know when to set the hammer down and pick up
| the chainsaw.
| carb wrote:
  | To me the last example is still simple. When I encounter
  | this in the wild, I don't really care about preserving
  | the header:
  |
  |     tail -n +2 example.csv | sort -r -k4 -t','
  |
  | Or more often, I just do this and ignore the header:
  |
  |     sort -r -k4 -t',' example.csv
  |
  | Keeping the header feels awkward, but using `sort` to
  | reverse sort by a specific column is still quicker to
  | type and execute (for me) than writing a program.
| coldtea wrote:
| > _the last example is not a simple case_
|
| As far as shell scripting goes, this is hardly anything
| to write home about. Looks simple enough to me.
|
| It just retains the header by printing the header first
| as is, and then sorting the lines after the header. It's
| immediately obvious how to do it to anybody who knows
| about head and tail.
|
| And with Miller it's even simpler than that, still on the
| command line...
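  |
  | Something along these lines, if I'm reading Miller's docs right
  | (`sort -nr` being its numeric-descending sort; treat this as a
  | sketch):
  |
  |     $ mlr --icsv --opprint sort -nr index example.csv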
| darrenf wrote:
| > `jq` supports `@csv` for output conversion but not input
|
  | Actually, `jq` _can_ cope with trivial CSV input like your
  | example - `jq -R 'split(",")'` will turn a CSV into an array
  | of arrays. To then sort it in reverse order by the index column
  | and retain the header, the following fell out of my fingers (I'm
  | beyond certain that a more skilled `jq` user than me could
  | improve it):
  |
  |     $ jq -R 'split(",")' example.csv | jq -sr '[[.[0]],.[1:]|sort_by(.[3])|reverse|.[]]|.[]|@csv'
  |     "color","shape","flag","index"
  |     "red","square","0","77"
  |     "purple","triangle","0","51"
  |     "red","square","0","48"
  |     "red","circle","1","16"
  |     "red","square","1","15"
  |     "yellow","triangle","1","11"
|
| NB. there is also an entry in the `jq` cookbook for parsing
| CSVs into arrays of objects (and keeping numbers as numbers,
| dealing with nulls, etc)
| https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-f...
| melx wrote:
| +1 for easy install (`dnf install miller` on my Fedora).
|
  | But it seems like it can't handle a simple use case: CSV without
  | a header.
|
| $ mlr --csv head -n 20 pp-2002.csv
|
| mlr: unacceptable empty CSV key at file "pp-2002.csv" line 1.
|
  | You have to pass it explicitly (FYI, `implicit-csv-header` is a
  | terrible arg name):
|
| $ mlr --csv --implicit-csv-header head -n 20 pp-2002.csv
|
  | While plain `head` rightly obliges:
|
| $ head -n 20 pp-2002.csv
| kbouck wrote:
| you could shorten to:
|
| $ mlr -N --csv head -n 20 pp-2002.csv
|
  | -N is a shortcut for --implicit-csv-header and
  | --headerless-csv-output
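  |
  | And if you want to give the headerless columns names so that
  | later verbs can refer to them, something like this should work
  | (a sketch using Miller's `label` verb; the column names here are
  | made up):
  |
  |     $ mlr --csv --implicit-csv-header label date,price,town then head -n 20 pp-2002.csv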
| coldtea wrote:
| Well, head just reads lines. A CSV row can optionally be
| multiple lines.
|
  | Also, head doesn't do anything with the data or the format,
  | aside from printing line by line, so it doesn't need to know any
  | column names.
| melx wrote:
  | head reads a CSV row that has multiple lines.
|
| Miller is being offered as "like awk, sed, head.." (emphasis
| on head - mine) and yes it offers more, but it does not
| behave "like" the *nix tools it refers to.
| 0cf8612b2e1e wrote:
| Miller is designed around the idea of structured data. This
| is a higher bar than naively manipulating text, and the
| user has to be more deliberate to extract fields. Doing cli
| manipulation of a csv that contains quoted commas is
| challenging with the standard tools.
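  |
  | A tiny illustration of that (a sketch; the file and its contents
  | are made up):
  |
  |     $ printf 'name,notes\n"Smith, Jr.","likes, commas"\n' > quoted.csv
  |     $ cut -d, -f2 quoted.csv               # cut splits inside the quotes
  |     notes
  |      Jr."
  |     $ mlr --icsv --opprint cat quoted.csv  # mlr honours the quoting
  |     name       notes
  |     Smith, Jr. likes, commas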
| [deleted]
| ar9av wrote:
  | Looks cool! It'd help to have some fuller examples (including
  | output) of common use cases in the README. After
| looking through this, I'm still scratching my head trying to
| think of what problem this could solve in my own practice, and I
| work with a lot of csv files daily
| mdaniel wrote:
| In some sense, there's no answer to your question because
| everyone's tool workflow is their own.
|
  | That said, the most recent invocation for me was `mlr --icsv
  | --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix
  | up a CloudTrail csv. After that I happen to have more muscle
  | memory with jq, but conceptually the next step could have been
  | `filter '$eventname == "CreateNodegroup"'`
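  |
  | i.e. something like this, all in one go (a sketch; the file and
  | field names are whatever your CloudTrail export actually uses):
  |
  |     mlr --icsv --ojson filter '$eventname == "CreateNodegroup"' then cut -f eventtime,eventname trail.csv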
| zX41ZdbW wrote:
| I recommend using clickhouse-local[1] for these tasks.
|
| It does SQL; it supports all imaginable data formats, streaming
| processing, and connecting to external data sources. It also
| outperforms every other tool[2].
|
| [1] https://clickhouse.com/blog/extracting-converting-
| querying-l...
|
| [2]
| https://colab.research.google.com/github/dcmoura/spyql/blob/...
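  |
  | For the example.csv upthread, roughly (a sketch from memory of
  | clickhouse-local's file() syntax, so double-check the format
  | name):
  |
  |     clickhouse-local --query "SELECT color, count() FROM file('example.csv', CSVWithNames) GROUP BY color"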
| krick wrote:
| > curl https://clickhouse.com/ | sh
|
| Jesus, this is disgusting. I'm not that picky and don't really
| complain about "... | sh" usually, but at least I took it for
| granted that I can always look at the script in the browser and
  | assume that it has no actual evil intentions and doesn't rely
| on some fucking client-header magic to be modified on the fly.
| cerved wrote:
  | pipe it through less
| zX41ZdbW wrote:
| https://clickhouse.com/docs/en/install/#available-
| installati...
| maximilianroos wrote:
  | Why be so cantankerous?
  |
  |     curl https://clickhouse.com/ > install.sh
  |     cat install.sh
| zX41ZdbW wrote:
  | We provide .deb, .rpm, .tgz, Docker, and single-binary packages,
  | for x86-64 and AArch64, on Linux, Mac, and FreeBSD.
| jsyolo wrote:
| or sqlite for CSV/TSV, haven't tried it for json.
| zX41ZdbW wrote:
  | How does it look from the command line for streaming processing
  | of CSV/TSV?
| [deleted]
| jsyolo wrote:
| you can pipe to it
| nousermane wrote:
  | Yep:
  |
  |     $ cat foo.tsv
  |     name    foo     bar
  |     Alice   10      8888
  |     Bob     20      9999
  |     $ cat foo.tsv | sqlite3 -batch \
  |         -cmd ".mode tabs" \
  |         -cmd ".import /dev/stdin x" \
  |         -cmd "select foo from x where bar > 9000;"
  |     20
| ydant wrote:
  | duckdb and being able to write:
  |
  |     select a,b,c from '*.jsonl.gz'
|
| has been a huge improvement to my workflows.
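  |
  | Works straight from the shell too, e.g. (a sketch; duckdb's CLI
  | takes a one-shot query with -c and infers the format from the
  | file name):
  |
  |     duckdb -c "select color, count(*) from 'example.csv' group by color"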
| avodonosov wrote:
  | How is it better than SQL for these tasks on tabular data?
  |
  | That's the first question I have after reading the title. Haven't
  | read the article.
  |
  | Edited the first sentence. Originally it was "Why is it not
  | called SQL if it works on tabular data?".
  |
  | My point is, if one wants cat, sort, sed, join on tabular data,
  | SQL is exactly that. Awk is too powerful, not sure about it.
| coldtea wrote:
  | > _How is it better than SQL for these tasks on tabular data?_
|
  | It has far better handling of a CSV/TSV file directly on the
  | command line, and is composable in shell pipelines.
|
  | > _My point is, if one wants cat, sort, sed, join on tabular
  | data, SQL is exactly that._
|
| SQL is a language for data in the form of tables in relational
| databases.
|
  | While it can do sorting or joining or some changes, it is meant
  | for a different domain than these tools, with other
  | constraints, other concerns, and other patterns of use...
|
| You don't need to load anything to a db, for starters.
|
  | You also normally don't want to involve a DB just to use it in
  | a shell script, or for quick shell exploration.
|
| You also can't mix SQL and regular unix userland in a pipeline
| (well, with enough effort you can, but it's not something
| people do or need to do).
| holy_diver wrote:
| DuckDB is actually pretty good at this kind of thing.
|
| Doesn't need to load anything to DB
|
| Can be used in shell
|
| Can read from stdin and write to stdout
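  |
  | e.g. something like this (a sketch; `read_csv_auto` over
  | /dev/stdin is the trick I've seen for piping into it):
  |
  |     cat example.csv | duckdb -c "select * from read_csv_auto('/dev/stdin') where flag = 1"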
| avodonosov wrote:
| csvsql is the first google result for "sql in command line
| for csv"
|
| https://towardsdatascience.com/analyze-csvs-with-sql-in-
| comm...
| coldtea wrote:
  | Yes, I know. It's not that there aren't several ways to do
  | it, it's that it's not really a good fit for the command
  | line, except in the "I want to reuse SQL that I already
  | know" sense.
  |
  | The problem isn't in having a way to use SQL to query the
  | data from the command line, it's that SQL is long-winded,
  | with syntax that isn't really a fit for a traditional shell
  | pipeline.
| LMMojo wrote:
  | Because SQL (Structured Query Language) is the language used to
  | access/manipulate the data, not the name for that kind of data.
  | There have been, and are, many databases which essentially use
  | tabular data but which are not SQL databases.
| garciasn wrote:
| 1. Because it doesn't use SQL syntax.
|
| 2. Because it's closer to an amalgamation of the standard shell
| scripting tools (cut, sort, jq, etc) than it is to a SQL
| variant.
| snidane wrote:
| Great tool.
|
| BUT, leaks memory like crazy.
|
| Despite documentation stating the verbs are fully streaming.
|
  | > Fully streaming verbs
  |
  | > These don't retain any state from one record to the next. They
  | > are memory-friendly, and they don't wait for end of input to
  | > produce their output.
|
| https://miller.readthedocs.io/en/6.7.0/streaming-and-memory/...
| coldtea wrote:
| > _BUT, leaks memory like crazy._
|
| Huh?
|
| a) Isn't it written in Golang, which has a GC? Does it do
| custom buffer based management?
|
  | b) Isn't it supposed to be run on a file and get some output -
  | as opposed to an interactive session? Why would it matter if it
  | leaks, then, and how could it leak, as the memory is returned
  | to the OS when it ends?
| john_kerl wrote:
| It's indeed written in Go now, which indeed has GC, and
| ultimately all memory is freed ... but there are indeed some
| issues around intermediate retention of memory, taking more
| memory than one would have expected.
|
| Some gains were made on
| https://github.com/johnkerl/miller/pull/1133 and
| https://github.com/johnkerl/miller/pull/1132
| john_kerl wrote:
| See also https://github.com/johnkerl/miller/issues/1119
| john_kerl wrote:
| And https://github.com/johnkerl/miller/pull/1131
| hindsightbias wrote:
  | Shell tool favorite from the past:
  | https://github.com/stefan-schroedl/tabulator
| pcthrowaway wrote:
| Miller is a welcome addition to my toolbox. I just wish it was as
| 'easy' to use as jq... part of that might be how infrequently I
| work with CSVs
| coldtea wrote:
| Huh? jq has far more difficult syntax...
| hesdeadjim wrote:
| Yea agreed. The moment I need to do anything complicated I
| start getting this gut feeling I'd be better off just writing
| a quick Go cli tool.
| mywittyname wrote:
| The syntax is pretty crazy, so there's a learning curve. But
  | once you're over the hump, jq is easy to use for
| transformations or "querying" of large volumes of json data.
| password4321 wrote:
| Discussion of a similar tool last month:
|
| _yq: command-line YAML, JSON, XML, CSV and properties processor_
|
| https://news.ycombinator.com/item?id=34656022
|
  | Also mentions gojq, Benthos, xsv, Damsel, a 2nd yq, htmlq,
  | cfn-flip, csvq, zq, and zsv.
| elesiuta wrote:
| Another similar discussion last year where Miller was
| mentioned:
|
| _New(ish) command line tools_
|
| https://news.ycombinator.com/item?id=31009313
| networked wrote:
| I have made a repository cataloguing tools like this:
| https://github.com/dbohdan/structured-text-tools.
| glomgril wrote:
| Big shouts to visidata -- very underrated tool imo.
| jefftk wrote:
| There's also GoAWK, which supports CSV natively ("-i csv"):
| https://github.com/benhoyt/goawk
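  |
  | With the header flag you can even address fields by name,
  | something like this (a sketch from memory of the GoAWK docs, so
  | treat the -H flag and @"name" syntax as unverified):
  |
  |     goawk -i csv -H '@"flag" == 1 { print @"color" }' example.csv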
| horseRad wrote:
  | You should include qsv (https://github.com/jqnatividad/qsv)
  | and goawk (https://github.com/benhoyt/goawk) in the CSV
  | section! GoAWK got a CSV/TSV mode, and qsv is loaded with
  | features and under active development.
| account-5 wrote:
| Thanks for this, good job. I always mean to do something
| similar but just end up bookmarking HN threads I then never
| look at.
| samstave wrote:
| Guilty as charged... I have so many bookmarks I forget to
| look at...
|
| APPlet Idea: an app that picks a random bookmark from your
| saves and sends you a reminder to click it and read it...
| and schedule when you want to see it - like every 7AM send
| me a link from my bookmarks. Set an alarm for 15 minutes to
| get me back on schedule.
| elesiuta wrote:
| Thank you! That list is perfect since I don't use these tools
| often enough to remember them and some are hard to find again
| without knowing exactly what to search for.
|
| I also discovered pawk on it which looks interesting, I made
| a somewhat related tool [1] which is probably out of scope
| for your list, but can be used in some of the same ways [2]
  | and may be of interest nonetheless.
|
| [1] https://github.com/elesiuta/pyxargs
|
  | [2]
  |
  |     cat /etc/hosts | pyxargs -d \n --im json --pre "d={}" --post "print(dumps(d))" --py "d['{}'.split()[0]] = '{}'.split()[1]"
| asicsp wrote:
| Seems like you meant https://github.com/TomWright/dasel
| (instead of damsel)
| password4321 wrote:
| Correct; thanks!
| dmix wrote:
| Anyone else find jq really hard to use? I found the
| syntax/interface bewildering at times. Although not as bad as
| awk.
| donio wrote:
| Benthos is a really cool tool when you want to take this idea
| to the next level. It can do the usual text/csv/json/yaml
| mangling you'd do with awk or jq (since it includes a version
  | of those in addition to its own processing language) but it
| also has decoders and encoders for a bunch of different binary
| formats like protobuf or Avro. And in addition to stdin/out and
| files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other
| stuff. I was able to put together a log processor that consumes
| from Kafka, does a bunch of field mangling and then ingests
| into Elasticsearch in a couple dozen lines of yaml.
| nextaccountic wrote:
| the obvious missing reference is jq
| https://github.com/stedolan/jq
| mdaniel wrote:
| I've been getting _a lot_ of mileage out of
| https://github.com/itchyny/gojq#readme recently due to two
| things: its _vastly_ superior error messages and the
| (regrettably verbose) `--yaml-input` option
|
| I also have https://github.com/01mf02/jaq#readme installed
| but just haven't needed it
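  |
  | The yaml-input bit is handy for quick manifest spelunking, e.g.
  | (a sketch; the path expression is whatever your YAML actually
  | contains):
  |
  |     gojq --yaml-input '.spec.template.spec.containers[].image' deployment.yaml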
| donio wrote:
  | -y and -Y are unused, so I patch those in as short flags for
  | --yaml-input/--yaml-output in my copy.
| password4321 wrote:
| True, for me that one's a given!
| dima55 wrote:
| There's also https://github.com/dkogan/vnlog/ which is a
| wrapper around the existing coreutils, so all the options work,
| and there's nothing to learn
| fdgsdfogijq wrote:
  | I've been using ChatGPT to work with these types of files now.
| sitkack wrote:
| ChatGPT is pretty good at giving working code snippets for
| Pandas, you can describe the transforms you want and a sketch
| of the data (column names and sample rows) and it will usually
| give back working code.
| mcdonje wrote:
| While I get what they mean by "tabular JSON", that's a bit of a
| misnomer.
| didgetmaster wrote:
| I was wondering what they meant by that term as well. Does
| every JSON document in the array have to have the exact same
| structure (including the ordering of key-value pairs)? What
| happens if row #1000 introduces a new key-value pair not seen
| before? What if the value for a key is an array?
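  |
  | FWIW the docs describe how it handles heterogeneity: records may
  | gain or lose keys mid-stream, and there's an `unsparsify` verb
  | to pad them out before writing CSV. A sketch (untested):
  |
  |     mlr --ijson --ocsv unsparsify example.json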
| rustyf wrote:
| Even without aliases, I still prefer PowerShell on Windows. Once
| you've turned the text into an object, you can pipeline it to
| hell.
|
  |     Get-Content .\example.csv | ConvertFrom-Csv | Where-Object -Property color -eq red
| 7thaccount wrote:
| Ever try this on large files? A lot of PowerShell commands I
| make like this can take minutes to run when a combination of
| Linux commands and Awk might take a couple of seconds.
| unxdfa wrote:
| I much prefer this approach as you only have to parse stuff
| once during the processing pipeline rather than at every step.
|
| Unfortunately after working with it for a few years I utterly
| despise PowerShell for many other reasons.
| cbolton wrote:
| Can you share some of the reasons you don't like it?
| unxdfa wrote:
| Yes: memory ceiling is huge, WinRM is buggy and unreliable,
| performance is "variable", totally inconsistent handling of
| dates and times, being .Net it's UTF-16 internally so
| streams are painful to work with, escape characters argh,
| variable scoping is horrible, most of the API you have to
| hit to get stuff done is wrapping ancient bits of weird
| win32 and DCOM and WMI thus has inconsistent error handling
| and ability to actually do what it is told.
|
| I've probably forgotten a few things.
| hobs wrote:
  | While powershell has gotten a lot faster in the most recent
  | versions, it's still pretty slow for anything involving
  | computation; most of the time my equivalent Python code
  | beats the pants off of it.
|
| The expressiveness is nice, but oftentimes modules won't
| support it or require weird ways of using the data to get
| the performance you want (mostly by dropping out of the
| pipe.)
|
| The choices around Format- vs Out- vs Convert- are Very
| Confusing for new people and the "object in a shell but
| also text sometimes" way of displaying things is weird and
| until recently things like -NoTypeInformation or managing
| encoding in files was just pointlessly weird.
|
| The module support and package management is still entirely
| in the stone ages and I regularly see people patching in C#
| in strings to get the behavior they want.
|
| "Larger" modules tend to get Really Slow - the azure
| modules especially are just an example of how not to do it.
|
| The way it automatically unwraps collections is cool, but
| gets weird when you have a 1 vs many option for your
| output, and you might find yourself defensively casting
| things to lists or w/e.
|
  | The typing system in general is nice to get started with, but
  | declaring a type does not actually pin the variable's type, so
  | assignment can just break your entire world.
|
| There's still a lot to love about the language when you are
| getting things done in a windows environment its great to
| glue together the various pieces of the system but I find
| the code "breaks" more often than equivalent python code.
| shortrounddev wrote:
| My philosophy about any shell language is that if
| performance is a concern, then you should probably use a
| real programming language for it. Using shell scripting
| to handle batch processing tasks just creates
| dependencies on unmaintainable code.
| hobs wrote:
| Simple tools with simple rules will outlast most of the
| code we'll all build. Picking things like grep, awk, and
| sed means your knowledge will be widely applicable going
| forward, and many people caring about their performance
| both backwards and forwards in time means your shell can
| be pretty fast.
| msla wrote:
| If performance is too bad, you can only use the shell for
| toy examples, which means there's no reason to have a
| real shell at all and you might as well go back to
| COMMAND.COM or equivalent. It's taking the idea of Unix-
| style scriptable shells and Improving the implementation
| to the point of unusability.
| wankle wrote:
| No thank you. I appreciate the power, speed, simplicity and
| flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-
| Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.
| shortrounddev wrote:
| It's funny to me that people hate how verbose posh is when
| bash syntax looks like my cat walked over my keyboard
| stinos wrote:
| When I read I-Also-Don't-Want-To-Be-Locked-Into-This-
| Ridiculous-Syntax-Nightmare, bash and sed and cut and the
| likes indeed are the first things which come to mind. I
| really feel kind of bad sometimes for once having spent
| time learning them, only to later find out there are many
| alternatives and nearly all of them have a shorter learning
| curve. Many of these tools also have zero discoverability
| as well, meaning you effectively get locked into learning
| their syntax. And then another one for the next tool.
| Whereas at least in PS you can try to use common words and
| tab completion and sometimes get there before having to
| reach to 'how do I x in y'.
| chongli wrote:
| It's a matter of recognizing your use case. If you're going
| to write a program that you expect to maintain for years,
| sure, go ahead and make it as verbose as possible. Unix
| tools support this with long-form flags (usually prefixed
| with -- rather than -). On the other hand, if you're doing
| exploration and iterating interactively on the fly (which
| bash is best at) then you want very terse syntax to keep
| lines short.
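  |
  | e.g. the reverse sort from upthread, spelled both ways (a
  | sketch with GNU sort's long options):
  |
  |     sort -r -k4 -t',' example.csv
  |     sort --reverse --key=4 --field-separator=',' example.csv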
| password4321 wrote:
| Thank goodness for tab completion!
|
| I appreciate that it auto-corrects capitalization and slash
| direction.
| gabrielsroka wrote:
  | Here it is with aliases. `?` means Where-Object:
  |
  |     ipcsv example.csv | ? color -eq red
  |
  |     color shape  flag index
  |     ----- -----  ---- -----
  |     red   square 1    15
  |     red   circle 1    16
  |     red   square 0    48
  |     red   square 0    77
| MayeulC wrote:
| I agree that this is something missing on classic UNIX shells:
| typed output streams. I had this discussion a while back on HN,
| though I can't find it ATM (I wish there was a comment history
| search function). I am far from the first one who thought of
| that, and there are a few implementations of this idea.
|
| Searching for that comment, I came across relevant stories:
|
| https://news.ycombinator.com/item?id=27535357
|
| https://news.ycombinator.com/item?id=25225113
| asicsp wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=28298729 _(273 points | Aug
| 25, 2021 | 66 comments)_
| dang wrote:
| Thanks! Macroexpanded:
|
| _Miller - tool for querying, shaping, reformatting data in
| CSV, TSV, and JSON_ -
| https://news.ycombinator.com/item?id=29651871 - Dec 2021 (33
| comments)
|
| _Miller CLI - Like Awk, sed, cut, join, and sort for CSV, TSV
| and JSON_ - https://news.ycombinator.com/item?id=28298729 - Aug
| 2021 (66 comments)
|
| _Miller v5.0.0: Autodetected line-endings, in-place mode,
| user-defined functions_ -
| https://news.ycombinator.com/item?id=13751389 - Feb 2017 (20
| comments)
|
| _Miller is like sed, awk, cut, join, and sort for name-indexed
| data such as CSV_ -
| https://news.ycombinator.com/item?id=10066742 - Aug 2015 (76
| comments)
| [deleted]
___________________________________________________________________
(page generated 2023-03-16 23:00 UTC)