[HN Gopher] Desed: Demystify and debug your sed scripts
___________________________________________________________________
Desed: Demystify and debug your sed scripts
Author : asicsp
Score : 146 points
Date : 2024-09-05 04:46 UTC (18 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| hiAndrewQuinn wrote:
| I feel we're witnessing a resurgence of interest in 'nix default
| programs such as `sed` and `awk` in part because LLMs make it so
| much easier to get started in them, and because they really do
| exist _everywhere_ you might look. (The fact they were designed
| to be performant in bygone decades and are super-performant now
| as a result is also nice!)
|
| There is just something incredibly freeing about knowing you can
| sit down at a freshly-reinstalled box and do productive work
| without having to install a single thing on the box itself first.
|
| EDIT: https://hiandrewquinn.github.io/til-site/posts/what-
| programm... might be of interest if you want to know what you can
| work with right out of the box on Debian 12. Other distros might
| differ.
| WizardClickBoy wrote:
| 100% agree. I'm currently preparing several 10s of GBs of HTML
| in nested directories for static hosting via S3 and was
| floundering until Gippity recommended find + exec sed to me.
| I'm now batch fixing issues (think 'not enough "../" in 60000
| relative hrefs in nested directories') with a single command
| rather than writing scripts and feel like a wizard.
|
| These tools are things I've used before but always found
| painful and confusing. Being able to ask Gippity for detailed
| explanations of what is happening, in particular being able to
| paste a failing command and have it explain what the problem
| is, has been a game changer.
|
| In general, for those of us who never had a command line wizard
| colleague or mentor to show what is possible, LLMs are an
| absolute game changer both in terms of recommending tools and
| showing how to use them.
| godelski wrote:
| Primeagen detected
|
| I find him hard to listen to when he does things like this
| WizardClickBoy wrote:
| Primeagen is some kind of Youtuber? I am not familiar and
| don't understand what you are trying to convey here.
| 000ooo000 wrote:
| Guessing 'gippity' has been used by primeagen recently,
| so now you're gonna be tarred with the 18-23 React
| bootcamp graduate brush (at least that's who I imagine
| find him watchable).
| WizardClickBoy wrote:
| It's a case of convergent evolution - I don't know where
| I heard it first, but I asked GPT if it minded and it
| said "Of course, you can call me Gippity!", so I do,
| because it's more fun.
| poulpy123 wrote:
| yes, and a cringy one
| barrkel wrote:
| If you have a lot of files, consider find piped to xargs with
| -P for parallelism and -n to limit the number of files per
| parallel invocation.
|
| Only a tiny bit more complex but often an order of magnitude
| faster with today's CPUs.
|
| Use -print0 on find with -0 on xargs to handle spaces in
| filenames correctly.
|
| GNU parallel is another step up, but xargs is generally
| always to hand.
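|
| A minimal sketch of that pipeline, run against a throwaway directory
| (the paths and file contents are made up for illustration). It
| assumes an xargs that supports -0 and -P, which the GNU and BSD
| versions do:
|

```shell
# Hypothetical demo tree, including names with spaces.
demo=$(mktemp -d)
mkdir -p "$demo/a b/c"
printf '<a href="../../../x.html">x</a>\n' > "$demo/a b/c/page one.html"
printf '<a href="../../../y.html">y</a>\n' > "$demo/a b/two.html"

# -print0 / -0 pass NUL-delimited names, so spaces in paths are safe.
# -P 4 runs up to four sed processes at once; -n 50 caps files per call.
# -i.bak (suffix attached to -i) is accepted by both GNU and BSD sed.
find "$demo" -type f -name '*.html' -print0 |
  xargs -0 -P 4 -n 50 sed -i.bak -e 's|\.\./\.\./\.\./|../../../source/|g'

result=$(cat "$demo/a b/two.html")
printf '%s\n' "$result"
```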
| WizardClickBoy wrote:
| Thanks! Gippity did suggest the xargs approach as an
| alternative, but I found that
|
| find [...] -exec [...] {} +
|
| as opposed to
|
| find [...] -exec [...] {} \;
|
| worked fine and was performant enough for my use-case. An
| example command was
|
| find . -type f -name "*.html" -exec sed -i '' -e 's/\\.\\.\/\\.\\.\/\\.\\.\//\\.\\.\/\\.\\.\/\\.\\.\/source\//g' {} +
|
| which took about 20s to run
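|
| The difference between the two terminators can be seen on a
| throwaway directory (file names are hypothetical). Each invocation
| just reports how many file arguments it was handed:
|

```shell
demo=$(mktemp -d)
touch "$demo/a.html" "$demo/b.html" "$demo/c.html"

# With \; the command runs once per file, so each run sees one argument.
per_file=$(find "$demo" -name '*.html' -exec sh -c 'echo "$#"' sh {} \;)

# With + find batches the files into as few runs as possible.
batched=$(find "$demo" -name '*.html' -exec sh -c 'echo "$#"' sh {} +)

printf 'per file:\n%s\nbatched:\n%s\n' "$per_file" "$batched"
```

| Fewer process launches is where the speed-up over \; comes from.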
| mdaniel wrote:
| One can express your sed in less Leaning Toothpick
| Syndrome[1] via:
|
| find . -type f -name "*.html" -exec sed -i '' -e \
| 's|\.\./\.\./\.\./|../../../source/|g' {} +
|
| Using "/" as the delineation character for "s" patterns
| that include "/" drives me batshit - almost as much as
| scripts that use the doublequote for strings that contain
| no variables but also contain doublequotes (looking at
| you, json literals in awscli examples)
|
| If your sed is GNU, or otherwise sane, one can also `sed
| -Ee` and then use `s|\Q../../../|` getting rid of almost
| every escape character. I got you half way there because
| one need not escape the "." in the replacement pattern
| because "." isn't a meta character in the replacement
| space - what would that even mean?
|
| 1:
| https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome
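|
| A small check that the delimiter is only a matter of readability:
| both forms below perform the same substitution on a made-up sample
| line.
|

```shell
line='<a href="../../../img/logo.png">'

# "/" as delimiter: every slash in the pattern and replacement is escaped.
with_slash=$(printf '%s\n' "$line" |
  sed -e 's/\.\.\/\.\.\/\.\.\//..\/..\/..\/source\//g')

# "|" as delimiter: only the dots in the pattern still need escaping.
with_pipe=$(printf '%s\n' "$line" |
  sed -e 's|\.\./\.\./\.\./|../../../source/|g')

printf '%s\n%s\n' "$with_slash" "$with_pipe"
```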
| skydhash wrote:
| Parallel is nice when doing music conversion with ffmpeg.
| dools wrote:
| I needed some scripts to run a little "factory" for flashing an
| operating system onto some IoT devices. Lots of the work was
| running various shell commands but it is nonetheless something
| I would have traditionally written in PHP or Python but I
| thought "what the hell" and did the whole thing in bash with
| ChatGPT and it was a totally mind blowing experience.
|
| Now I use bash for all sorts of stuff. I've been working with
| *nix for 20 years but bash is so arcane and my needs always so
| immediate that I never did anything other than use it to run
| commands in sequence with maybe a $1 or a $2 in there
| godelski wrote:
| I've gotten into it recently but actually not because of LLMs.
| Actually I find them unhelpful here. The reason I've gotten
| into it is because I wanted to make a bunch of install scripts
| for programs I want on fresh boxes. Mostly it's been fun.
| Seeing what I can do with curl, sed, awk, regex, and bash
| scripting. I'm often finding that I can do a ton of things in a
| single line where I would have done a lot more if I wrote it in
| python or something else. Idk, there's just something very fun
| about this.
|
| Though what's been a little frustrating is that there's anti
| scraping measures and they break things. But they're always
| trivial to get around, so it's just annoying.
|
| A big reason LLMs end up failing is that I need my scripts to
| work on osx and nix machines. So it's always suggesting things
| to me that work on one but not the other. It seems to not want
| to listen to my constraints and grep is problematic for them in
| particular. Luckily man pages are great. I think they're often
| overlooked.
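|
| One common example of such a mismatch is in-place editing: BSD sed
| (macOS) requires a backup-suffix argument after -i, while GNU sed
| treats the suffix as optional. A rough sketch of a helper that avoids
| -i altogether; the name sed_inplace is made up for illustration:
|

```shell
# Hypothetical helper: apply a sed script to a file without using -i.
sed_inplace() {
  script=$1
  file=$2
  tmp=$(mktemp) || return 1
  # Write to a temp file first, then copy back so the original file
  # keeps its permissions and is left untouched if sed fails.
  if sed -e "$script" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

f=$(mktemp)
printf 'hello world\n' > "$f"
sed_inplace 's/world/there/' "$f"
edited=$(cat "$f")
printf '%s\n' "$edited"
```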
| asicsp wrote:
| If you are able to install specific implementations of the
| tools, go with GNU tools on all the machines. That way, you'd
| get more features and work the same everywhere.
|
| If that is not an option, go with Perl. It'd be a little
| slower, but you'll get consistent results. Plus, Perl has
| powerful regex, lots of standard libraries, etc.
| godelski wrote:
| Well the fun is, as I was trying to convey, building the
| tools automatically from fresh boxes. Sure, I can bootstrap
| my way by first installing gnu coreutils but if this was
| about doing things the easy way I'd just use the relevant
| package manager and ansible like everyone else
| keybored wrote:
| I resent this combination.
|
| - We never figured out how to package programs properly (Nix
| needs to become easier to use)
|
| - For all kinds of smaller tasks we practically need to use
| those Unix tools
|
| - Those everywhere tools are for hysterical raisins hard to use
| in a larger context (The Unix Philosophy in practice: use these
| five different tools but keep in mind that they are each
| different from each other across six dimensions and also they
| have defaults from the 70's or 80's)
|
| - For a lot of "simple" things you need to remember the simple
| thing plus eight comments (on the StackOverflow answer which
| has 166 votes but that's just because it was the first to
| answer the question) with nuance like "this won't work for your
| coworker on Mac"
|
| - So you don't: you go to SO (see previous) and use snippets
| (see first point: we don't know how to package programs, this
| is the best we got)
|
| - This works fine until Google Search decides that you are too
| reliant on it for it to have to work _well_
|
| - Now you don't use "random stuff from StackOverflow" which can
| at least have an audit trail: now you use random weights from
| your LLM in order to make "simple" solutions (six Unix tools in
| a small Bash script which you can't read because Bash is hard)
|
| This is pretty much the opposite of what inspired me when
| studying computer science and programming.
| skydhash wrote:
| > We never figured out how to package programs properly
|
| What's the issue with apt, pacman, and the others? I think
| they're doing their job fine.
|
| > For all kinds of smaller tasks we practically need to use
| those Unix tools
|
| I mean, they're good for what they do
|
| > Those everywhere tools are for hysterical raisins hard to
| use in a larger context
|
| Because each does a universal task you may want to do in the
| unix world of files and stream of texts.
|
| > For a lot of "simple" things you need to remember the
| simple thing plus eight comments
|
| No, you just need the manuals. And there are books too. And
| yes the difference between BSD and GNU is not obvious at
| first glance. But they're different software worked on by
| different people.
| leetrout wrote:
| Related, `sd` is a great utility worth the install which makes
| simple sed-type operations more obvious / easier (for some value
| of easy).
|
| https://github.com/chmln/sd
| oguz-ismail wrote:
| It uses a different syntax though. Hardly worth anyone's time
| Etheryte wrote:
| Not sure if I agree. Sed is widely known and much of the
| value comes from that, just being around for a long while,
| but I wouldn't really say that the syntax is all that
| straightforward. As a thought experiment, try explaining how
| to use sed to a fresh graduate who's never seen it. Not
| saying sd is better or anything, but rather that just because
| the syntax is different doesn't make it bad.
| oguz-ismail wrote:
| sed is widely known because it's available everywhere and
| is used in every shell script. I just don't see the point
| in learning a new utility that does the same thing as sed
| but with different syntax. In this case the new utility
| doesn't even honor my language settings and just errors out
| if I enter a non-English letter. It's ridiculous
| wolletd wrote:
| How? Shouldn't it just all be UTF-8? Or do you use a
| different encoding on your system?
| wolletd wrote:
| > try explaining how to use sed to a fresh graduate who's
| never seen it
|
| Well, for starters, you just `s/<regex>/<replacement>/` and
| try to use that in your everyday work. Just forget about
| the syntax. It's a search-and-replace tool.
|
| That's the only way I used sed for years. I've learned more
| since then, but it's still the command I use the most. And
| that's also what `sd` focuses on.
|
| Also, if you want to replace newlines, just use `tr`, to
| hook onto the examples of sd. It may seem annoying to use a
| different tool, but there are two major advantages: 1.
| you're learning about the existence, capabilities and
| limitations of more tools 2. both `sed` and `tr` are
| probably available in your next shitty embedded busybox-
| driven device, while `sd` probably is not
|
| As you said, the value comes from being around for a long
| time and, probably more importantly, still being present on
| nearly any Unix-like system.
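|
| Roughly what that looks like in practice, on made-up input: sed for
| the search-and-replace, tr when the thing being replaced is the
| newline itself.
|

```shell
# sed applies s/<regex>/<replacement>/ to each line in turn.
replaced=$(printf 'one\ntwo\nthree\n' | sed 's/t/T/')

# tr translates single characters, newlines included.
joined=$(printf 'one\ntwo\nthree\n' | tr '\n' ',')

printf '%s\n%s\n' "$replaced" "$joined"
```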
| ta1243 wrote:
| 99% of the time I use sed to mangle the output of a text
| file into something else.
|
| Earlier I did this
|
| cat as1 | grep " 65" | sed -e 's/.* 0 65/65/' -e 's/[^ 0-9]//' |sort|uniq
|
| Now some twat will come along and say my process should
| have been
|
| cat as1
| grep " 65" as1
| grep " 65" as1 | sed -e (various different tries to the data looks useful)
| grep " 65" as1 | sed -e (options) | sort|uniq
|
| Because otherwise it's a "useless use of cat" and
| reformatting my line is well worth the time and cognitive
| load to save those extra forks.
| Etheryte wrote:
| I think the concept of useless use of cat is one of the
| few things I strongly disagree with in software
| development. Most things have their trade-offs, pros and
| cons, but using cat to start a pipe makes everything
| composable and easy to work with, it's pretty much
| universally good. The moment you drop it because of the
| small redundancy, you have to make sure you don't mess up
| the params for whatever comes next, and that overhead is
| in my opinion never worth what you gain by dropping cat.
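|
| For reference, the three ways of starting such a pipe give the same
| result; the difference is only in how easy the line is to edit. The
| sample data is invented:
|

```shell
data=$(mktemp)
printf 'x 0 65001\ny 0 65002\nz 0 64999\nx 0 65001\n' > "$data"

# cat first: every later stage can be edited without touching the input.
with_cat=$(cat "$data" | grep ' 65' | sort | uniq)

# File as an argument: one process fewer, but it depends on grep's syntax.
with_arg=$(grep ' 65' "$data" | sort | uniq)

# Redirection first: no cat, and the input still reads left to right.
with_redirect=$(< "$data" grep ' 65' | sort | uniq)

printf '%s\n' "$with_cat"
```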
| GolDDranks wrote:
| sd has very much proven to be worth of my time. It's both
| faster and way easier to use.
| keybored wrote:
| Middle-brow dismissal. Hardly worth anyone's consideration.
|
| Just go straight to the point that this isn't available on a
| proprietary Unix that had its EOL fifteen years ago and that
| five people still use.
| oguz-ismail wrote:
| >this isn't available on a proprietary Unix
|
| Skill issue. It's not necessary in the first place anyway
| ReleaseCandidat wrote:
| As soon as there is a _complete_ regex reference in the readme,
| it may be worth a try. The main problem with _any_ regex tool
| or programming language or ... is the subtle and not so subtle
| differences between the various regex implementations - like
| the "normal" and "extended" mode of sed.
|
| This phrase:
|
| sd uses regex syntax that you already know from JavaScript and
| Python.
|
| says it all.
|
| I still haven't found a better short overview of various regex
| engines than that:
| https://web.archive.org/web/20130830063653/http://www.regula...
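|
| A tiny illustration of the two sed modes on a made-up input line.
| It assumes a sed that accepts -E for extended syntax, as GNU and BSD
| sed do:
|

```shell
# Basic syntax (the default): groups are written \( \), and a bare
# ( would match a literal parenthesis.
bre=$(printf 'key=value\n' | sed 's/\([a-z]*\)=\([a-z]*\)/\2=\1/')

# Extended syntax (-E): groups are plain ( ), and + is available.
ere=$(printf 'key=value\n' | sed -E 's/([a-z]+)=([a-z]+)/\2=\1/')

printf '%s\n%s\n' "$bre" "$ere"
```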
| ptman wrote:
| Indeed. It's different from Python, maybe JavaScript as well.
| https://docs.rs/regex/latest/regex/#syntax
| gregwebs wrote:
| There's also sad that lets you review find and replace changes
| to files before making them: https://github.com/ms-jpq/sad
| aidos wrote:
| sed, awk, grep and friends are just so effective at trawling
| through text.
|
| I dump about 150GB of Postgres logs a day (I know, it's over the
| top but I only keep a few days worth and there have been several
| occasions where I was saved by being able to pick through them).
|
| At that size you even need to give up on grepping, really. I've
| written a tiny bash script that uses the fact that log lines
| start with a timestamp and `dd` for immediate extraction. This
| allows me to quickly binary search for the location I'm
| interested in.
|
| Then I can `dd` to dump the region of the file I want. After that
| I have a little awk script that lets me collapse the sql lines
| (since they break across multiple lines) to make grepping really
| easy.
|
| All in all it's a handful of old school scripts that make an
| almost impossible task easy.
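|
| A rough sketch of the line-collapsing idea, not the actual script:
| it assumes every new log entry starts with a YYYY-MM-DD timestamp
| and treats any other line as a continuation. The sample log lines
| are invented.
|

```shell
collapsed=$(printf '%s\n' \
  '2024-09-05 10:00:00 LOG: statement: SELECT id' \
  '    FROM users' \
  '    WHERE id = 1;' \
  '2024-09-05 10:00:01 LOG: duration: 2 ms' |
  awk '
    # A line starting with a date begins a new entry: flush the old one.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      if (buf != "") print buf
      buf = $0
      next
    }
    # Anything else is a continuation: trim the indent and append it.
    { sub(/^[ \t]+/, ""); buf = buf " " $0 }
    END { if (buf != "") print buf }
  ')

printf '%s\n' "$collapsed"
```

| With each statement on one line, a plain grep finds whole queries.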
| porridgeraisin wrote:
| Can you explain how you used dd here? I've never seen it used
| this way, curious
| fwip wrote:
| dd lets you specify an offset to start reading the file at,
| with `skip`. This would let you perform a binary search by
| picking an offset in the file, reading a small chunk (say, a
| kilobyte), and scanning for the date/time string within it.
| Each read should be O(1) in terms of the size of the file, so
| the binary search is O(log(n)) overall, whereas a grep-based
| approach is O(n).
|
| (The datetime in the log message is presumably sorted, or
| nearly so).
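|
| A simplified sketch of that search, not the script from the thread.
| It assumes a sorted log whose lines start with a 19-character
| timestamp, builds a small demo file, and uses bs=1 so that skip
| counts bytes. The helper name ts_at is made up.
|

```shell
# Demo log: 1000 sorted lines, one per second (invented format).
log=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
  printf '2024-09-05 %02d:%02d:%02d event %d\n' \
    $((i / 3600)) $((i / 60 % 60)) $((i % 60)) "$i"
  i=$((i + 1))
done > "$log"

# Timestamp of the first complete line after byte offset $1.
# The first line of the chunk may be cut in half, so it is skipped.
ts_at() {
  dd if="$log" bs=1 skip="$1" count=256 2>/dev/null |
    awk 'NR == 2 { print substr($0, 1, 19); exit }'
}

target='2024-09-05 00:10:00'
lo=0
hi=$(($(wc -c < "$log")))

# Halve the byte range until it is no bigger than one chunk.
while [ $((hi - lo)) -gt 256 ]; do
  mid=$(((lo + hi) / 2))
  if expr "$(ts_at "$mid")" \< "$target" > /dev/null; then
    lo=$mid
  else
    hi=$mid
  fi
done

# Read a small region from the final offset and pick the first match.
found=$(dd if="$log" bs=1 skip="$lo" count=512 2>/dev/null |
  awk -v t="$target" 'NR > 1 && substr($0, 1, 19) >= t { print; exit }')

printf '%s\n' "$found"
```

| Each probe reads 256 bytes no matter how large the file is, so only
| the number of probes grows with file size.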
| aidos wrote:
| Sure! I've created a gist so you can see for yourself but the
| basic idea is as described. Read a chunk, find the first date
| in it and then decide if you want to read further forward or
| back in the file.
|
| https://gist.github.com/aidos/5a6a3fa887f41f156b282d72e1b79f...
|
| For anyone else, here's the awk for combining lines in the
| log files for making them greppable too:
| https://gist.github.com/aidos/44a9dfce3c16626e9e7834a83aed91...
| qwertox wrote:
| > Why sed??
|
| > Sed is the perfect programming language, especially for graph
| problems. It's plain and simple and doesn't clutter your screen
| with useless identifiers like if, for, while, or int. Furthermore
| since it doesn't have things like numbers, it's very simple to
| use.
|
| "useless identifiers like if, for, while, or int"? Useless
| identifiers?
| ReleaseCandidat wrote:
| That's about as serious as
|
| Some of the notable features include:
| Preview variable values, both of them!
| ...
| Its name is a palindrome
| 082349872349872 wrote:
| To be fair, IBM actually had a commercial product that was
| simpler (cheaper) because it didn't have things like numbers:
| https://en.wikipedia.org/wiki/IBM_1620#Transferred_to_San_Jo...
| russfink wrote:
| For me, the question of why is because it's already installed
| in the environment and available on every UNIX system I have
| used. This is a case of conforming myself to the tool, rather
| than the other way around. If you are of a certain vintage like
| I am, you got used to doing these things early on because we
| could not just apt install foo on our platforms anytime we
| needed something.
|
| I do not mean to sound like "kids these days... " I really like
| these modern systems that allow you to install a wide range of
| packages. It is a huge step forward. I just want to explain my
| perspective, perhaps others share that perspective. It probably
| also explains why such tools continue to exist.
| JoelJacobson wrote:
| I wish there was a similar tool for relational algebraic
| expressions, to make relational database research papers more
| accessible.
| sylware wrote:
| I am done with regular expressions languages and engines. Each
| time I wanted to do a not so trivial usage of it, I had to re-
| learn the language(s) and debug it, not to mention the editing
| operations on top of them (sed...).
|
| This has been quite annoying. So now I code it in C or assembly,
| using common-case code templates and ready build scripts to
| have a comfortable dev loop.
|
| In the end, I get roughly the same results and I don't need those
| regular expressions languages and engines.
|
| It is a clear win in that case.
| mlegendre wrote:
| Amusingly, in French, "desed" sounds like "decede", which means
| die / decease. That's quite a fitting name for a tool one would
| use in "I need to debug a sed script" situations!
| 082349872349872 wrote:
| `sed` in latin is often used to contrast two things, "not this,
| but that", eg
|
| _Amicitia non semper intellegitur sed sentitur._ (Friendship
| is not always understood, but it is felt.)
|
| which I'm always reminded of when using sed(1) in a script to
| provide, not this pattern, but that replacement.
| russfink wrote:
| No Debian (Ubuntu, Mint and friends) version?
| trey-jones wrote:
| Once in HN comments I saw `sed` referred to as a one-way hashing
| function, and that's always stuck with me - not just for sed, but
| for any type of operation that ends up being sort of a "black
| box". Input becomes output reliably, but it's hell to understand
| how. My big take away was: These types of operations are OK, when
| necessary, but it's a good idea to take the time to write some
| comments/documentation so the next person who looks at it
| (including self) has somewhere to start.
|
| That said, debugging is definitely a thing, and tools like this
| are awesome!
| mifydev wrote:
| Oh, I definitely need to run this one on
| https://github.com/chebykinn/sedmario
| ok123456 wrote:
| This is built into perl:
|
| perl -MO=Deparse -w -naF: -le 'print $F[2]'
| tqwhite wrote:
| IMPOSSIBLE!!! God made sed as a test for humans to prove their
| humility. It is intrinsically mysterious.
___________________________________________________________________
(page generated 2024-09-05 23:01 UTC)