[HN Gopher] Unix core utilities implemented in Haskell
___________________________________________________________________
Unix core utilities implemented in Haskell
Author : ocean_moist
Score : 178 points
Date : 2024-10-30 06:07 UTC (4 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| anardil wrote:
| Wow, hello! This is my repository. I'm happy to answer any
| questions.
| aeonik wrote:
| You specify "fast", can you elaborate on the performance of the
| collection? How does it compare to the standard core utils?
|
| Great work, looks amazing.
| anardil wrote:
| Performance (execution, memory) is generally in the same
| ballpark as the BSD versions, with some caveats specific to
| utils that do lots of in place data manipulation.
|
| cut comes to mind as an example, slicing and dicing lines
| into fields quickly without a ton of copies isn't easy. Using
| Streaming.ByteString generally makes a huge difference, but
| it's extremely difficult to use unless you get can your mind
| to meld with the types it wants. Picking it up again months
| later takes some serious effort.
| faragon wrote:
| Very beautiful implementation of the awk interpreter in less
| than 600 lines!
|
| https://github.com/Gandalf-/coreutils/blob/master/Coreutils/...
| anardil wrote:
| Thank you! This is one of my favorites. User declared
| variables are next on the todo list, when I get back to it.
| OskarS wrote:
| It is really gorgeously written Haskell. I've only dabbled
| in Haskell, but you're really shetting my appetite for
| digging in deeper.
| cosmic_quanta wrote:
| Could you speak to the advantages of Haskell's lazy IO? I only
| hear about its disadvantages usually
| habitue wrote:
| I imagine for streaming tools like these it's pretty
| convenient. You don't have to manage buffers etc, just write
| code against a massive string and haskell takes care of
| streaming it for you and pulling in more data when needed.
|
| There are libraries that handle it, but they probably have
| weird types, you can just use functions in the prelude to
| write a lot of these basic utilities.
| anardil wrote:
| It definitely has some sharp edges. One advantage is skipping
| computations (and the IO they'd need) that don't end up
| getting used, which let's you do some clever looking things/
| ignore some details. That's hard to take advantage of in
| practice, I think.
|
| The other advantage is just deferring IO. For instance in
| split or tee, you could decide that you need 500 output files
| and open all the handles together in order to pass them to
| another function that will consume them. I'd squint at
| someone who wrote `void process_fds(int fds[500]);`, but here
| it doesn't matter.
| mrkeen wrote:
| If your language doesn't give you laziness, you're
| reinventing it yourself with strict primitives each time.
| vacuity wrote:
| On the other hand, when you don't want laziness you really
| won't like if it's present anyways.
| weebull wrote:
| Lazy to strict is reasonably easy to do though. The
| problem is normally that once one bit goes strict, most
| other things implicitly do too.
|
| Strict to lazy is normally a rewrite.
| weebull wrote:
| Lazy to strict is reasonably easy to do though. The
| problem is normally that once one bit goes strict, most
| other things implicitly do too.
|
| Strict to lazy is normally a rewrite.
| bts wrote:
| Hi! A few years ago I found myself wanting an equivalent of
| `column` that didn't strip color codes. After I implemented it
| in Haskell, I found it was useful to use Nix to force
| statically linking against libraries like gmp to reduce startup
| time. Perhaps what I ended up doing might be helpful for you
| too: https://github.com/bts/columnate/blob/master/default.nix
| anardil wrote:
| Thank you for the suggestion, I'll give this a whirl! I've
| fussed around with `--ghc-options '-optl-static -fPIC'` and
| the like in years past without success.
| Vosporos wrote:
| Fantastic work, thank you so much!
| wizerno wrote:
| The author has a few blog posts covering the tee, split, which,
| and cat commands [1].
|
| [1] https://www.anardil.net/tag/coreutils.html
| chasil wrote:
| "On Windows - Symlinking doesn't appear to change the name
| reported by System.Environment.getProgName, so you'll need to
| create copies of the binary with different names."
|
| I have not found this to be the case with "mklink /h".
| zanie wrote:
| That'd be a hard link though
| eviks wrote:
| > Symlinking doesn't appear to change the name reported by
| System.Environment.getProgName, so you'll need to create copies
| of the binary with different names
|
| Would hardlinks help (would avoid multiple copies)?
| anardil wrote:
| I'll have to give this a try, thank you for the suggestion!
| Waterluvian wrote:
| I'm curious if there are any that were especially uncomfortable
| to implement in Haskell.
|
| I'm a _very very_ novice "functional programmer" but on occasion
| I find problems that feel just ridiculous to implement FP style.
| sfn42 wrote:
| I'm not much of a functional programmer either, but generally
| in the beginning you'll feel this way because you're not used
| to thinking about problems in a functional way. You're used to
| solving problems with imperative structures. So you try to
| implement that in a functional language and it turns out
| completely stupid because the language isn't made to solve
| problems that way. But it usually has good ways to solve
| things.
| galangalalgol wrote:
| I got stuck trying to have an intuition about what would
| explode my memory usage due to lazy eval, I never got a grip
| on the transition from pseudofunctional at the io boundaries
| into real functional internally either.
| anonzzzies wrote:
| I like fp and array programming and logic programming (and
| the overlap) to solve problems that are (considered or for
| real; I don't usually know going in) really a bad fit for
| those and then implement them shorter and faster than in
| their imperative implementation. For some things it does not
| work obviously, but it's a really nice exercise for when in
| the gym, walking, cooking, driving and such; the performant
| or LoC optimising parts fit in my head so I can do the
| rewrites there and find ways to make them fit. For instance;
| going from an imperative piece of code to array programming
| (k/j), you have to find a way to translate the problem to a
| vector/matrix manipulation problem and that you can do while
| doing something else entirely. When you find a good one fit,
| it can be many, many times faster than the original just like
| that.
| anardil wrote:
| Definitely. Depending on how long you've spent staring at the
| contents of /bin/ and /usr/bin/ you'll notice there are
| definitely some array or matrix oriented utils (or options)
| missing like column.
|
| cut comes to mind as a difficult one. In C, you can just hop
| around the char buffer[] and drop nulls in place for fields,
| etc before printing. You _could_ go that way a Data.Array Char,
| but that 's hard to justify as functional.
| Muromec wrote:
| Shouldn't cut but the easiest one to do in functional style?
|
| You basicaly map one line of a stream to another with some
| filtering and joining. Do I miss the part where it's terribly
| slow and/or not doable in Haskell or something?
| amelius wrote:
| It would be cool if RFCs would be turned into Haskell, so we
| could use __that__ as the specification instead.
|
| Same for all the remaining web standards.
| fuzztester wrote:
| executable specs?
| johnisgood wrote:
| Opinion: I would rather prefer OCaml or just C (instead of say,
| Rust or Python).
| bionade24 wrote:
| C is way too ambiguous to serve for that purpose.
| johnisgood wrote:
| Are we referring to reference implementations? I have
| always found it easier to understand the C code, as Rust
| and Python tend to obscure too many details behind high-
| level abstractions.
|
| C is simple enough to know what is going on, IMO.
|
| For example: numbers = [1, 2, 3, 4, 5]
| squared_numbers = [num * num for num in numbers]
| print(squared_numbers)
|
| vs. fn main() { let numbers =
| vec![1, 2, 3, 4, 5]; let squared_numbers:
| Vec<i32> = numbers.iter().map(|&num| num * num).collect();
| println!("{:?}", squared_numbers); }
|
| vs. #include <stdio.h> int
| main() { int numbers[] = {1, 2, 3, 4, 5};
| int squared_numbers[5]; // Squaring
| each number for (int i = 0; i < 5; i++) {
| squared_numbers[i] = numbers[i] * numbers[i]; }
| // Printing the squared numbers printf("[");
| for (int i = 0; i < 5; i++) { printf("%d",
| squared_numbers[i]); if (i < 4) printf(",
| "); } printf("]\n");
| return 0; }
|
| Python and Rust here seem concise and elegant, but can be
| more difficult to follow to the unaccustomed due to
| obscurity. I am sure there are examples that involve more
| higher-level abstractions, and reference implementations
| typically seem to use a lot of said abstractions. In case
| of this Rust code, abstractions (like iterators and
| closures) can also make the code more challenging to follow
| for those who are not accustomed to functional programming
| paradigms.
|
| What I am trying to say is that the C implementation is
| more straightforward, IMO.
| yazzku wrote:
| [num**2 for num in numbers]
|
| There's no way that C implementation is easier to read
| than the Python one. There's no high-level abstraction
| going on there and the list comprehension reads like
| natural language.
|
| I agree Rust often gets distracting with unwrapping and
| lifetime notation.
| johnisgood wrote:
| I know it is easier to read, and it is not difficult to
| figure out what is going on (in this case), but many
| languages do not support list comprehensions (for
| example), so you have to implement it in a way that is
| similar to the C version, making the C version more
| helpful.
|
| The example above may not highlight what I was trying to,
| however.
|
| Perhaps Ada could work even better than C? It is verbose,
| yes, but you will immediately know what is going on.
|
| My point is, that in my opinion verbosity is better than
| conciseness in cases of reference implementations.
| SkiFire13 wrote:
| > as Rust and Python tend to obscure too many details
| behind high-level abstractions.
|
| C on the other hand forces handling too many low level
| details for a reference implementation IMO. Rust also
| isn't really good in that regard.
|
| In a reference implementation what needs to be made
| explicit is the logic that needs to be implemented,
| rather than details like memory management. For that I
| would prefer something like Datalog instead.
| johnisgood wrote:
| > what needs to be made explicit is the logic that needs
| to be implemented, rather than details like memory
| management
|
| I agree.
| trealira wrote:
| People who write C nowadays generally try to write it in
| the most "boring" and easy-to-read way possible, I think.
| But C can also be hard to understand if it uses a lot of
| side-effectful expressions and pointer arithmetic. Just
| look at "How Old School C Programmers Process Arguments",
| which looks over an excerpt from K&R [0]. To make your
| code slightly harder to read, while still being
| plausible, you could change it to use pointer arithmetic,
| like so: // Printing the squared numbers
| printf("["); for (int n = 5, *p = squared_numbers;
| --n >= 0; p++) printf("%d%s", *p, n > 0 ? ", "
| : "]\n");
|
| [0]: https://www.usrsb.in/How-Old-School-C-Programmers-
| Process-Ar...
| amelius wrote:
| I personally think Haskell is a better choice because it is
| purely functional.
| tightbookkeeper wrote:
| Awesome work!
|
| I noticed that they are about as long and complex as the C
| versions. In early C++/STL days part of the pitch was
| reimplementing core utils in a much more concise way. In wonder
| if the is a result of focusing on that application, or a
| coincidence of design decisions.
| mytec wrote:
| I'm used to reading about a given C/C++ program being implemented
| in Rust, and was delighted to see such an effort in a functional
| programming language.
|
| I know little about functional programming languages but I've
| always been interested in how languages like Ada and now Rust can
| help programmers write safer code. I'm curious what advantages a
| rewrite of a C/C++ app in a FP language provides and also what
| advantages a FP language brings in comparison to a language like
| Rust.
| Maro wrote:
| I looked at the wc implementation, there's no way it's compatible
| with the GNU tools, how it handles wide characters, which is
| heavily ties to the idiosyncratic C implementation. I know
| because I once wrote a wc implementation in modern C++ for my own
| education, and that's where most of my time and code complexity
| went. Also, the C versions are devilishly fast, with all flags,
| with optimized code branches.
|
| https://bytepawn.com/tag/wc.html
| galangalalgol wrote:
| Poking at the rust coreutils project it seems like they all
| faster than the c versions. The issue is the decades of
| features that have accumulated that someone somewhere uses but
| seem safe to disregard because so few do.
| Maro wrote:
| There are millions of lines of shell scripts relying on those
| flags.
| chongli wrote:
| I would place the blame for that on the GNU coreutils
| project. At best, it's feature creep. At worst, it's
| embrace and extend on the POSIX standard.
| erik_seaberg wrote:
| POSIX doesn't forbid new flags or options. It's up to the
| author to read the spec and test his portability, or else
| willingly rely on certain distros. Some GNU tools have
| had strict modes as a courtesy (they used to jokingly
| call it POSIX_ME_HARDER).
| prmoustache wrote:
| And those scripts should check they are indeed calling the
| gnu coreutils version before they proceed.
| Arcuru wrote:
| They're faster? They must have been working on performance
| then, maybe I should take another look at how that project
| has been maturing.
|
| When I last looked 'dd' was significantly slower, though I
| did make it a bit closer a while back -
| https://jackson.dev/post/rust-coreutils-dd/
|
| A lot of the Rust coreutils were written by people learning
| the language, and the code is often _more_ complicated than
| the corresponding code for the Unix utils. They don't seem to
| get enough experienced programmers fixing them up or usage
| for me to actually recommend anybody use them, but maybe
| that's changing.
| mustache_kimono wrote:
| > When I last looked 'dd' was significantly slower, though
| I did make it a bit closer a while back -
| https://jackson.dev/post/rust-coreutils-dd/
|
| I read your blog. Maybe you should take a look at some of
| the other utils. I worked on sort and ls and both have long
| been faster than their GNU equivalents.
|
| > They don't seem to get enough experienced programmers
| fixing them up or usage for me to actually recommend
| anybody use them, but maybe that's changing.
|
| The issue is obviously compatibility and the uutils won't
| be broadly "usable" or stable until is hits 100% PASS/0%
| FAIL on the GNU test coverage, although the numbers are
| getting better and better. See:
| https://raw.githubusercontent.com/uutils/coreutils-
| tracking/...
|
| > A lot of the Rust coreutils were written by people
| learning the language,
|
| I really, really hate this way of framing the project --
| which implicitly sounds like GNU is full of pros. Yes, lots
| of code was written by people just learning Rust and
| systems programming. Once it reaches parity on GNU test
| coverage, which it is rapidly approaching, GNU coreutils
| would wish it had this kind of enthusiasm behind it.
|
| > and the code is often _more_ complicated than the
| corresponding code for the Unix utils.
|
| I think it really depends. Much of the code is much
| simpler. For instance -- I think it's much easier to
| generically do hashing in Rust, and it shows.
| diath wrote:
| ...and that's precisely why they are faster, if you can throw
| out decades of compatibility flags you can avoid expensive
| branching and such to make them faster. That's why such
| projects should be called alternatives, and not replacements,
| it's not a replacement if it cannot be seamlessly replaced
| and decades worth of scripts can continue working.
| 7bit wrote:
| When is it the time to talk about alternatives replacing
| the current apps in favor of faster apps? 10 years or rare
| usage? Twenty? At some point the edge cases should be
| covered by the scripts not by the app, only to make it
| slower for everyone.
|
| I get what you're saying, but I don't care how it's called.
| Some things must die. Eg. Python 3 and the depreciation of
| was a very controversial step, but ultimately the right
| choice at the right time.
| kanbankaren wrote:
| > When is it the time to talk about alternatives
|
| When all standards bodies and governments decide that
| POSIX is a hindrance which might take a few decades. And
| that is if they decide.
| galangalalgol wrote:
| The FBI just said everyone needs a roadmap by 2026 to
| eliminate c/c++ from critical infrastructure.
| cozzyd wrote:
| Python 3 created a huge amount of unnecessary effort and
| made a lot of people hesitant about using python again
| andrepd wrote:
| Well it's one of the most popular languages in the planet
| so it cannot have gone that bad.
| jhayward wrote:
| Not a mainstream view.
|
| Python 3 is a hugely successful language and
| implementation, and almost no one regrets that it exists
| aside from a few noisy holdouts, and people who never
| liked any Python anywhere at any time.
| cozzyd wrote:
| I don't disagree that python3 is hugely successful, but
| that doesn't mean the Python 2->3 pain was necessary.
| Certainly many Python users started using it too recently
| to remember it though.
| erik_seaberg wrote:
| python3 had no compatibility mode, so everyone needed
| 100% of their dependencies to migrate. This was so
| painful that some teams abandoned their legacy python2
| code and reimplemented in languages with better back
| compat stories.
| wrs wrote:
| No problem. Don't branch in the middle of the algorithm,
| choose the appropriate optimized algorithm based on the
| task. That's why we haven't had a separate fgrep binary for
| decades.
| johnisgood wrote:
| Faster? I do not know, what I do know is that there has been
| some really crazy vulnerabilities in one (or some) of these
| utilities (logic errors).
| anardil wrote:
| You're right! Data.ByteString.Lazy is Word8 under the covers,
| so wide characters are truncated. tr takes a similar short cut.
| Swapping to Data.Text would fix that.
|
| Where simplicity conflicted with compatibility, I've chosen the
| former so far. Targeting the BSD options and behavior is
| another example of that. The primary goal is to feel out the
| data flow for each utility, rather than dig into all the edges.
| johnisgood wrote:
| Speaking of fast: https://redman.xyz/git/turbowc/tree/turbowc.c
| seems to be really fast (it is not a replacement for wc,
| however).
| MBCook wrote:
| The readme mentions feature parity with the BSD coreutils, so
| I'm assuming that's the target. Not GNU.
| parasti wrote:
| Ive seen several coreutils reimplementations just using the name
| "coreutils" for themselves. I doubt anyone in the original
| project cares, but seems in bad taste to me.
| bqmjjx0kac wrote:
| I mean, it gets the point across and they're not claiming to be
| "GNU coreutils". Maybe some derivative name would be better for
| uniqueness.
| xolve wrote:
| TIL that GitHub usernames can end with a `-`
| anardil wrote:
| The rules were different in 2014 when I made my account! It's
| actually quite annoying because lots of 3rd party GitHub
| integrations puke immediately saying I have an invalid
| username.
| hello_computer wrote:
| The original coreutils are very weak on wide-char formatting,
| item delimiting (i.e. -null / -print0), flag name consistency,
| discoverability, and overall input/output uniformity. Solving the
| edge cases rigorously would probably require minor breakages, but
| would be worthwhile.
|
| Most coreutil re-workings i've seen either double-down on JSON
| output, 24-bit color, sixels, & other lipsticks on the pig--
| without addressing any of the basic breakages we've become
| accustomed to--or they go off into an entirely different
| direction like nushell.
|
| There is definitely room for improvement, but it is a thankless
| job.
| Muromec wrote:
| Why not have bug to bug compatible vetsion of coreutils, but in
| rust and then have a nice thing too?
| hello_computer wrote:
| Most are simple programs. If bug-per-bug compat + memory
| safety is the goal, carefully reviewing the C would have been
| a vastly better investment of time compared to making 13,498
| commits.
|
| https://www.cvedetails.com/product/5075/GNU-
| Coreutils.html?v...
|
| https://github.com/uutils/coreutils
| mhd wrote:
| This reminds me of a similar attempt with Perl[0], done a few...
| erm, twenty years ago. Might be interesting to do a three-way
| comparison for a few well-covered cases.
|
| [0]: https://metacpan.org/release/CWEST/ppt-0.14
___________________________________________________________________
(page generated 2024-11-03 23:00 UTC)