[HN Gopher] Unix core utilities implemented in Haskell
       ___________________________________________________________________
        
       Unix core utilities implemented in Haskell
        
       Author : ocean_moist
       Score  : 178 points
       Date   : 2024-10-30 06:07 UTC (4 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | anardil wrote:
       | Wow, hello! This is my repository. I'm happy to answer any
       | questions.
        
         | aeonik wrote:
         | You specify "fast", can you elaborate on the performance of the
         | collection? How does it compare to the standard core utils?
         | 
         | Great work, looks amazing.
        
           | anardil wrote:
           | Performance (execution, memory) is generally in the same
           | ballpark as the BSD versions, with some caveats specific to
           | utils that do lots of in place data manipulation.
           | 
           | cut comes to mind as an example, slicing and dicing lines
           | into fields quickly without a ton of copies isn't easy. Using
           | Streaming.ByteString generally makes a huge difference, but
           | it's extremely difficult to use unless you get can your mind
           | to meld with the types it wants. Picking it up again months
           | later takes some serious effort.
        
         | faragon wrote:
         | Very beautiful implementation of the awk interpreter in less
         | than 600 lines!
         | 
         | https://github.com/Gandalf-/coreutils/blob/master/Coreutils/...
        
           | anardil wrote:
           | Thank you! This is one of my favorites. User declared
           | variables are next on the todo list, when I get back to it.
        
             | OskarS wrote:
             | It is really gorgeously written Haskell. I've only dabbled
             | in Haskell, but you're really shetting my appetite for
             | digging in deeper.
        
         | cosmic_quanta wrote:
         | Could you speak to the advantages of Haskell's lazy IO? I only
         | hear about its disadvantages usually
        
           | habitue wrote:
           | I imagine for streaming tools like these it's pretty
           | convenient. You don't have to manage buffers etc, just write
           | code against a massive string and haskell takes care of
           | streaming it for you and pulling in more data when needed.
           | 
           | There are libraries that handle it, but they probably have
           | weird types, you can just use functions in the prelude to
           | write a lot of these basic utilities.
        
           | anardil wrote:
           | It definitely has some sharp edges. One advantage is skipping
           | computations (and the IO they'd need) that don't end up
           | getting used, which let's you do some clever looking things/
           | ignore some details. That's hard to take advantage of in
           | practice, I think.
           | 
           | The other advantage is just deferring IO. For instance in
           | split or tee, you could decide that you need 500 output files
           | and open all the handles together in order to pass them to
           | another function that will consume them. I'd squint at
           | someone who wrote `void process_fds(int fds[500]);`, but here
           | it doesn't matter.
        
           | mrkeen wrote:
           | If your language doesn't give you laziness, you're
           | reinventing it yourself with strict primitives each time.
        
             | vacuity wrote:
             | On the other hand, when you don't want laziness you really
             | won't like if it's present anyways.
        
               | weebull wrote:
               | Lazy to strict is reasonably easy to do though. The
               | problem is normally that once one bit goes strict, most
               | other things implicitly do too.
               | 
               | Strict to lazy is normally a rewrite.
        
               | weebull wrote:
               | Lazy to strict is reasonably easy to do though. The
               | problem is normally that once one bit goes strict, most
               | other things implicitly do too.
               | 
               | Strict to lazy is normally a rewrite.
        
         | bts wrote:
         | Hi! A few years ago I found myself wanting an equivalent of
         | `column` that didn't strip color codes. After I implemented it
         | in Haskell, I found it was useful to use Nix to force
         | statically linking against libraries like gmp to reduce startup
         | time. Perhaps what I ended up doing might be helpful for you
         | too: https://github.com/bts/columnate/blob/master/default.nix
        
           | anardil wrote:
           | Thank you for the suggestion, I'll give this a whirl! I've
           | fussed around with `--ghc-options '-optl-static -fPIC'` and
           | the like in years past without success.
        
         | Vosporos wrote:
         | Fantastic work, thank you so much!
        
       | wizerno wrote:
       | The author has a few blog posts covering the tee, split, which,
       | and cat commands [1].
       | 
       | [1] https://www.anardil.net/tag/coreutils.html
        
       | chasil wrote:
       | "On Windows - Symlinking doesn't appear to change the name
       | reported by System.Environment.getProgName, so you'll need to
       | create copies of the binary with different names."
       | 
       | I have not found this to be the case with "mklink /h".
        
         | zanie wrote:
         | That'd be a hard link though
        
       | eviks wrote:
       | > Symlinking doesn't appear to change the name reported by
       | System.Environment.getProgName, so you'll need to create copies
       | of the binary with different names
       | 
       | Would hardlinks help (would avoid multiple copies)?
        
         | anardil wrote:
         | I'll have to give this a try, thank you for the suggestion!
        
       | Waterluvian wrote:
       | I'm curious if there are any that were especially uncomfortable
       | to implement in Haskell.
       | 
       | I'm a _very very_ novice "functional programmer" but on occasion
       | I find problems that feel just ridiculous to implement FP style.
        
         | sfn42 wrote:
         | I'm not much of a functional programmer either, but generally
         | in the beginning you'll feel this way because you're not used
         | to thinking about problems in a functional way. You're used to
         | solving problems with imperative structures. So you try to
         | implement that in a functional language and it turns out
         | completely stupid because the language isn't made to solve
         | problems that way. But it usually has good ways to solve
         | things.
        
           | galangalalgol wrote:
           | I got stuck trying to have an intuition about what would
           | explode my memory usage due to lazy eval, I never got a grip
           | on the transition from pseudofunctional at the io boundaries
           | into real functional internally either.
        
           | anonzzzies wrote:
           | I like fp and array programming and logic programming (and
           | the overlap) to solve problems that are (considered or for
           | real; I don't usually know going in) really a bad fit for
           | those and then implement them shorter and faster than in
           | their imperative implementation. For some things it does not
           | work obviously, but it's a really nice exercise for when in
           | the gym, walking, cooking, driving and such; the performant
           | or LoC optimising parts fit in my head so I can do the
           | rewrites there and find ways to make them fit. For instance;
           | going from an imperative piece of code to array programming
           | (k/j), you have to find a way to translate the problem to a
           | vector/matrix manipulation problem and that you can do while
           | doing something else entirely. When you find a good one fit,
           | it can be many, many times faster than the original just like
           | that.
        
         | anardil wrote:
         | Definitely. Depending on how long you've spent staring at the
         | contents of /bin/ and /usr/bin/ you'll notice there are
         | definitely some array or matrix oriented utils (or options)
         | missing like column.
         | 
         | cut comes to mind as a difficult one. In C, you can just hop
         | around the char buffer[] and drop nulls in place for fields,
         | etc before printing. You _could_ go that way a Data.Array Char,
         | but that 's hard to justify as functional.
        
           | Muromec wrote:
           | Shouldn't cut but the easiest one to do in functional style?
           | 
           | You basicaly map one line of a stream to another with some
           | filtering and joining. Do I miss the part where it's terribly
           | slow and/or not doable in Haskell or something?
        
       | amelius wrote:
       | It would be cool if RFCs would be turned into Haskell, so we
       | could use __that__ as the specification instead.
       | 
       | Same for all the remaining web standards.
        
         | fuzztester wrote:
         | executable specs?
        
         | johnisgood wrote:
         | Opinion: I would rather prefer OCaml or just C (instead of say,
         | Rust or Python).
        
           | bionade24 wrote:
           | C is way too ambiguous to serve for that purpose.
        
             | johnisgood wrote:
             | Are we referring to reference implementations? I have
             | always found it easier to understand the C code, as Rust
             | and Python tend to obscure too many details behind high-
             | level abstractions.
             | 
             | C is simple enough to know what is going on, IMO.
             | 
             | For example:                   numbers = [1, 2, 3, 4, 5]
             | squared_numbers = [num * num for num in numbers]
             | print(squared_numbers)
             | 
             | vs.                   fn main() {             let numbers =
             | vec![1, 2, 3, 4, 5];             let squared_numbers:
             | Vec<i32> = numbers.iter().map(|&num| num * num).collect();
             | println!("{:?}", squared_numbers);         }
             | 
             | vs.                   #include <stdio.h>              int
             | main() {             int numbers[] = {1, 2, 3, 4, 5};
             | int squared_numbers[5];                      // Squaring
             | each number             for (int i = 0; i < 5; i++) {
             | squared_numbers[i] = numbers[i] * numbers[i];             }
             | // Printing the squared numbers             printf("[");
             | for (int i = 0; i < 5; i++) {                 printf("%d",
             | squared_numbers[i]);                 if (i < 4) printf(",
             | ");             }             printf("]\n");
             | return 0;         }
             | 
             | Python and Rust here seem concise and elegant, but can be
             | more difficult to follow to the unaccustomed due to
             | obscurity. I am sure there are examples that involve more
             | higher-level abstractions, and reference implementations
             | typically seem to use a lot of said abstractions. In case
             | of this Rust code, abstractions (like iterators and
             | closures) can also make the code more challenging to follow
             | for those who are not accustomed to functional programming
             | paradigms.
             | 
             | What I am trying to say is that the C implementation is
             | more straightforward, IMO.
        
               | yazzku wrote:
               | [num**2 for num in numbers]
               | 
               | There's no way that C implementation is easier to read
               | than the Python one. There's no high-level abstraction
               | going on there and the list comprehension reads like
               | natural language.
               | 
               | I agree Rust often gets distracting with unwrapping and
               | lifetime notation.
        
               | johnisgood wrote:
               | I know it is easier to read, and it is not difficult to
               | figure out what is going on (in this case), but many
               | languages do not support list comprehensions (for
               | example), so you have to implement it in a way that is
               | similar to the C version, making the C version more
               | helpful.
               | 
               | The example above may not highlight what I was trying to,
               | however.
               | 
               | Perhaps Ada could work even better than C? It is verbose,
               | yes, but you will immediately know what is going on.
               | 
               | My point is, that in my opinion verbosity is better than
               | conciseness in cases of reference implementations.
        
               | SkiFire13 wrote:
               | > as Rust and Python tend to obscure too many details
               | behind high-level abstractions.
               | 
               | C on the other hand forces handling too many low level
               | details for a reference implementation IMO. Rust also
               | isn't really good in that regard.
               | 
               | In a reference implementation what needs to be made
               | explicit is the logic that needs to be implemented,
               | rather than details like memory management. For that I
               | would prefer something like Datalog instead.
        
               | johnisgood wrote:
               | > what needs to be made explicit is the logic that needs
               | to be implemented, rather than details like memory
               | management
               | 
               | I agree.
        
               | trealira wrote:
               | People who write C nowadays generally try to write it in
               | the most "boring" and easy-to-read way possible, I think.
               | But C can also be hard to understand if it uses a lot of
               | side-effectful expressions and pointer arithmetic. Just
               | look at "How Old School C Programmers Process Arguments",
               | which looks over an excerpt from K&R [0]. To make your
               | code slightly harder to read, while still being
               | plausible, you could change it to use pointer arithmetic,
               | like so:                 // Printing the squared numbers
               | printf("[");       for (int n = 5, *p = squared_numbers;
               | --n >= 0; p++)           printf("%d%s", *p, n > 0 ? ", "
               | : "]\n");
               | 
               | [0]: https://www.usrsb.in/How-Old-School-C-Programmers-
               | Process-Ar...
        
           | amelius wrote:
           | I personally think Haskell is a better choice because it is
           | purely functional.
        
       | tightbookkeeper wrote:
       | Awesome work!
       | 
       | I noticed that they are about as long and complex as the C
       | versions. In early C++/STL days part of the pitch was
       | reimplementing core utils in a much more concise way. In wonder
       | if the is a result of focusing on that application, or a
       | coincidence of design decisions.
        
       | mytec wrote:
       | I'm used to reading about a given C/C++ program being implemented
       | in Rust, and was delighted to see such an effort in a functional
       | programming language.
       | 
       | I know little about functional programming languages but I've
       | always been interested in how languages like Ada and now Rust can
       | help programmers write safer code. I'm curious what advantages a
       | rewrite of a C/C++ app in a FP language provides and also what
       | advantages a FP language brings in comparison to a language like
       | Rust.
        
       | Maro wrote:
       | I looked at the wc implementation, there's no way it's compatible
       | with the GNU tools, how it handles wide characters, which is
       | heavily ties to the idiosyncratic C implementation. I know
       | because I once wrote a wc implementation in modern C++ for my own
       | education, and that's where most of my time and code complexity
       | went. Also, the C versions are devilishly fast, with all flags,
       | with optimized code branches.
       | 
       | https://bytepawn.com/tag/wc.html
        
         | galangalalgol wrote:
         | Poking at the rust coreutils project it seems like they all
         | faster than the c versions. The issue is the decades of
         | features that have accumulated that someone somewhere uses but
         | seem safe to disregard because so few do.
        
           | Maro wrote:
           | There are millions of lines of shell scripts relying on those
           | flags.
        
             | chongli wrote:
             | I would place the blame for that on the GNU coreutils
             | project. At best, it's feature creep. At worst, it's
             | embrace and extend on the POSIX standard.
        
               | erik_seaberg wrote:
               | POSIX doesn't forbid new flags or options. It's up to the
               | author to read the spec and test his portability, or else
               | willingly rely on certain distros. Some GNU tools have
               | had strict modes as a courtesy (they used to jokingly
               | call it POSIX_ME_HARDER).
        
             | prmoustache wrote:
             | And those scripts should check they are indeed calling the
             | gnu coreutils version before they proceed.
        
           | Arcuru wrote:
           | They're faster? They must have been working on performance
           | then, maybe I should take another look at how that project
           | has been maturing.
           | 
           | When I last looked 'dd' was significantly slower, though I
           | did make it a bit closer a while back -
           | https://jackson.dev/post/rust-coreutils-dd/
           | 
           | A lot of the Rust coreutils were written by people learning
           | the language, and the code is often _more_ complicated than
           | the corresponding code for the Unix utils. They don't seem to
           | get enough experienced programmers fixing them up or usage
           | for me to actually recommend anybody use them, but maybe
           | that's changing.
        
             | mustache_kimono wrote:
             | > When I last looked 'dd' was significantly slower, though
             | I did make it a bit closer a while back -
             | https://jackson.dev/post/rust-coreutils-dd/
             | 
             | I read your blog. Maybe you should take a look at some of
             | the other utils. I worked on sort and ls and both have long
             | been faster than their GNU equivalents.
             | 
             | > They don't seem to get enough experienced programmers
             | fixing them up or usage for me to actually recommend
             | anybody use them, but maybe that's changing.
             | 
             | The issue is obviously compatibility and the uutils won't
             | be broadly "usable" or stable until is hits 100% PASS/0%
             | FAIL on the GNU test coverage, although the numbers are
             | getting better and better. See:
             | https://raw.githubusercontent.com/uutils/coreutils-
             | tracking/...
             | 
             | > A lot of the Rust coreutils were written by people
             | learning the language,
             | 
             | I really, really hate this way of framing the project --
             | which implicitly sounds like GNU is full of pros. Yes, lots
             | of code was written by people just learning Rust and
             | systems programming. Once it reaches parity on GNU test
             | coverage, which it is rapidly approaching, GNU coreutils
             | would wish it had this kind of enthusiasm behind it.
             | 
             | > and the code is often _more_ complicated than the
             | corresponding code for the Unix utils.
             | 
             | I think it really depends. Much of the code is much
             | simpler. For instance -- I think it's much easier to
             | generically do hashing in Rust, and it shows.
        
           | diath wrote:
           | ...and that's precisely why they are faster, if you can throw
           | out decades of compatibility flags you can avoid expensive
           | branching and such to make them faster. That's why such
           | projects should be called alternatives, and not replacements,
           | it's not a replacement if it cannot be seamlessly replaced
           | and decades worth of scripts can continue working.
        
             | 7bit wrote:
             | When is it the time to talk about alternatives replacing
             | the current apps in favor of faster apps? 10 years or rare
             | usage? Twenty? At some point the edge cases should be
             | covered by the scripts not by the app, only to make it
             | slower for everyone.
             | 
             | I get what you're saying, but I don't care how it's called.
             | Some things must die. Eg. Python 3 and the depreciation of
             | was a very controversial step, but ultimately the right
             | choice at the right time.
        
               | kanbankaren wrote:
               | > When is it the time to talk about alternatives
               | 
               | When all standards bodies and governments decide that
               | POSIX is a hindrance which might take a few decades. And
               | that is if they decide.
        
               | galangalalgol wrote:
               | The FBI just said everyone needs a roadmap by 2026 to
               | eliminate c/c++ from critical infrastructure.
        
               | cozzyd wrote:
               | Python 3 created a huge amount of unnecessary effort and
               | made a lot of people hesitant about using python again
        
               | andrepd wrote:
               | Well it's one of the most popular languages in the planet
               | so it cannot have gone that bad.
        
               | jhayward wrote:
               | Not a mainstream view.
               | 
               | Python 3 is a hugely successful language and
               | implementation, and almost no one regrets that it exists
               | aside from a few noisy holdouts, and people who never
               | liked any Python anywhere at any time.
        
               | cozzyd wrote:
               | I don't disagree that python3 is hugely successful, but
               | that doesn't mean the Python 2->3 pain was necessary.
               | Certainly many Python users started using it too recently
               | to remember it though.
        
               | erik_seaberg wrote:
               | python3 had no compatibility mode, so everyone needed
               | 100% of their dependencies to migrate. This was so
               | painful that some teams abandoned their legacy python2
               | code and reimplemented in languages with better back
               | compat stories.
        
             | wrs wrote:
             | No problem. Don't branch in the middle of the algorithm,
             | choose the appropriate optimized algorithm based on the
             | task. That's why we haven't had a separate fgrep binary for
             | decades.
        
           | johnisgood wrote:
           | Faster? I do not know, what I do know is that there has been
           | some really crazy vulnerabilities in one (or some) of these
           | utilities (logic errors).
        
         | anardil wrote:
         | You're right! Data.ByteString.Lazy is Word8 under the covers,
         | so wide characters are truncated. tr takes a similar short cut.
         | Swapping to Data.Text would fix that.
         | 
         | Where simplicity conflicted with compatibility, I've chosen the
         | former so far. Targeting the BSD options and behavior is
         | another example of that. The primary goal is to feel out the
         | data flow for each utility, rather than dig into all the edges.
        
         | johnisgood wrote:
         | Speaking of fast: https://redman.xyz/git/turbowc/tree/turbowc.c
         | seems to be really fast (it is not a replacement for wc,
         | however).
        
         | MBCook wrote:
         | The readme mentions feature parity with the BSD coreutils, so
         | I'm assuming that's the target. Not GNU.
        
       | parasti wrote:
       | Ive seen several coreutils reimplementations just using the name
       | "coreutils" for themselves. I doubt anyone in the original
       | project cares, but seems in bad taste to me.
        
         | bqmjjx0kac wrote:
         | I mean, it gets the point across and they're not claiming to be
         | "GNU coreutils". Maybe some derivative name would be better for
         | uniqueness.
        
       | xolve wrote:
       | TIL that GitHub usernames can end with a `-`
        
         | anardil wrote:
         | The rules were different in 2014 when I made my account! It's
         | actually quite annoying because lots of 3rd party GitHub
         | integrations puke immediately saying I have an invalid
         | username.
        
       | hello_computer wrote:
       | The original coreutils are very weak on wide-char formatting,
       | item delimiting (i.e. -null / -print0), flag name consistency,
       | discoverability, and overall input/output uniformity. Solving the
       | edge cases rigorously would probably require minor breakages, but
       | would be worthwhile.
       | 
       | Most coreutil re-workings i've seen either double-down on JSON
       | output, 24-bit color, sixels, & other lipsticks on the pig--
       | without addressing any of the basic breakages we've become
       | accustomed to--or they go off into an entirely different
       | direction like nushell.
       | 
       | There is definitely room for improvement, but it is a thankless
       | job.
        
         | Muromec wrote:
         | Why not have bug to bug compatible vetsion of coreutils, but in
         | rust and then have a nice thing too?
        
           | hello_computer wrote:
           | Most are simple programs. If bug-per-bug compat + memory
           | safety is the goal, carefully reviewing the C would have been
           | a vastly better investment of time compared to making 13,498
           | commits.
           | 
           | https://www.cvedetails.com/product/5075/GNU-
           | Coreutils.html?v...
           | 
           | https://github.com/uutils/coreutils
        
       | mhd wrote:
       | This reminds me of a similar attempt with Perl[0], done a few...
       | erm, twenty years ago. Might be interesting to do a three-way
       | comparison for a few well-covered cases.
       | 
       | [0]: https://metacpan.org/release/CWEST/ppt-0.14
        
       ___________________________________________________________________
       (page generated 2024-11-03 23:00 UTC)