[HN Gopher] Building a data compression utility in Haskell using...
       ___________________________________________________________________
        
       Building a data compression utility in Haskell using Huffman codes
        
       Author : lazamar
       Score  : 179 points
       Date   : 2024-07-04 04:29 UTC (18 hours ago)
        
 (HTM) web link (lazamar.github.io)
 (TXT) w3m dump (lazamar.github.io)
        
       | tankfeeder wrote:
       | https://rosettacode.org/wiki/Huffman_coding
        
       | lynx23 wrote:
       | Very nice read, thanks for sharing!
        
         | revskill wrote:
          | What's nice?
        
       | alwinaugustin wrote:
       | Thanks for sharing. Very nice and insightful.
        
       | londons_explore wrote:
        | For all readers: arithmetic codes are better in nearly all ways.
        | They can be implemented in less RAM and code, they achieve a
        | better compression ratio, and the probabilities of different
        | symbols appearing can be updated dynamically during the stream
        | far more easily.
       | 
       | The only reason Huffman codes are used is they were invented
       | first and arithmetic codes were patented. That patent has now
       | expired, so we should use the better design.
        
         | londons_explore wrote:
         | Two slight benefits of Huffman codes over arithmetic:
         | 
          | * They usually self-synchronize when some data is corrupted
          | (though this is not guaranteed, and it does not apply where the
          | Huffman table is dynamic)
         | 
         | * Neither Huffman nor arithmetic codes are easy to parallelize
         | the decoding of, but Huffman is slightly easier.
        
         | kqr wrote:
         | I was under the impression that arithmetic codes are guaranteed
         | to be _at least_ one bit less efficient than Huffman codes per
          | input block. What makes you say they have a better compression
          | ratio?
         | 
          | Are you thinking of pre-defined Huffman tables that aren't
          | adapted to the input? Because a table adapted to the input
          | ought to be as good as it gets.
         | 
          | (I agree with the other benefits. Since arithmetic coding
          | tables are built in a streaming fashion rather than as a
          | codebook constructed up front, they are more memory-efficient
          | while working.)
        
           | lifthrasiir wrote:
            | Huffman codes are conceptually isomorphic to arithmetic codes
            | where all probabilities are 2^-k with k an integer, so they
            | have an obvious disadvantage: the symbol distribution they
            | can represent is a coarser approximation of the true one.
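            | 
            | As a concrete illustration (my numbers, not from the
            | article), compare the entropy of a skewed two-symbol source
            | with the one whole bit per symbol that Huffman is forced to
            | spend on it:
            | 
            |     -- entropy of a discrete distribution, in bits
            |     entropy :: [Double] -> Double
            |     entropy ps =
            |       negate (sum [p * logBase 2 p | p <- ps, p > 0])
            | 
            |     -- entropy [0.9, 0.1] is about 0.469 bits per symbol,
            |     -- but Huffman must assign whole-bit code words, so it
            |     -- codes this source as if it were [0.5, 0.5]: 1 bit.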
        
             | SassyBird wrote:
             | Hopefully k is natural. ;)
        
               | lifthrasiir wrote:
                | Implied, because any symbol distribution whose
                | probabilities do not sum to 1 is invalid anyway ;-)
        
           | hcs wrote:
            | Huffman codes are less efficient per symbol since each symbol
            | maps to a whole bit string, whereas arithmetic coding
            | effectively smears symbols across bits more finely. Whether
            | you use a dynamic or static probability model is a separate
            | issue that applies to either coding method. (Emotionally
            | though I prefer Huffman codes, they're just so _neat_ )
        
         | lifthrasiir wrote:
          | If you do have an option to switch from Huffman, rANS is now
          | the way to go, not classical arithmetic coding.
        
         | lazamar wrote:
         | There is one way in which Huffman codes are better: they are
         | easier to explain and simpler to implement.
         | 
         | I went for simplicity of exposition in the post, but arithmetic
         | coders can indeed get arbitrarily close to the entropy, which
         | is not quite the case with Huffman.
        
           | nottorp wrote:
           | > easier to explain
           | 
            | I think Huffman is the one compression algorithm that
            | compresses significantly and still fits on the proverbial
            | napkin, so it's a good start.
           | 
           | The others require the whole napkin stack at the table.
        
         | userbinator wrote:
         | LZ is even better. Neither arithmetic nor Huffman will compress
         | when the probability of all symbols is the same, but LZ will
         | find repetitions easily. LZ also decompresses extremely quickly
         | --- _faster than memcpy_ is often mentioned.
        
           | lazamar wrote:
           | Indeed, but worth noting that LZ is a modelling scheme,
           | whilst Huffman is a coding technique.
           | 
            | That is, LZ determines, dynamically as it goes, the set of
            | elements we want to encode and their probabilities. Then you
            | need a coder, like Huffman, to actually encode them.
           | 
           | In the post I used a semi-static zero-order byte-based model.
           | Which means I counted the byte occurrences first and just
           | used that count for the probabilities throughout all of the
           | encoding. Then I used Huffman codes to translate those
           | probabilities into bits.
           | 
           | But I'm considering writing a follow-up changing this static
           | model for an LZ77 one as I think that would be fun.
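            | 
            | For readers who want the shape of it, here's a minimal
            | sketch of that semi-static model (not the post's actual
            | code; the names are mine): count the bytes, then repeatedly
            | merge the two lightest trees until one remains.
            | 
            |     import qualified Data.ByteString as BS
            |     import qualified Data.Map.Strict as Map
            |     import Data.List (sortOn)
            |     import Data.Word (Word8)
            | 
            |     -- zero-order model: how often each byte occurs
            |     byteCounts :: BS.ByteString -> Map.Map Word8 Int
            |     byteCounts =
            |       BS.foldl' (\m b -> Map.insertWith (+) b 1 m) Map.empty
            | 
            |     data HTree = Leaf Int Word8 | Node Int HTree HTree
            | 
            |     weight :: HTree -> Int
            |     weight (Leaf w _)   = w
            |     weight (Node w _ _) = w
            | 
            |     -- assumes at least one distinct byte; a naive O(n^2)
            |     -- merge, which is fine for a 256-symbol alphabet
            |     huffmanTree :: Map.Map Word8 Int -> HTree
            |     huffmanTree =
            |         go . map (\(b, w) -> Leaf w b) . Map.toList
            |       where
            |         go [t] = t
            |         go ts  =
            |           let (a : b : rest) = sortOn weight ts
            |           in  go (Node (weight a + weight b) a b : rest)
            | 
            |     -- walk the tree to read off each byte's code word
            |     codes :: HTree -> [(Word8, [Bool])]
            |     codes (Leaf _ b)   = [(b, [])]
            |     codes (Node _ l r) =
            |       [ (b, False : c) | (b, c) <- codes l ] ++
            |       [ (b, True  : c) | (b, c) <- codes r ]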
        
           | d_burfoot wrote:
           | > LZ is even better. Neither arithmetic nor Huffman will
           | compress when the probability of all symbols is the same
           | 
           | Comparing LZ to arithmetic encoding is a category error. LZ
           | and Huffman are combined modeling+encoding methods, while
           | arithmetic is just an encoding method, and it can be combined
           | with any modeling technique. Arithmetic plus a suitable
           | modeling technique will achieve compression as good as LZ,
           | Huffman, or any other scheme. The PAQ8 compressors, and I
           | believe its successors in the Hutter Prize ranking, use
           | arithmetic plus a very advanced modeling scheme.
           | 
           | http://prize.hutter1.net/hfaq.htm#paq8
        
       | mrkeen wrote:
       | There exists an array-based, in-place algorithm for this,
       | reducing the need to allocate trees and chase pointers.
       | 
       | I mention this only because, when I learned the tree-based
       | approach at uni, I simply wasn't aware that there was another way
       | to do it, and I'm wondering how many of you that's true for as
       | well.
       | 
       | While the tree approach is intuitive and illuminating, it
       | probably makes more sense to work with in-place arrays, since the
       | situations when you care most about compression are probably the
       | situations when you have a lot of data and want to run fast.
        | 
        |     In-Place Calculation of Minimum-Redundancy Codes
        |     Moffat, Katajainen. 1995.
        |     http://hjemmesider.diku.dk/~jyrki/Paper/WADS95.pdf
        
         | lifthrasiir wrote:
         | > In-Place Calculation of Minimum-Redundancy Codes
         | 
         | Or in general, refer to "On the Implementation of Minimum
         | Redundancy Prefix Codes" by Moffat and Turpin (1997), as
         | strongly recommended and later explained by Charles Bloom [1].
         | 
         | [1] https://cbloomrants.blogspot.com/2010/08/08-12-10-lost-
         | huffm...
        
           | lazamar wrote:
           | Thanks for the link. I was motivated to write the post after
           | reading Moffat's book 'Managing Gigabytes'. A pearl from the
           | 90's.
           | 
           | The authors mention this technique in the second edition.
        
         | mjan22640 wrote:
         | > and I'm wondering how many of you that's true for as well
         | 
         | the phrasing sounds like a list comprehension
        
           | agumonkey wrote:
           | true, tickles my brain in all kinds of funny ways
        
         | agentultra wrote:
         | It's mentioned at the end and left as an exercise to the
         | reader.
        
         | userbinator wrote:
         | The JPEG standard ITU T.81 (1992) has a description of the
         | algorithm in flowcharts, so the knowledge of array-based
         | Huffman was probably already somewhat common in the 80s.
        
       | banish-m4 wrote:
        | Last time I used Huffman codes, it was to run a MICMAC processor
        | macroprogram (assembly text) in the fewest microcycles and to
        | use the fewest microinstructions in the microprogram
        | (microcode). So starting with a histogram of the
       | macroinstructions executed (IIRC, I first wrote an interpreter in
       | C to count how many of each were executed), I crafted a
       | progressive decoding microcode program to implement all of the
       | required ISA macro-operations. IIRC, the macro instruction ISA I
       | created was bit-granular instead of byte-oriented. In the real
       | world, it would've been slow and inconvenient. What's nice about
        | Huffman codes is that you can vary the prefix depth based on the
        | distribution of values, so you don't have to have lopsided codes
        | based on 1-bit prefixes.
       | 
       | Also, the microprogram had to deal with branch prediction because
       | it was a non-superscalar pipelined processor model. Guess the
       | wrong branch, and enjoy wasting cycles on a pipeline stall while
       | the correct branch filters forward.
        
       | atlintots wrote:
        | This is great! Are there any other similar tutorials going
        | through writing a Haskell program, but with some more advanced
        | features (monad transformers, lenses, etc.)?
        
         | trealira wrote:
         | I would recommend the book _Haskell in Depth_ , which covers
         | both of those topics (monad transformers by chapter 6, lenses
         | in chapter 3 and chapter 14). It also covers some other
         | advanced features, like Template Haskell and concurrency, and
         | has a chapter dedicated to working with SQL databases in
         | Haskell.
        
         | mirpa wrote:
          | You might try https://github.com/turion/rhine-koans. It is a
          | tutorial for the FRP library Rhine, well commented and with
          | tests.
        
       | tromp wrote:
       | > To make it unambiguous we must make sure that no code word is a
       | prefix of another code word.
       | 
       | Technically, this is not quite correct. The class of so-called
        | uniquely decodable codes is unambiguous, and a superset of the
       | prefix codes. One simple example of a uniquely decodable code is
       | the reverse of a prefix code. For the example in the article that
        | would be
        | 
        |     a 1
        |     b 00
        |     c 10
       | 
       | While the code for a is a prefix of the code of c, one can still
       | unambiguously decode any code sequence by processing it in
       | reverse order. It would be interesting to see a uniquely
       | decodable code that is neither a prefix code nor one in reverse.
        
         | lazamar wrote:
         | That's interesting. I guess this is not usually used because
         | you may have a long string of bits that is ambiguous till you
         | get to a disambiguating bit.
         | 
         | Something like
         | 
         | `100000000000000001`
         | 
          | In this case, to know whether the first code was an `a` or a
          | `c`, you have to read all the way to where the zeroes end.
        
         | n4r9 wrote:
          | It's a weird example, but what about
          | 
          |     a 1
          |     b 101
         | 
         | ?
         | 
         | It is neither prefix-free nor suffix-free. Yet every occurrence
         | of 0 corresponds to an occurrence of b.
         | 
         | However, this is obviously inefficient. So I guess the question
         | is whether there's an optimal code which is neither prefix-free
         | nor suffix-free.
         | 
         | --------------
         | 
         | EDIT
         | 
         | I did some googling and found this webpage
         | https://blog.plover.com/CS/udcodes.html where the author gives
         | the following example of a uniquely decodable code:
          |     a 0011
          |     b 011
          |     c 11
          |     d 1110
         | 
         | I guess this is "almost" prefix-free since the only prefix is c
          | of d. If a message starts with 1, you could find the first 0
         | and then look at whether there's an odd or even number of 1's.
         | So I think I can see how it's uniquely decodable. However, my
         | crypto knowledge is too rusty to remember how to show whether
         | this is an optimal code for some probability distribution.
        
           | imurray wrote:
           | That code in the EDIT is suboptimal. It doesn't saturate the
           | Kraft inequality. You could make every codeword two bits and
           | still encode 4 symbols, so that would be strictly better.
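            | 
            | Concretely (my arithmetic, spelled out): the lengths above
            | are 4, 3, 2 and 4, and
            | 
            |     ghci> sum [2 ** negate l | l <- [4, 3, 2, 4]]
            |     0.5
            | 
            | which is well short of the limit of 1, so a strictly better
            | set of code word lengths exists.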
        
             | n4r9 wrote:
             | Ah of course. Thanks for the insight. About 15 years since
             | I studied this stuff!
        
         | imurray wrote:
         | > It would be interesting to see a uniquely decodable code that
         | is neither a prefix code nor one in reverse.
         | 
         | More interesting than I thought. First the adversarial answer;
         | sure (edit: ah, I see someone else posted exactly the same!):
          |     a 101
          |     b 1
         | 
          | But it's a bad code, because we'd always be better off with
          | a=1 and b=0.
         | 
         | The Kraft inequality gives the sets of code lengths that can be
         | made uniquely decodable, and we can achieve any of those with
         | Huffman coding. So there's never a reason to use a non-prefix
         | code (assuming we are doing symbol coding, and not swapping to
         | something else like ANS or arithmetic coding).
         | 
         | But hmmmm, I don't know if there exists a uniquely-decodable
         | code with the same set of lengths as an optimal Huffman code
         | that is neither a prefix code nor one in reverse (a suffix
         | code).
         | 
         | If I was going to spend time on it, I'd look at
         | https://en.wikipedia.org/wiki/Sardinas-Patterson_algorithm --
         | either to brute force a counter-example, or to see if a proof
         | is inspired by how it works.
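          | 
          | For anyone who wants to try the brute force, here's a minimal
          | sketch of Sardinas-Patterson (my code, not from the thread;
          | it assumes code words represented as Strings): keep taking
          | "dangling suffixes" and reject as soon as one of them is a
          | code word.
          | 
          |     import Data.List (stripPrefix)
          |     import qualified Data.Set as Set
          | 
          |     -- what remains of a word in ys after removing a full
          |     -- prefix taken from xs
          |     dangling :: [String] -> [String] -> [String]
          |     dangling xs ys =
          |       [ s | x <- xs, y <- ys
          |           , Just s <- [stripPrefix x y], not (null s) ]
          | 
          |     uniquelyDecodable :: [String] -> Bool
          |     uniquelyDecodable code =
          |         Set.size codeSet == length code  -- no duplicates
          |         && go Set.empty (Set.fromList (dangling code code))
          |       where
          |         codeSet = Set.fromList code
          |         go seen s
          |           | any (`Set.member` codeSet) (Set.toList s) = False
          |           | Set.null s || s `Set.member` seen = True
          |           | otherwise = go (Set.insert s seen) s'
          |           where
          |             s' = Set.fromList
          |               (dangling code (Set.toList s) ++
          |                dangling (Set.toList s) code)
          | 
          | On the examples in this thread it agrees: ["101", "1"] comes
          | back True, while a textbook ambiguous code like
          | ["0", "01", "10"] (where "010" parses two ways) comes back
          | False.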
        
       | benreesman wrote:
       | Haskell is a really nice language. In general I don't identify as
       | an X programmer for any value of X: I tend to write in a half
       | dozen languages daily and they all suck in their own special way.
       | 
       | But on two separate occasions I made important career decisions
       | with opportunity cost to work with highly lethal GHC
       | contributors: those people are just really good.
       | 
        | If Haskell sucks like all languages, it's because Haskell excels
       | at using computers to _compute_ something: Haskell considers data
       | shuffling a strictly secondary concern compared to doing actual
       | computations.
        
         | random3 wrote:
         | How do you distinguish data shuffling from computation? What's
         | actual computation from this perspective?
        
           | tossandthrow wrote:
           | Philosophically speaking there is no difference.
           | 
            | What the parent commenter probably means is that you think
            | in terms of computations and not in terms of data units.
           | 
           | And that is just tremendously elegant.
        
             | 082349872349872 wrote:
             | Philosophically speaking there's a great difference.
             | 
             | Data shuffling doesn't --in principle-- lose information;
             | computation does. ("evaluation is forgetting")
             | 
             | In https://news.ycombinator.com/item?id=32498382 "glue
             | code" and "parsley code" are data shuffling, while "crunch
             | code" is computation.
        
           | lucianbr wrote:
           | Reading a row from a database and putting it on the screen,
           | and reading some numbers from the keyboard and putting them
           | in the database. These things I would not call computation. I
           | mean sure, displaying needs to compute coordinates for where
           | to light up pixels, but that's all already written. I just
           | call it. Same with updating btrees when writing to the db.
           | 
            | I'm guessing that if all you do is this kind of db - screen -
            | keyboard and back stuff, Haskell is not very useful, if not
            | actively a hindrance.
        
           | mrkeen wrote:
           | Before I was good at Haskell, I would approach a data-
           | processing job sequentially based on the next thing that
           | needs to be done.
           | 
           | I want to open a file, and I can't read it all at once, so
           | I'll use a FileReader and it should be buffered, so I'll wrap
           | it with a BufferedReader. I'll use try-with-resources and
           | click into the classes because I can't remember if the
           | contract of the outermost reader is that it will close the
           | inner readers too.
           | 
           | Right, now I'll grab the next n bytes from the stream, and
           | start thinking about the algorithm. Swear a bit when I think
           | about crossing the buffer boundaries, and on-and-on...
           | 
           | The IO concerns are very much interwoven with the algorithm.
           | 
           | In Haskell I just start by writing one function from bytes to
           | bytes. That's the computation. Then when that's done I expose
           | that function as bytes to bytes.
           | 
           | Others can hook it up to files, webservers, pipe it through
           | gzip, whatever!
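            | 
            | A toy sketch of that wiring (with id standing in for the
            | real computation):
            | 
            |     import qualified Data.ByteString.Lazy as BL
            |     import System.Environment (getArgs)
            | 
            |     -- all the algorithmic work goes in one pure function
            |     -- from bytes to bytes
            |     transform :: BL.ByteString -> BL.ByteString
            |     transform = id
            | 
            |     -- IO is a thin shell around it; lazy ByteStrings take
            |     -- care of the buffering concerns
            |     main :: IO ()
            |     main = do
            |       [input, output] <- getArgs
            |       BL.readFile input >>= BL.writeFile output . transform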
        
         | blandblender wrote:
         | > I tend to write in a half dozen languages daily
         | 
         | 6 languages per day? Are they the same six the next day?
         | 
         | > they all suck in their own special way.
         | 
         | Not surprising if you're writing 6 different languages per day.
         | 
         | > Haskell excels at using computers to compute something
         | 
         | Can you please explain how Haskell computes medians more
         | elegantly than say C?
        
       | ykonstant wrote:
       | Hey, since this is likely to attract Haskell programmers: how
       | fast is Haskell these days for a programmer intent on writing
       | optimized code? I am particularly interested in its performance
       | for numerical crunching like matrix operations and other stuff
        | that benefits from SIMD.
        
         | epgui wrote:
         | I'm not the best person to answer this question, but AFAIK it's
         | very very fast (in the rough vicinity of C). But also memory-
         | hungry.
        
           | freilanzer wrote:
           | I'm pretty sure highly optimised code won't be elegant
           | Haskell code though.
        
             | rebeccaskinner wrote:
             | Highly optimized code tends to be inelegant in any
             | language. That said, you can get really good performance
             | from very elegant looking "normal" Haskell too. The big
             | challenge with performance in Haskell is that the
             | differences between optimized and unoptimized code can be
             | pretty subtle if you're not used to thinking about Haskell
             | performance.
        
         | Iceland_jack wrote:
         | I met Sam Derbyshire at ZuriHac who told me all the difficult
         | architectural work had been done for SIMD support.
         | 
         | + https://gitlab.haskell.org/ghc/ghc/-/issues/7741
         | 
          | It might make it into GHC 9.12 (for 128-bit vectors only, and
         | mostly floating-point operations unless other people come in
         | and contribute).
         | 
         | The patch is at:
         | 
         | + https://gitlab.haskell.org/ghc/ghc/-/merge_requests/12860
        
           | ykonstant wrote:
           | Thanks for the info!
        
         | mrkeen wrote:
         | I like Haskell performance for every-day backend/web and CLI
         | stuff. But I drop down into Rust when I'm writing something
         | performance-focused.
         | 
          | That said, Haskell's no slouch. Here's a small program to count
          | the 1-bits in a file (imports added here; they assume the
          | vector-mmap package for unsafeMMapVector):
          | 
          |     import Data.Bits (popCount)
          |     import Data.Word (Word64)
          |     import System.Environment (getArgs)
          |     import qualified Data.Vector.Storable as V
          |     import Data.Vector.Storable.MMap (unsafeMMapVector)
          | 
          |     -- mmap the whole file as a vector of Word64 and sum
          |     -- the popcount of each word
          |     main :: IO ()
          |     main = do
          |       content <- getArgs >>= \[a] -> unsafeMMapVector a Nothing
          |       print (vectorPopCount content)
          | 
          |     vectorPopCount :: V.Vector Word64 -> Int
          |     vectorPopCount = V.foldl' (+) 0 . V.map popCount
         | 
         | When you compile with -msse4.2, it will correctly use the
         | hardware popcount instruction, and crunches through a 1GB input
         | file in 0m0,090s. Rounding to the nearest MB, it uses 0 heap.
         | (For the curious, if I compile without -msse4.2 it runs in
         | 0m0,293s).
         | 
         | I haven't tried crunching matrices, but I would start by
         | checking out repa, accelerate, or massiv.
         | https://hackage.haskell.org/package/repa
         | https://hackage.haskell.org/package/accelerate
         | https://hackage.haskell.org/package/massiv
        
           | ykonstant wrote:
           | The lack of heap allocations is great! Thanks for the
           | pointers.
        
         | lazamar wrote:
         | Haskell's speed can be competitive with systems languages but
         | keep in mind that its killer feature is ease of abstraction.
         | 
         | The idea is that it is simple to assemble multiple parts into a
          | coherent, well organised program. Which is important for the
          | entirety of the program, not just the tight loop.
         | 
         | So, with the nice FFI Haskell has, you can always drop down to
         | languages without a GC for inherently imperative optimisations.
         | Then you wrap that into a library with nice types and you can
         | now leverage that raw power anywhere in your Haskell code where
         | the types will match.
         | 
         | I worked at Meta in a high performance Haskell application and
         | that's what we did. Wrote beautiful, large, fast Haskell
         | programs which in some specialised parts had C++ building
          | blocks. 99% of the time was spent in Haskell land composing
         | things into more and more useful applications.
        
         | wyager wrote:
         | The reality is that for any language, including C, compiler
         | optimized code will never be as fast as hand optimized code in
         | libraries like BLAS. So at some level, the choice of host
         | language doesn't matter very much, because you're going to be
         | outsourcing all of the computation anyway if you're really
         | serious about speed. This is the same reason all the AI stuff,
         | possibly the single largest consumer of compute in the world,
         | gets away with being written in python except for the low level
         | compute libraries.
         | 
          | To answer your question directly: the GHC compiler is very
          | good. High-level code will perform very well, and for most
          | realistic applications, performance bottlenecks are
          | architectural, not e.g. the use of single-width versus SIMD
          | instructions, and the "architectural asymptotics" of Haskell
          | are very favorable. I think GHC has/is getting SIMD support,
          | but that's not what I would focus on when evaluating perf.
         | 
         | I wouldn't write a matrix multiplication algorithm in Haskell,
         | but I also wouldn't write one in rust or C if I was serious
         | about speed.
         | 
         | Many focus on number crunching as a performance metric, but
         | almost no one is ever actually bottlenecked on that, and if
         | they are it doesn't really matter what high level language
         | they're using.
        
           | GrantMoyer wrote:
           | > The reality is that for any language, including C, compiler
           | optimized code will never be as fast as hand optimized code
           | 
           | That's not strictly true; sometimes a C compiler can optimize
           | away my whole program: https://godbolt.org/z/oG5nfGE6z
        
             | wyager wrote:
             | You're doing extremely simple constant arithmetic here. GHC
             | can optimize this type of thing away to nothing as well.
             | Are we talking about contrived examples or real numerical
             | bottlenecks?
        
         | ReleaseCandidat wrote:
         | It's doable but harder than in imperative languages and the
         | resulting code is _really_ ugly. See the thread of the 1BRC
         | https://discourse.haskell.org/t/one-billion-row-challenge-in...
         | with example code at (I hope that's the right Gist)
         | https://gist.github.com/AndrasKovacs/e156ae66b8c28b1b84abe6b...
        
         | CyberDildonics wrote:
         | If you want to write fast stuff that takes advantage of SIMD,
         | you want to use ISPC, which is made for high performance SIMD.
          | Using Haskell for this would be like doing surgery with a dull
          | rock; you will never get what you want.
        
         | tkz1312 wrote:
         | Haskell really shines when you want to write high level,
         | declarative code. Performance when using this style is
         | generally fine for CLI / web backend style stuff. It has the
         | tools to write pretty fast low level code, but they're fairly
         | clunky and if that's all you want to write it's probably not
         | going to be the best tool for the job. They're pretty nice if
         | you have a few focused hotspots you need to optimize.
         | 
         | It has pretty nice CPU profiling tools, so finding and
          | optimizing CPU hotspots is fairly pleasant. Tracking down rogue
         | memory leaks (which lazy evaluation makes more likely) on the
         | other hand can be extremely frustrating.
         | 
         | If you look at the benchmarks game results [1], the fastest
          | Haskell implementations are generally between 2 and 5 times
         | slower than the fastest c versions, and will be written in a
         | highly imperative style.
         | 
         | [1]: https://benchmarksgame-
         | team.pages.debian.net/benchmarksgame/...
        
         | wredue wrote:
          | If you're writing idiomatic Haskell, its performance is
          | terrible.
         | 
         | If you're choosing to fight with Haskell, why? Just use
         | something else.
         | 
         | To understand why people claim Haskell is "fast", you need to
         | understand what they mean. What they mean is "if you opted to
          | write C in such a way that you were performing similar amounts
          | of copious and useless data copying, pointer following, and
          | stack-blowing madness, Haskell will perform that fast". They
          | are not saying "Haskell is as fast as the fastest idiomatic C
          | implementation".
         | 
          | Another thing you're going to see a lot of is extremely simple
          | anecdotes, such as counting in a loop, or favourable
          | measurement points (they will measure the whole C program, but
          | just the part of the Haskell program after the data has been
          | flattened, for example, stating "we just want to compare those
          | parts").
        
       | bdahz wrote:
       | How is the performance when compared to similar implementations
       | in C/C++ or Rust?
        
         | lazamar wrote:
         | I'd say unbeatable!
         | 
         | The goal was simplicity of implementation and code clarity. For
         | this kind of thing I say Haskell performs the best.
        
           | mrkeen wrote:
           | That wasn't really the spirit of the question as I read it.
           | 'Performance' has a narrower definition than that.
        
             | zarathustreal wrote:
             | The point they're making is that there is no performance
             | without tradeoffs and "fast" is meaningless unless you
              | define what you're measuring. Asking the question implies a
              | misunderstanding of the intent of the implementation; OP
              | was trying to subtly let them know.
        
           | bdahz wrote:
              | For the simplicity of implementation and code clarity, I
              | need to know how much I have to pay.
              | 
              | If the Haskell implementation is 3x slower than a
              | C/C++/Rust implementation, that would be acceptable.
              | 
              | If it's 30x slower, I would rather choose C/C++/Rust even
              | if the implementation won't be simple.
              | 
              | If it is even possible to be 3x faster than C/C++/Rust,
              | then why haven't mainstream programmers adopted Haskell
              | everywhere?
        
             | lazamar wrote:
             | The goal of this implementation is not to be fast, but to
             | be clear.
             | 
              | I am doing some inefficient things (like two-pass encoding)
             | on purpose to keep things simple and clear. So using this
             | particular piece of code to judge a language's performance
             | potential is not really the way to go here.
        
             | rebeccaskinner wrote:
              | The general rule of thumb I'd give is that a performance-
              | aware but not micro-optimized Haskell program will
              | typically run in about 2x to 5x the time of a comparable C
              | program, and will take somewhere between 2x and 10x as much
              | memory. For a naive Haskell program the range is much
              | bigger: maybe 2x to 10x as much time and 10x to 1000x as
              | much memory (it's easy to do a lot of allocations in
              | Haskell).
             | 
             | For extremely optimized Haskell you can get close to the
             | speed of C, but there's still a garbage collector.
             | 
              | There are also certain classes of problem where a naive
              | Haskell implementation can beat other languages, including
              | C, by a mile, if you use the same implementation in both
              | languages. Laziness can be really great sometimes. This
              | doesn't happen much in practice, though, because the kind
              | of code that's really efficient with lazy evaluation is
              | very obviously inefficient in a strict language, so people
              | don't usually write code that way.
             | 
              | In the end I'd say Haskell is a good choice for a
              | performance-sensitive but not performance-critical
              | program. In a larger Haskell application, if you have a
              | performance-critical bit you can usually write Haskell
              | code that will be fast enough if you know what you're
              | doing. For something stand-alone that needs to be as fast
              | as possible, or the most critically performance-sensitive
              | parts of a bigger application, I'd consider using C or C++.
        
               | ReleaseCandidat wrote:
               | To rephrase using my experience: "performance aware"
               | Haskell is about as "fast" as Go, but needs more memory,
               | and both are slower than the same Java code - but both
               | are way more fun to write ;). Optimising Go is easier for
                | most people though; in Haskell you _really_ need to know
               | Haskell internals (how to read core and write unboxed
               | code) and understand laziness.
               | 
               | My try of the 1 billion row challenge in Go and Haskell
               | (and C) and comparison to other, fast implementations:
               | https://github.com/Release-Candidate/1-billion-row-
               | challenge...
        
       | chvrchbvrner wrote:
       | I think there is a typo in the table of the "Creating prefix-free
       | codes" section. D should be '0010' (not '0110').
       | 
       | Otherwise a great read, thanks!
        
         | polytely wrote:
          | Aha, that makes sense; I was wracking my brain as to how 0110
          | was unambiguous.
        
         | lazamar wrote:
         | Fixed it. Well spotted!
        
       | 2-3-7-43-1807 wrote:
        | and yet another episode in the series of "look what I did with
        | Haskell"
        
       | goldfishgold wrote:
        | Coursera's functional programming course (in Scala) includes a
        | pretty similar Huffman coding assignment with an autograder, if
        | anybody wants to take a stab at it themselves.
       | 
       | https://www.coursera.org/learn/scala-functional-programming?...
        
       | polterguy1000 wrote:
       | Oh, Hacker News, you've done it again. A riveting discussion on
       | building a data compression utility in Haskell using Huffman
       | codes. Because nothing screams "cutting-edge tech" like rehashing
       | algorithms from the 1950s in a language that most people use to
       | feel intellectually superior.
       | 
       | Let's dive into this treasure trove of comments, shall we?
       | 
       | Array-based, in-place algorithm: Oh, wow, an algorithm that
       | reduces the need to allocate trees and chase pointers. Because,
       | you know, memory management is for plebs.
       | 
       | Moffat and Katajainen's paper: Because nothing says "I'm a true
       | nerd" like referencing academic papers from the 90s. Next,
       | they'll be quoting ancient Greek philosophers on software design.
       | 
       | JPEG standard ITU T.81 (1992): Ah, the good old days when JPEG
       | was the pinnacle of image compression. Let's all take a moment to
       | appreciate how far we've come. Or not.
       | 
       | Uniquely decodable codes: Because clearly, the world needed a
       | deep dive into the nuances of prefix-free and suffix-free codes.
       | I'm sure this will revolutionize the way we... do something.
       | 
       | Functional programming course in Scala: Oh, joy! More academic
       | exercises in a language that's as fun to write as it is to
       | pronounce. Because nothing says "practical" like autograders and
       | theoretical assignments.
       | 
       | Haskell performance: The eternal debate. Is it fast? Is it slow?
       | Does it even matter when you're using it to write code that only
       | other Haskell enthusiasts will ever read?
       | 
       | Arithmetic codes vs. Huffman codes: Because clearly, the world
       | needed another debate on which compression algorithm is
       | marginally better. Spoiler alert: neither will make your life
       | significantly better.
       | 
       | LZ compression: Ah, yes, the savior of all things compressed.
       | Because why use a simple algorithm when you can use a complex one
       | that requires a PhD to understand?
       | 
       | In conclusion, if you want to spend your time debating the finer
       | points of data compression in a language that 99% of developers
       | will never touch, this thread is your playground. For the rest of
       | us, maybe stick to something more practical--like Hyperlambda.
       | 
       | Thomas Ai-Ai Shitboat Handsom
       | 
       | "To compress is human; to really overcomplicate it requires
       | Haskell." - Reversed and paraphrased for your amusement.
       | 
       | PLEASE FORGIVE ME https://ainiro.io/blog/how-to-create-an-ai-ai-
       | shitboat-in-30...
        
         | dist-epoch wrote:
         | Arithmetic codes are not marginally better.
         | 
          | Asymmetric numeral systems, which are related to arithmetic
          | coding, were a real breakthrough and are used in all modern
          | compressors.
        
         | VMG wrote:
         | Downvoted not because of AI but because of snark and negativity
        
       ___________________________________________________________________
       (page generated 2024-07-04 23:00 UTC)