[HN Gopher] So Long Surrogates: How We Moved to UTF-8 in Haskell
___________________________________________________________________
So Long Surrogates: How We Moved to UTF-8 in Haskell
Author : wofo
Score : 103 points
Date : 2022-04-27 15:55 UTC (7 hours ago)
(HTM) web link (www.channable.com)
(TXT) w3m dump (www.channable.com)
| nerdponx wrote:
| This site saves 26 "statistics" cookies and 99 "marketing"
| cookies.
|
| Really? Is all that necessary?
| meetups323 wrote:
| Your user agent saves the cookies. If you don't like it, change
| it.
| hombre_fatal wrote:
| I do think cookies get unfair treatment.
|
| They are things that your browser happily rebroadcasts back
| to the server with no real UI for it outside of the shitty
| devtool bar made for devs, even after all this outcry about
| cookies.
|
| It reminds me of the meme of the guy riding a bicycle,
| throwing a branch into the spokes (rebroadcasting cookies),
| and then roaring in pain on the ground about how evil
| websites/advertisers are tracking him with cookies.
|
| That said, what a lame HN thread on a post about Haskell.
| eklavya wrote:
| I have come to accept it and just ignore it. Many times
| there would be a long thread having to do absolutely
| nothing about the topic at hand. Not a tangent but like
| completely unrelated, why are we even discussing this here
| kind of thing.
|
| I wish there was a good way to visually differentiate when
| a new top comment starts except by squinting and figuring
| out the whitespace from the left of the mobile screen, more
| painful than necessary I presume.
| Mindless2112 wrote:
| Use HackerWeb. Top-level comments are highlighted, and it
| automatically collapses threads to show only top-level
| comments when there are a lot of comments.
|
| https://hackerweb.app/#/item/31181595
| hombre_fatal wrote:
| I couldn't help myself. Always desperate to find an
| opportunity to shove my 2 cents into the world. So
| imagine my glee when you provided me with another one!
|
| Yeah, I think both (A) defaulting to auto-expanded
| threads and (B) making them annoying to collapse make HN
| worse than it could be.
|
| You tend to read the top-level thread because it's
| already there. And then it ends up being longer than you
| expected, or you're trapped in a subtree that just won't
| end, or you just want to see what other people are
| saying. And there's no good way to move past it.
|
| Would be nice to click the indentation to collapse the
| thread anywhere inside the tree.
| boogies wrote:
| I just scroll to the top and use the "next" link on the
| top comment (added with the prev and context links around
| October 27-28th last year I think).
| bawolff wrote:
| Ignoring the privacy bit - 125 cookies is quite a bit of per
| request overhead, especially in HTTP/1.1 where they are not
| compressed. I would say it's poor website design.
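A rough way to put a number on that per-request overhead is sketched below. The cookie names and the 32-character values are made up for illustration; the real cookies on the site are not known here, so only the order of magnitude is meaningful.

```python
# Hypothetical cookies: 125 entries with invented names and 32-char values.
cookies = {f"trk_{i:03d}": "x" * 32 for i in range(125)}

# An HTTP/1.1 request sends them as one header line, pairs joined by "; ".
header = "Cookie: " + "; ".join(f"{name}={value}" for name, value in cookies.items())

# Size in bytes that would be repeated, uncompressed, on every request.
header_bytes = len(header.encode("ascii"))
print(header_bytes)  # 5256
```

Under these assumptions the header is about 5 KB per request. HTTP/2 compresses headers with HPACK, which is why the cost is most visible on HTTP/1.1.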
| Rygian wrote:
| Why shift the burden on the user and the user agent? The
| website is the only one to blame here.
| meetups323 wrote:
| Blaming the website for your own agent doing something you
| don't want it to is learned helplessness.
|
| Every marketing cookie generates revenue for the website in
| some way or another. The website wants revenue, so it asks
| the user agent to maintain those cookies. The user agent
| agrees. Then the operator of the user agent gets upset that
| the website asked their agent to store the cookies? Get
| upset that your agent agreed, not that a request was made.
|
| Or better yet, don't get upset at all and just solve the
| darned problem yourself. Is this Hacker News or Complier
| News?
| jstimpfle wrote:
| Cookies as a mechanism are useful and required for a
| solid modern web experience. However, tracking cookies
| are arguably the opposite of that. A typical modern
| website with marketing comes with, I don't know, 100s of
| cookies. Are you really arguing that the user should be
| required to vet each individual cookie whenever following
| a link with unvetted cookies?
|
| Or how do you solve this problem? Personally, the most I
| can be arsed to do is install some Adblock Plugin. I did
| that only a few months ago and I'm not even sure that it
| improved my experience by a lot.
| jimmaswell wrote:
| There is no problem to solve, the cookies can't hurt you
| and the website needs to stay afloat.
| jstimpfle wrote:
| To state the obvious, some people don't love the
| extensive profiles that are created of them.
| eternityforest wrote:
| Those people should be able to avoid the profiling, but
| any solution should be aimed at protecting those people,
| without impacting the 95% who don't care enough to give
| up convenience or pay for private services too much.
| jstimpfle wrote:
| Maybe my view is warped (I'm from Germany) but 95% seems
| a tad high...
| zasdffaa wrote:
| > and required for a solid modern web experience
|
| Absence of cookies doesn't make things unstable (non-
| solid?), and fuck knows what 'modern' is supposed to
| mean, or why it's good.
|
| > Or how do you solve this problem?
|
| Block all cookies except for rare moments like posting on
| HN, which then immediately get deleted. And no JS, which
| means CPU is trivial (so no burn-a-core-for-every-open-
| tab which is so common with page-sized pointless
| animations). Many problems can be solved if you want them
| to be.
| eternityforest wrote:
| How exactly will sites remember that you are logged in?
| And how would we have any web apps that aren't horrendous
| without JS?
|
| Also, where is this burn-a-core-for-every-open-tab stuff?
| Many websites are highly optimized and do not use much
| CPU. Not enough to be noticed without actually looking at
| the numbers anyway.
|
| What sites have page size animations these days?
| zasdffaa wrote:
| > How exactly will sites remember that you are logged in?
|
| I don't want them to. I log back in if necessary (browser
| remembers id/pswd). For those few I need to stay logged
| in, I use a VM and save the state - I'm more concerned
| about controlling JS than cookies in such cases.
|
| > And how would we have any web apps that aren't
| horrendous without JS?
|
| I don't use web apps. My tradeoff.
|
| > Also, where is this burn-a-core-for-every-open-tab
| stuff? Many websites are highly optimized and do not use
| much CPU.
|
| Oddly, it seems to be corporate bullshit sites that are
| the worst offenders. Can't find one but you're right,
| it's not all by any means. I retract.
| jstimpfle wrote:
| But you realize you're the oddball that considers the
| problem solved like that? I'm not sure that being a
| "hacker" means to straight out refuse things. You're
| missing out on a lot of fun and inspiring information
| (and yes, many many hours wasted to irrelevant content).
| zasdffaa wrote:
| You make your choices and I make mine. Should a person
| make the informed choice to immerse themselves in the web
| as-is with all its problems & risks, ok, but most people
| just pick the easy path then bitch after. I'm not one of
| them, and straight out refusal is in fact a viable option
| for me.
|
| If I do need anything more, there's VMs. BTW what 'fun
| and inspiring information' do you refer to? Shadertoy is
| a loss I grant, but what else?
| jstimpfle wrote:
| If you miss Shadertoy it won't be hard to imagine other
| similar things, of which there are plenty. Anything that
| requires interactivity beyond the one provided by HTML &
| CSS will obviously require Javascript. Any personalized
| experience (not only suggestions which yes are evil, but
| also personal storage) will obviously require cookies to
| function.
|
| Deleting Cookies on exit (and/or at regular intervals)
| will probably not help much in terms of avoiding
| tracking, especially if you log back in using your
| reinitialized cookies.
| zasdffaa wrote:
| > it won't be hard to imagine other similar things, of
| which there are plenty
|
| which again you don't give.
|
| > Anything that requires interactivity ... obviously
| require Javascript
|
| jeez, no shit, I get it.
|
| > (some defeatist blah about cookies)
|
| Whatever.
|
| You just persistently don't get it. These are my choices.
| I made them carefully. They suit me. They may not suit
| you. We could even compromise if you made an effort to
| see what I'm after but you won't/can't. Now please try to
| understand I'm not you, and just back off!
| Rygian wrote:
| Blaming the user-agent for accepting an abusive amount of
| cookies set by the website is outright bad faith.
|
| The only entity with any real power to decide which
| cookies the website uses is the website itself.
|
| Asking the user or the user agent to comb through cookies
| and decide, one by one, which ones seem marketing-related
| and which ones are technically required, and then block,
| is _way_ too much to ask from a regular internet user.
|
| I have tried, but fail to see good faith in your reply.
| grumbel wrote:
| The browser is the one who stores and sends cookies. It
| would be trivial to make that action explicit and only at
| the users request. That wouldn't even be a new feature,
| that used to be how things worked 20 years ago. Lynx is
| however the only browser left that I know that still asks
| you before storing cookies.
|
| You don't even have to sift through cookies for this to
| work, you can just reject all by default until the user
| explicitly requests them to be stored (or use a whitelist
| or wait until the user tries to log in, which would
| necessitate a cookie, etc.) Lots of possibilities.
|
| > is way too much to ask from a regular internet user.
|
| That's kind of the point. By making it all transparent
| and seamless, browser makers played into the hands of
| marketing companies. If cookies had a cost and degraded
| the user experience, sites might think twice before
| putting hundreds of them on a site.
|
| Marketing companies are just making use of the tools they
| are given. And browser manufacturers gave them a lot of
| tools, while taking control away from the user.
| zasdffaa wrote:
| Word. Tired of these "I don't want this but I won't spend
| any time or money on fixing it so someone else should do
| it" posts.
|
| Hint: it's under Tools|Preferences in firefox/palemoon
| Rygian wrote:
| No, it's not under "Tools|Preferences."
|
| There is no setting anywhere, in any web browser, to
| "retain cookies that are technically necessary and reject
| marketing cookies" which is the desirable behaviour.
| zasdffaa wrote:
| Define marketing cookie for me - do you mean 3rd party?
|
| (Some possible control via
| Tools|Preferences|Exceptions... button allows you to
| customise by website, although I've never used it. Or
| just disallow all, which is what I do)
|
| ---
|
| Edit: answer the question please, there may be an easy
| solution to what you want.
|
| Edit2: No reply because god forbid there's an actual way
| you could take control, that would simply ruin everything
| (in a parallel universe, man complains the streets are
| rife with face stabbing but when presented with proof
| they're not, stabs self in face to prove otherwise).
|
| Biggest problem with learned helplessness is that they
| like it that way. Gives them something to be angrily
| resentful about.
| rini17 wrote:
| Easy, enable only cookies for the things you want
| (maintain your session with 1st website, plus core
| functionality like payments). Everything else is a
| marketing cookie.
|
| I used umatrix for years but gave up. Guessing what
| to enable to get a site to work got tiresome, and IIRC
| there was also a problem with browser support.
| Rygian wrote:
| Definition of cookies I don't consent to: any cookie that
| is not mandatory for the site to technically work.
| zasdffaa wrote:
| You don't answer my question, then use a vague term of
| 'technically work' to ensure I can't give you useful info.
| tl;dr you don't want to be helped.
| matthewmacleod wrote:
| Blaming others for making legitimate complaints about
| pervasive bad practices is learned assholishness.
|
| We should all complain loudly and far more than we do
| about the creeping tendency of many companies to do so
| many obviously shitty things, instead of merely shrugging
| our shoulders.
| deathanatos wrote:
| Heh, so I actually do this.
|
| An _incredible_ amount of the web just breaks. Twitter,
| Reuters, Imgur. Like it's one thing if, when I attempt to
| log in, your log in fails (and usually, logins fail to handle
| the error & will just loop back to the start, that's at least
| a _start_) but a lot of the web will have a flash-of-text
| and then nothing, & JS has crashed.
| Aardwolf wrote:
| If only Windows, Java and JavaScript could also move away from
| internal usage of UTF-16, it's purely a legacy format and the
| worst of both worlds (UTF-32 and UTF-8). Even worse is that
| unicode itself, which should in theory be a list of codes for
| glyphs, modifiers and other script related values, that's
| independent of encoding, had to have some codes reserved for
| "surrogates" for the UTF-16 encoding anyway. UTF-8 doesn't need
| such a thing...
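For readers unfamiliar with the surrogate mechanism mentioned above, here is a minimal sketch of how UTF-16 represents a code point above U+FFFF, compared with the UTF-8 bytes for the same code point. The helper name `utf16_surrogates` is made up for this example; the arithmetic follows the standard UTF-16 definition.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    v = cp - 0x10000              # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

cp = 0x1F600  # an emoji outside the Basic Multilingual Plane
high, low = utf16_surrogates(cp)

# Python's own codecs agree with the manual computation.
utf16_be = chr(cp).encode("utf-16-be")
utf8 = chr(cp).encode("utf-8")

print(hex(high), hex(low))  # 0xd83d 0xde00
print(utf16_be.hex())       # d83dde00
print(utf8.hex())           # f09f9880
```

The values 0xD800 to 0xDFFF only work as pair halves because no character is ever assigned to them, which is the reservation the comment objects to. UTF-8 marks lead and continuation bytes inside the encoding itself and needs no reserved code points.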
| cryptonector wrote:
| Microsoft is making improvements in their UTF-8 support.
| Getting rid of the `W` APIs will take forever. Java and
| JavaScript are even more stuck with UTF-16.
| Aardwolf wrote:
| UTF-8 support for filenames would be a great start, to
| support windows filenames in a multiplatform way in C!
| cryptonector wrote:
| But what do you care how the file names are stored on
| disk, as long as you can read directories and traverse
| paths using UTF-8?
| layer8 wrote:
| Besides the surrogate characters there are also some other
| noncharacters:
| https://www.unicode.org/faq/private_use.html#noncharacters
|
| Because of modifier characters, control characters like for
| bidi, stuff like soft-hyphens and ligatures, locale-dependent
| semantics (upper/lowercase, collation etc.), the general
| discordance between glyphs and characters, and so on and so
| forth, Unicode is so complex, and in general always requires
| careful processing of code point (or code unit) sequences, that
| honestly the surrogate encoding doesn't make that much of a
| difference. It's just an additional wrinkle in a sea of
| wrinkles.
| Aardwolf wrote:
| I still find the surrogates different. Bidi, private use,
| ligatures, ... are script or locale related.
|
| Unicode uses numeric values from 0 to 1112063. You can invent
| all kinds of methods to encode numbers from 0 to 1112063
| (variable length, fixed length, decimal, hexadecimal,
| anything else). But most ways I can think of to encode these
| numbers, including variable length ones that would use 8 bit
| or 16 bit primitives, don't require me to actually reserve
| some of those to-be-encoded numbers themselves for a special
| meaning. Yet for UTF-16 they managed to do it. Imagine that
| all other encodings out there would also want to reserve some
| Unicode values for their own purpose!
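The number 1112063 quoted above lines up with the count of Unicode scalar values (1,112,064) rather than with the top of the code point range, which is U+10FFFF (1114111). The gap is exactly the surrogate block. A small sketch of the arithmetic, plus a check that Python's UTF-8 codec also refuses a lone surrogate:

```python
CODE_POINTS = 0x10FFFF + 1                # 1114112 code points, U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1          # 2048 values reserved for UTF-16 pairs
SCALAR_VALUES = CODE_POINTS - SURROGATES  # what any encoding may actually carry

# The reservation leaks into other encodings: UTF-8 must reject surrogates too.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False

print(SCALAR_VALUES)           # 1112064
print(lone_surrogate_encodes)  # False
```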
| layer8 wrote:
| You always have to work with sequences of code units anyway
| (instead of just single code points), so the individual
| reasons for that don't make much of a difference. It
| seems your rejection is more on aesthetic than on practical
| grounds.
| camgunz wrote:
| I have an old saw about UTF-16 not being an irredeemable format
| and UTF-8 eating the world being bad, and I'm happy to dig it
| out again.
|
| UTF-16 is great for lots of East Asian languages, which
| billions of people use. In UTF-8, most characters in those
| languages take 3 bytes per code point; in UTF-16 they
| almost always need only 2. This ends up being a huge savings.
|
| The main benefit of UTF-8 if you're say, Chinese, is interop.
| Everything else is worse.
|
| You might think "but BOMs are super evil." Checking a BOM is
| extremely, extremely easy. Furthermore, you don't get to bail
| out of checking anything just by using UTF-8, you have to check
| to ensure you have _valid_ UTF-8. That's right, you gotta scan
| the whole bytestream anyway, so you may as well just check the
| 2-byte BOM at the beginning too.
|
| You might also think "what about ASCII compatibility?" ASCII
| compatibility is an anti-feature. You should never be indexing
| into UTF strings (you always have to iterate, or save the
| results of an iteration), upper/lowercasing isn't
| addition/subtraction, etc. etc. You also can't just forget
| about encodings as a result--you can store ASCII in something
| expecting UTF-8, but you definitely can't store UTF-8 in
| something expecting ASCII. So if you're
| sniffing/decoding/tagging a format anyway, you may as well be
| agnostic.
|
| You might also think "OK OK, you could be right, but what about
| HTML, which is mostly ASCII and would nearly double in size if
| it went from UTF-8 to UTF-16." Practically all HTML is gzipped,
| so the difference is pretty small, plus the majority of text
| isn't HTML (almost anything stored in a database, almost
| anything in a file on your computer, etc.)
|
| Different encodings are good at different things. There's no
| one superior encoding for all uses. What we need is text
| encoding agnosticism.
|
| ---
|
| In fairness, I will say I've heard that UTF-8 is pretty popular
| in countries with exactly the kind of languages I'm talking
| about, so the issue is mostly moot at this point. I just think
| UTF-16 gets a really bad rap, and I think we shouldn't just
| gloss over UTF-8 having won because it's good for European
| languages.
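The size trade-off described in this comment can be checked directly. The sketch below compares byte counts for a short Chinese string, a short piece of ASCII markup, and one rarer Han character outside the Basic Multilingual Plane. The sample strings and the helper name `encoded_sizes` are chosen for illustration only.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM added."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

chinese = "\u4f60\u597d\u4e16\u754c"  # four common Han characters, all in the BMP
markup = "<p>hello</p>"               # twelve ASCII characters
rare_han = "\U00020000"               # one CJK Extension B character

chinese_sizes = encoded_sizes(chinese)
markup_sizes = encoded_sizes(markup)
rare_han_sizes = encoded_sizes(rare_han)

print(chinese_sizes)   # (12, 8)
print(markup_sizes)    # (12, 24)
print(rare_han_sizes)  # (4, 4)
```

Common CJK text is a third smaller in UTF-16, ASCII-heavy markup doubles, and characters outside the BMP cost four bytes either way. Real documents mix all three, so the net effect depends on the content.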
| JoshTriplett wrote:
| If you care about text size, you should compress your text;
| that'll save much more space, since it can optimize for
| what's actually used in the document.
|
| > ASCII compatibility is an anti-feature.
|
| ASCII compatibility is extremely useful if you're working
| with, for instance, filenames or programming languages. You
| can lex UTF-8 and handle separators like `/` or quotes like
| `"` and `'`, because those bytes can never occur otherwise.
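The lexing property described here can be sketched as follows. The path is a made-up example containing U+2F00 (KANGXI RADICAL ONE), picked because its UTF-16 code unit happens to contain the byte 0x2F, the same byte as `/`.

```python
path = "docs/\u2f00/notes.txt"  # hypothetical path with one non-ASCII component

utf8 = path.encode("utf-8")
utf16 = path.encode("utf-16-le")

# In UTF-8, every byte of a multi-byte sequence is >= 0x80, so 0x2F is always '/'.
utf8_slash_bytes = utf8.count(b"/")
parts = [p.decode("utf-8") for p in utf8.split(b"/")]

# In UTF-16, the byte 0x2F also shows up inside the code unit for U+2F00.
utf16_slash_bytes = utf16.count(b"/")

print(utf8_slash_bytes)   # 2
print(utf16_slash_bytes)  # 3
print(len(parts))         # 3
```

A byte-oriented tool can therefore split UTF-8 on `/` without decoding, while the same byte-level split on UTF-16 data would cut a character in half.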
| languageserver wrote:
| I am always extremely doubtful of these types of blogposts that
| take a well-known algorithm and somehow beat all others
| (including academia, bioinformatics tools, etc.) with a fancy
| implementation in <insert cool programming language 2022>
| poorlyknit wrote:
| (author here)
|
| I wrote this article during a short internship at Channable.
| Not to be apologetic but I think these kinds of articles are so
| prevalent because young or unpopular languages usually have
| worse documentation than established ones (naturally). I
| basically wrote down the things I learned during my internship
| that I found noteworthy.
| Nebasuke wrote:
| The article is about how they moved an existing (fast)
| implementation in Haskell in UTF-16 to an even faster
| implementation in Haskell by switching to UTF-8. This is stated
| in the first paragraph.
|
| The post they reference is also very honest: ..., the fastest
| Haskell implementation of the Aho-Corasick string searching
| algorithm, which powers string search in Channable.
|
| Basically the blog posts show that if you want to program in
| Haskell and still optimise, this is how you can do it. I think
| both posts are great resources and don't overstate their
| claims.
| danschuller wrote:
| I was taught Haskell at university and I'm old. Looking at its
| wiki page it's a 32 year old language not that much younger
| than 37 year old C++.
| crdrost wrote:
| Oh wow. That is really not very much pain, as described.
|
| I have to say, I never thought that the benefit of Haskell having
| a horrible native string type would be "you can just upgrade
| strings like any other dependency," which is really kinda slick.
| You think about how much pain there was for Py2 -> Py3 where one
| of the big sticking factors was all of the distinctions around
| strings and encoding and byte arrays... this is comparatively
| quite nice. Makes me wonder how much of a programming language
| can be hotswappable.
| resoluteteeth wrote:
| UTF-8 vs UTF-16 as the internal representation of the Unicode
| string type is mostly just an implementation detail.
|
| This is very different from going from Python 2, which conflated
| bytes and ASCII strings, to Python 3, which intentionally
| changed the API to properly distinguish sequences of bytes and
| strings.
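The API-level distinction mentioned here is visible in a few lines of Python 3. This is a small sketch with a made-up sample word, showing that bytes and text are separate types that must be converted explicitly.

```python
raw = "na\u00efve".encode("utf-8")  # bytes: the encoded form of the word
text = raw.decode("utf-8")          # str: a sequence of code points

# Python 3 refuses to mix the two types implicitly.
try:
    raw + text
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(len(raw))        # 6
print(len(text))       # 5
print(mixing_allowed)  # False
```

Which internal encoding `str` uses never shows up in this code, which is the sense in which UTF-8 versus UTF-16 is an implementation detail while the bytes/str split is an API change.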
___________________________________________________________________
(page generated 2022-04-27 23:01 UTC)