[HN Gopher] Lessons from Plain Text
       ___________________________________________________________________
        
       Lessons from Plain Text
        
       Author : kugurerdem
       Score  : 88 points
       Date   : 2024-10-10 10:00 UTC (3 days ago)
        
 (HTM) web link (www.rugu.dev)
 (TXT) w3m dump (www.rugu.dev)
        
       | bediger4000 wrote:
       | Greppability is an interesting idea, and a good one, but I'm
       | going to disagree with the recommendation
       | 
       | > Stop hard-wrapping and just use soft-wrapping,
       | 
       | Grep for some pattern in soft-wrapped text and you get a lot of
       | extraneous material.
       | 
       | You also can't grep for things "at the beginning of the line",
       | which is often an important indicator. When I did a lot of plain
       | C programming, I would put function names at the start of a line,
       | below their return type to make it easy to grep for a function
       | definition, rather than just uses.
       | 
       | Soft-wrapping also limits the use of diffability, a complement to
       | grepability. You might correct a single letter in a misspelled
       | word in a soft-wrapped paragraph. Do a "git diff" or equivalent
       | and you'll get back a huge block of "changed" text. Useless.
       | Short, hard wrapped lines make it easy to see diffs.
        
         | kugurerdem wrote:
         | Can you detail a bit more what you mean by extraneous material?
         | Is it something like "you now _also_ need tools that can do
         | soft-wrapping"? Even if that's the case, I think it is easier
         | to wrap a text than to unwrap it (programmatically). So, if you
         | need hard-wrapping, you can just do it.
         | 
         | Wrapping is as simple as: `fold -s -w 80 input.txt`
         | 
         | Unwrapping usually turns out to be harder, in my
         | experience. [1]
         | 
         | > You also can't grep for things "at the beginning of the
         | line", which is often an important indicator. When I did a lot
         | of plain C programming, I would put function names at the start
         | of a line, below their return type to make it easy to grep for
         | a function definition, rather than just uses.
         | 
         | I see what you mean. But I don't think your approach conflicts
         | with my recommendation for soft-wrapping. You can still soft-
         | wrap regular text files while choosing to separate certain
         | lines of code for clarity. What you're doing might not even be
         | considered "hard-wrapping" in the typical sense--it's not like
         | you're breaking a 240-character line into multiple lines.
         | You're simply formatting the definition in a way that suits
         | your style, and it's perfectly ok!
         | 
         | For the last one, you can simply use `git diff --word-diff`.
         | Also, platforms like GitHub already highlight word-based diffs,
         | so it usually is very easy to spot the changes.
         | 
         | [1]. https://news.ycombinator.com/item?id=39227848
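
The wrap/unwrap asymmetry described above can be sketched in a few lines. This is a minimal illustration, not the commenter's tooling: it uses Python's `textwrap` as a stand-in for `fold` (the two do not behave identically), and the helper names `hard_wrap` and `naive_unwrap` are hypothetical. It assumes paragraphs are separated by blank lines.

```python
import textwrap


def hard_wrap(text: str, width: int = 80) -> str:
    """Wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


def naive_unwrap(text: str) -> str:
    """Join the lines of each paragraph with spaces.

    Assumption: every single newline was introduced by wrapping,
    which is exactly the guess that goes wrong for lists and code.
    """
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)


prose = " ".join(["word"] * 30)
wrapped = hard_wrap(prose, width=40)

# Plain prose survives the round trip.
assert naive_unwrap(wrapped) == prose

# A list has intentional line breaks that the unwrapper cannot tell
# apart from wrap-induced ones, so the items get merged.
listing = "Shopping:\n- eggs\n- milk"
print(naive_unwrap(listing))  # Shopping: - eggs - milk
```

Wrapping only needs a width; unwrapping needs to know which newlines were meaningful, and that information is not in the file.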
        
           | bediger4000 wrote:
           | Extraneous material in my example would be potentially the
           | rest of a large paragraph if I change one word. The example
           | would be the diffs of a "Word" doc: the whole paragraph shows
           | as a diff. Folding after the "git diff" would just mean
           | visually picking the diff out of a lot of text.
           | 
           | I do a lot of Go programming these days, and there's a
           | conventional format for code that ends up with a lot of hard
           | wrapped lines, so my C example is just that, an example.
           | 
           | Maybe Markdown would be a better example. When I edit
           | markdown, I move around phrases, clauses and sentences. It's
           | certainly possible to do this with a gigantic soft wrapped
           | chunk of text, but it's much easier with one clause or even
           | phrase per hard wrapped (at 74 characters or less) line.
           | Grepability and diffability and even running text through sed
           | or awk are easier. You're not relying on text coloring.
           | Editing with vim is easier, it has commands to move the
           | cursor to next word, previous paragraph etc etc.
           | 
           | This is one of those things like tabs or spaces and byte
           | order marks. We're unlikely to convince each other.
        
             | MrJohz wrote:
             | That doesn't sound like soft or hard wrapping, though, that
             | sounds like semantic wrapping, which is a separate concept
             | entirely. With semantic wrapping you put each sentence (or
             | similar) on a new line, which helps with diffing. But if
             | that sentence runs over e.g. 80 characters, you still need
             | to decide whether you're going to hard wrap or soft wrap
             | that sentence. And in the inverse direction, if you don't
             | do semantic wrapping, you'll have similar issues with diffs
             | regardless of whether you use hard wraps or soft wraps.
             | 
             | So I think that's a good argument for doing semantic
             | wrapping of code and text (I guess semantic wrapping for
             | code is just not writing everything in one long line
             | separated by semicolons), but once you've put in semantic
             | line breaks, you still need to decide how to handle text
             | that spans multiple lines.
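
A rough sketch of what "one sentence per line" could look like in code. The function name `semantic_wrap` and the sentence-splitting rule (a break after `.`, `!` or `?` followed by whitespace) are assumptions made for illustration; real prose needs a smarter splitter.

```python
import re


def semantic_wrap(paragraph: str) -> str:
    """Put each sentence on its own line.

    Naive rule (an assumption of this sketch): a sentence ends at
    '.', '!' or '?' followed by whitespace.
    """
    # Undo any existing wrapping first.
    text = " ".join(paragraph.split())
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return "\n".join(sentences)


print(semantic_wrap("First one. Second\none? Third!"))
# First one.
# Second one?
# Third!

# The naive rule also splits after abbreviations such as "e.g.":
print(semantic_wrap("It runs over e.g. 80 characters."))
# It runs over e.g.
# 80 characters.
```

Note that the output lines have no length limit, so the question of how to display a sentence longer than the window is still open after this step.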
        
               | a1369209993 wrote:
               | > But if that sentence runs over e.g. 80 characters,
               | > you still need to decide
               | > whether you're going to hard wrap or soft wrap that sentence.
               | 
               | No I don't. Semantic wrapping all the way.
        
               | MrJohz wrote:
               | > This is a sentence that includes the word "Lopadotemach
               | oselachogaleokranioleipsanodrimhypotrimmatosilphiokarabom
               | elitokatakechymenokichlepikossyphophattoperisteralektryon
               | optekephalliokigklopeleiolagoiosiraiobaphetraganopterygon
               | " in it.
               | > How should it be wrapped semantically?
               | 
               | This is a pathological case to demonstrate how semantic
               | wrapping does not by itself solve the "hard vs soft"
               | wrapping question. If the answer is that the word should
               | remain as a single word, then you are using soft wraps
               | (or no wraps at all). If the answer is that the word
               | should be split into 80 character chunks, then you're
               | using hard wraps.
        
               | a1369209993 wrote:
               | > How should it be wrapped semantically?
               | 
               | I have no idea what the semantics of that word are, which
               | is information that is required in order to perform
               | semantic wrapping. (Inherently, since conveying such
               | semantics is one of the major benefits of semantic
               | wrapping.)
               | 
               | However, you included embedded control characters (C2 AD
               | aka 'SOFT HYPHEN'; below replaced with '-') that encode
               | less semantic information than is necessary for _proper_
               | semantic wrapping, but not none:
               | 
               | Lopado-temacho-selacho-galeo-kranio-leipsano-drim-hypo-
               | trimmato-silphio-karabo-melito-katakechy-meno-kichl-epi-
               | kossypho-phatto-perister-alektryon-opte-kephallio-kigklo-
               | peleio-lagoio-siraio-baphe-tragano-pterygon.
               | 
               | Web browsers use that information to do poor-quality
               | semantic wrapping automatically - _actual_ hard or
               | soft[0] wrapping would produce something like:
               | Lopadotemachoselachogaleokranioleipsanod-
               | rimhypotrimmatosilphiokarabomelitokatake-
               | chymenokichlepikossyphophattoperisterale-
               | ktryonoptekephalliokigklopeleiolagoiosir-
               | aiobaphetraganopterygon.
               | 
               | Which looks like the following from a partly-semantically-
               | aware perspective:
               | 
               | Lopado-temacho-selacho-galeo-kranio-leipsano-d[BREAK]rim-
               | hypo-trimmato-silphio-karabo-melito-katake[BREAK]chy-
               | meno-kichl-epi-kossypho-phatto-perister-ale[BREAK]ktryon-
               | opte-kephallio-kigklo-peleio-lagoio-sir[BREAK]aio-baphe-
               | tragano-pterygon.
               | 
               | The fact that you included soft hyphens rather concedes
               | the point that hard and soft[0] wrapping is incorrect.
               | 
               | 0: Or rather, _non-semantic_, which is what we're
               | actually arguing over. Technically, semantic wrapping is
               | a subset of hard wrapping, but it's a _specific_ subset
               | that isn't what is expressed by just saying "hard
               | wrapping". Kind of like how birds aren't what anyone
               | means when they just say "dinosaurs".
        
         | card_zero wrote:
         | Huh? Presumably "just use soft-wrapping" only applies to _long
         | lines_. It isn't banning you from using linebreaks before
         | function definitions. The next section is about linebreaks!
         | Earlier on, it talks about lining things up with tabs! It can't
         | be advising you to write entire programs on one line.
         | 
         | But anyway, don't you have a code editor with a sidebar with
         | function names, that you can click on to go to the definitions?
         | Sounds like choosing to navigate via grep is the nature of the
         | problem with grepability. And other search tools that aren't
         | regex based can search for multi-line text. This isn't about
         | plain text, it's about Vim. It's like saying "this farmer's
         | field should be constructed differently because it isn't
         | skateable".
        
         | defanor wrote:
         | Neither does soft wrap play nicely with preformatted texts
         | (e.g., ASCII graphics, including diagrams and ornaments such as
         | titles centered with spaces, as well as formatted code).
         | 
         | And the support for soft-wrapping in tools varies: it may be
         | completely unavailable, or just turned off by default, and
         | generally unused in such a case.
         | 
         | I think reflowable text enters the area of markup languages,
         | rather than plain text.
        
         | eviks wrote:
         | > Soft-wrapping also limits the use of diffability
         | 
         | there are better tools for that that show word-based diff
         | instead of a huge block. There aren't such tools that can
         | convert your hard-coded linebreaks back.
        
           | Zamiel_Snawley wrote:
           | Even with just regular git on the command line, there is an
           | option for word-wise diffing, and hacky ways to get
           | character-wise diffing.
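
To make the line-diff versus word-diff contrast concrete, here is a small sketch using Python's standard `difflib`. It is not how `git diff --word-diff` is implemented; the helper names `line_diff` and `word_diff` are hypothetical, and "word" here just means whitespace-separated token.

```python
import difflib


def line_diff(old: str, new: str) -> list[str]:
    """Unified diff over lines: any edit marks the whole line as changed."""
    return list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )


def word_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Return (removed, added) word runs between two texts."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


old = "The quick brown fox jumps over the lazy dog and keeps runing through the field."
new = "The quick brown fox jumps over the lazy dog and keeps running through the field."

# The line diff repeats the entire soft-wrapped line twice.
for entry in line_diff(old, new):
    print(entry)

# The word diff isolates the one changed word.
print(word_diff(old, new))  # [('runing', 'running')]
```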
        
       | bruce511 wrote:
       | The biggest problem with this "not the true text" issue is when
       | coders encounter Unicode.
       | 
       | A lot of coders, those who have worked primarily in
       | English-speaking countries, see ASCII as UTF-8 and the difference
       | is invisible. They can go decades being oblivious to topics like
       | encodings and mappings and display.
       | 
       | So it can be surprising to them when they start dealing with
       | European characters for the first time. They view the text in one
       | place (like an editor, which treats the file as UTF-8) and then in
       | another (their program, which treats the text as ASCII).
       | 
       | It's hard to explain to them that "when I look at it" isn't a
       | universal truth, it also matters how the "look at it program"
       | chooses to interpret, and display, it.
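
A short illustration of the point that the same bytes "look" different depending on who interprets them. The function name `interpretations` is made up for this sketch; the codecs are standard Python ones.

```python
def interpretations(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under three different assumptions."""
    return {
        "utf-8": raw.decode("utf-8"),
        "latin-1": raw.decode("latin-1"),
        # ASCII cannot represent bytes >= 0x80; show them as U+FFFD.
        "ascii": raw.decode("ascii", errors="replace"),
    }


# Pure ASCII text is byte-for-byte identical in all three views,
# which is why the difference can stay invisible for years.
print(interpretations("cafe".encode("utf-8")))
# {'utf-8': 'cafe', 'latin-1': 'cafe', 'ascii': 'cafe'}

# One accented character is enough to make the views disagree.
print(interpretations("café".encode("utf-8")))
# {'utf-8': 'café', 'latin-1': 'cafÃ©', 'ascii': 'caf\ufffd\ufffd'}
```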
        
         | rustybolt wrote:
         | In some cases pretending everything is ASCII is the sane thing
         | to do. With Unicode, sorting and case conversion are nigh
         | impossible to do correctly. While there are algorithms for
         | collating codepoints into (extended) grapheme clusters, there
         | is still a lot of freedom, so while there are wrong ways to do
         | it there is no canonical right way.
        
           | poincaredisk wrote:
           | > With Unicode, sorting and case conversion are nigh
           | impossible to do correctly
           | 
           | Surely you mean that sorting correctly is impossible without
           | Unicode? Otherwise you would have to hardcode the rules of
           | sorting strings correctly in my language (and all other
           | languages) yourself.
           | 
           | Unless your preferred solution is "close my eyes and pretend
           | non-ASCII characters don't exist", then... I'm not a fan.
        
             | samatman wrote:
             | Sorting is impossible to do correctly without knowledge of
             | the language in which the text is written, because the
             | collation rules for symbols differ between languages.
             | Unicode, of course, defines those collation rules, and
             | UTF-8 sorts lexicographically using the same naive byte
             | comparison which works for ASCII.
             | 
             | Case conversion is similar except the default rules do a
             | very good job in general. But still, there are a few
             | language-specific quirks and, again, you do have to know
             | what language is involved to get those right.
             | 
             | I'm agreeing with you, to be clear, just adding that
             | Unicode isn't always enough, but it does a decent job if
             | you don't know the language in advance, and it defines
             | the correct rules if you do know it.
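
A small sketch of what the default, language-unaware behaviour looks like in practice. It only uses built-in Python string operations, which do not take a language or locale into account; proper language-specific collation needs a collation library, which is deliberately left out here.

```python
words = ["zebra", "Äpfel", "apple", "Zoo"]

# Sorting Python strings compares code points...
by_code_point = sorted(words)
# ...and sorting the UTF-8 encodings byte by byte gives the same order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_code_point == by_utf8_bytes

# That order is stable and cheap, but it is not what a dictionary in
# any particular language would use: uppercase sorts before lowercase,
# and 'Ä' lands after 'z'.
print(by_code_point)  # ['Zoo', 'apple', 'zebra', 'Äpfel']

# Default case conversion handles many cases well...
print("straße".upper())  # STRASSE

# ...but it cannot know the language. In Turkish the lowercase of 'I'
# is the dotless 'ı'; the default mapping always yields 'i'.
print("I".lower())  # i
```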
        
           | teddyh wrote:
           | "Sorting is hard in other languages, so I would like to force
           | everybody to only use the characters from my language, no
           | characters from any other language. This will make it easy
           | for me."
        
         | elric wrote:
         | The same is true for all aspects of I18N and L10N. From
         | keyboard layouts to date formats. I've seen tools that expect
         | US qwerty use hard-coded shortcuts that are impossible to type
         | on different layouts.
         | 
         | Something assumptions and asses.
        
       | Evidlo wrote:
       | In my typical coding style I always start a newline for function
       | arguments so the tabs vs spaces argument has never mattered to
       | me.
       | 
       | The other way leads to a bunch of random indentation levels all
       | over the file which has always looked ugly to me.
        
       | deafpolygon wrote:
       | > I don't know whether this is just due to first-mover advantage
       | or not but it also looks like more projects use spaces over tabs.
       | So what's the point of going against the tide where there does
       | not seem to be a very powerful advantage anyway?
       | 
       | Sure, _now_. But, there was a time when I was a young man in
       | college (circa 1997) where professors and the industry would push
       | Tabs as a standard. Shortly after, the tides changed and we were
       | all using spaces.
       | 
       | > Stop hard-wrapping and just use soft-wrapping
       | 
       | Who cares about grep? I mean, aside from the OP and probably many
       | on here. Wrapping is a task that should be left up to the viewing
       | device/software. It can be made to be _responsive_, which
       | hard-wrapping cannot be.
       | 
       | > newline
       | 
       | This really should be a solved issue by now, both for users and
       | for software.
        
         | kugurerdem wrote:
         | > Shortly after, the tides changed and we were all using
         | spaces.
         | 
         | Very interesting. Thanks for sharing this information! What do
         | you think might have caused this though?
         | 
         | > Who cares about grep?
         | 
         | I do care. I find it much easier to work with a codebase that
         | has logs and error messages that can be easily searched.
         | Similarly, working on a blog with searchable text makes more
         | sense to me. Before switching to soft-wrapping, I used hard-
         | wrapping, and sometimes I would notice a typo or an issue in
         | one of the essays. When I tried to quickly search for a nearby
         | word, it wouldn't find it because the text had been hard-
         | wrapped. I think it also makes it far easier for outsiders to
         | navigate a repo which they are not familiar with.
         | 
         | About the newline, I agree.
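
The search problem described above (a phrase that straddles a hard line break is not found by a literal search) can be worked around by treating any run of whitespace between words as equivalent. A minimal sketch, with a hypothetical helper name `contains_phrase`; it does not handle words that were hyphenated across lines.

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """True if `phrase` occurs in `text`, allowing any whitespace
    (including a newline from hard wrapping) between its words."""
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.search(pattern, text) is not None


wrapped = "the quick brown\nfox jumps over the lazy dog"

# A literal search misses the phrase because of the newline.
print("brown fox" in wrapped)                 # False
# The whitespace-tolerant search finds it.
print(contains_phrase(wrapped, "brown fox"))  # True
```

The cost is that every search has to go through such a pattern instead of a plain literal string.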
        
           | SoftTalker wrote:
           | > What do you think might have caused this though?
           | 
           | The rise in web-browser-based code editors, where tab moves
           | to the next field on the form instead of inserting a literal
           | \t character.
        
           | samatman wrote:
           | grep -C 5 does the trick quite nicely I find. rg accepts the
           | same flag.
        
       | jiehong wrote:
       | Those points really depend on context:
       | 
       | - if it's code, you should be using an automatic code formatter
       |   and that's it.
       | - if you write prose, sure, soft wrap.
       | 
       | If you grep, \s doesn't care about space vs tabs.
       | 
       | Sadly, elastic tabs never caught on as the default [0].
       | 
       | Maybe we would need something like a "semantic alignment" marker
       | instead of using spaces for aligning things. Like beginning of
       | function name, beginning of function argument, etc.
       | 
       | [0]: https://nick-gravgaard.com/elastic-tabstops/
        
         | eviks wrote:
         | > you should be using an automatic code formatter
         | 
         | unless you want better alignment similar to the example of
         | vertical alignment, which automatic code formatters simply
         | don't allow for since they're much more simplistic.
         | 
         | > If you grep, \s doesn't care about space vs tabs.
         | 
         | so now instead of regular easy-to-type space you have to do \s
         | to search for a phrase?
        
       | thomassmith65 wrote:
       | In the end, what truly matters is whether the codebase is
       | consistent--either using tabs or spaces throughout
       | 
       | I use tabs for code indentation, but spaces for non-code
       | indentation (eg: for ascii diagrams within comments).
       | 
       | Anyone who has converted a lot of code, from different projects,
       | from spaces to tabs will have noticed: the vast majority of code
       | with spaces contains a few screwups where a line or two in a
       | 4-spaced file actually contains 3 spaces.
       | 
       | Why that happens, despite editors automatically converting tabs
       | to spaces, is beyond me, but it is a ubiquitous phenomenon. I
       | suspect this is the real reason some people, certainly myself,
       | prefer tabs.
        
         | GuB-42 wrote:
         | I like the "tabs for indentation, spaces for alignment" style,
         | but it is not always easy for formatting tools as there is no
         | simple way to convert spaces to tabs, the tool has to be aware
         | of the syntax somehow, or you need to do some manual tweaks.
         | 
         | Screwups like missing or adding a space can happen easily even
         | with auto-indent; a common cause is splitting or merging a
         | line, i.e. replacing a space with a newline or vice versa. That
         | space character has a tendency to end up where you don't want
         | it, or conversely, get eaten up. When using tabs, invisible
         | space characters can end up between tabs.
         | 
         | In the end, on collaborative projects, I usually settle on 4
         | space indentation, as it is the most common and from my
         | experience, the least likely for people to screw up.
        
         | edflsafoiewq wrote:
         | Code with tabs inevitably ends up with X spaces where a tab
         | should be, which goes unnoticed until viewed with a different
         | tab width.
        
           | fire_lake wrote:
           | Could easily be fixed with CI
        
             | poincaredisk wrote:
             | Just like incorrect indentation using spaces.
             | 
             | I personally run an autoformatter in all CI pipelines, and
             | error out if it changes anything. This entirely kills the
             | whole issue of wrong indentation/dangling spaces/accidental
             | tabs/inconsistent formatting, etc.
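
For projects without a formatter, a much cruder CI check can still catch the two mistakes discussed in this subthread: stray spaces among tabs, and space indentation that is off by one. The sketch below is an assumption-laden illustration, not a replacement for a real formatter: the function name `indentation_problems` is hypothetical, and the "multiple of `indent_width`" rule would wrongly flag lines that use spaces for alignment.

```python
def indentation_problems(source: str, indent_width: int = 4) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for suspicious leading whitespace."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # ignore blank or whitespace-only lines
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading and " " in leading:
            problems.append((number, "mixed tabs and spaces"))
        elif "\t" not in leading and len(leading) % indent_width != 0:
            problems.append(
                (number, f"{len(leading)} spaces is not a multiple of {indent_width}")
            )
    return problems


sample = "def f():\n    a = 1\n   b = 2\n\t c = 3\n"
for number, message in indentation_problems(sample):
    print(f"line {number}: {message}")
# line 3: 3 spaces is not a multiple of 4
# line 4: mixed tabs and spaces
```

A CI job could exit non-zero whenever the returned list is non-empty.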
        
           | thomassmith65 wrote:
           | That's a valid point, though I think it's evitable (if that's
           | a word) because many editors have visible white space, and
           | it's easier for the eye to catch 4 dots after a long tab than
           | to notice that an indentation is 7 dots instead of 8, etc. It
           | also attracts more attention when using arrow keys to
           | navigate the offending lines.
        
           | samatman wrote:
           | Bingo. And the only solution to this problem, consistent use
           | of a formatter, solves the equivalent problem with spaced
           | indentation just as well.
           | 
           | I would prefer to have the spacing version of this problem,
           | personally, because that way I can always see that there's a
           | problem, and can do so without resorting to changing tab
           | widths or making invisible characters visible.
        
         | crazygringo wrote:
         | > _a few screwups where a line or two in a 4-spaced file
         | actually contains 3 spaces. Why this happens... is beyond me_
         | 
         | In my experience it's usually from copy-paste, usually because
         | the cursor wasn't at the right position when pasting. The
         | cursor not being at the right position because you deleted some
         | spaces to reduce the level of indentation before pasting, but
         | didn't do the right number. While tab inserts the right number
         | of spaces, delete still deletes spaces.
         | 
         | Also occasionally due to a find-replace that accidentally
         | included a leading space, which can be hard to see when the
         | find/replace boxes are in a proportional font.
        
         | norir wrote:
         | I solve this problem at the language level. Instead of using an
         | external formatter, indentation is enforced by the compiler and
         | requires tabs. Significant indentation is used for multiline
         | functions rather than block delimiters. On a given line, spaces
         | may be used for alignment purposes but only after the first non
         | whitespace token. This is no harder to parse than arbitrary
         | whitespace between tokens and guarantees a uniform format for
         | any valid program in the language.
         | 
         | I know not everyone will agree with me, but I think defining
         | whitespace in a language as essentially [ \t\n] between any
         | token is a language design mistake.
        
         | everybodyknows wrote:
         | > ascii diagrams within comments
         | 
         | As an aside, what tools do you use to produce the diagrams?
        
         | ericyd wrote:
         | On my team this happens due to individual laziness and I don't
         | think tabs would solve that problem.
        
       | eviks wrote:
       | > This is also partly the reason why I use spaces most of the
       | time. If you still end up adjusting the tab width to match
       | others' preferences, what's the purpose of using tabs in the
       | first place?
       | 
       | Or if you used elastic tabstops, the purpose of using tabs would
       | be that this alignment happens automatically on edits instead of
       | you having to adjust the number of spaces manually
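
A deliberately simplified sketch of the idea behind elastic tabstops: each tab ends a cell, and a column is as wide as its widest cell, so alignment follows from the content rather than from hand-counted spaces. The real proposal sizes columns per block of adjacent lines and lives in the editor's rendering; this version (hypothetical name `align_cells`) treats all given lines as one block, expects a non-empty list, and emits spaces.

```python
def align_cells(lines: list[str], padding: int = 2) -> list[str]:
    """Render tab-separated cells as space-aligned columns."""
    rows = [line.split("\t") for line in lines]
    column_count = max(len(row) for row in rows)
    widths = [0] * column_count
    for row in rows:
        # The last cell on a line is not tab-terminated, so it does
        # not take part in sizing a column.
        for index, cell in enumerate(row[:-1]):
            widths[index] = max(widths[index], len(cell))
    rendered = []
    for row in rows:
        cells = [
            cell.ljust(widths[index] + padding)
            for index, cell in enumerate(row[:-1])
        ]
        rendered.append("".join(cells) + row[-1])
    return rendered


for line in align_cells(["x\t= 1", "longer\t= 2"]):
    print(line)
# x       = 1
# longer  = 2
```

Renaming `longer` to something shorter or longer would re-align both lines without anyone touching the first one.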
        
       | keybored wrote:
       | The niche for hard-wrapping is straightforward. Sending patches
       | via email.
       | 
       | In these types of communities there is no formal markup. So what
       | is code and what is text? You can't tell. Some might use "code
       | fence". Some might use four-space indents. Some might just dump
       | code in between prose. And when you comment on a patch you
       | comment directly on the diff.[1]
       | 
       | You can't just let the email reader go to town on the text.
       | That's fine for prose but annoying for code where every line
       | break is either intentional or machine formatted.
       | 
       | The author mentions the downside of browsing on a mobile device.
       | Yeah I sometimes do that. But the primary mode for this kind of
       | browsing might just be on a laptop/desktop. Certainly if you plan
       | on doing some coding. (not just browsing the email archive for
       | discussions that happened eight years ago... not that I would
       | ever do that)
       | 
       | [1] Maybe diffs are easy to parse out of a message since each
       | line starts with `+`, `-` or a space. After you have peeled away
       | the quoting.
        
       | wavemode wrote:
       | > Soft Wrapping vs Hard Wrapping
       | 
       | This is actually one of HTML's most underrated features - there
       | is no distinction between hard and soft wrapping. Any whitespace,
       | of any form and quantity, between any two words is just converted
       | to a single space in the rendered output.
       | 
       | Thus the developer, in a code editor, is free to hard wrap and
       | indent the text in whatever way makes the most visual sense.
       | Meanwhile in the rendered output the actual wrapping that occurs
       | (if any) is controlled by the stylesheet.
       | 
       | I wish more programming languages had multiline string syntax
       | that could do this (automatically remove all newlines and
       | indentation). It turns out to be quite useful in a variety of
       | domains.
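
In languages without such a string syntax, the HTML-style behaviour can be approximated with a small helper. A minimal sketch; `collapse_whitespace` is a hypothetical name, and unlike HTML it also drops leading and trailing whitespace entirely.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs and newlines with one space."""
    return " ".join(text.split())


message = collapse_whitespace(
    """
    This sentence is hard wrapped
    and indented in the source,
        but comes out as one line.
    """
)
print(message)
# This sentence is hard wrapped and indented in the source, but comes out as one line.
```

The source can then be wrapped and indented for the person editing it, while the string that reaches the user carries no trace of that layout.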
        
         | playingalong wrote:
         | Useful? Yes.
         | 
         | But then you need some way to provide the exact
         | indentation/spacing in some cases. And the easiest is to
         | provide them verbatim.
        
           | SoftTalker wrote:
           | HTML has the "pre" element that does this.
        
             | jbaber wrote:
             | pre does a lot more than respect newlines.
        
               | arp242 wrote:
               | "white-space:pre-line" in CSS should make it only break
               | on hard enters. There are a bunch of values; see e.g.
               | MDN.
               | 
               | Can be used on any element of course, not just <pre>.
        
               | bastawhiz wrote:
               | No?
               | 
               | https://searchfox.org/mozilla-
               | central/source/layout/style/re...
               | 
               | The Firefox default style sets a fixed width font and
               | sets a small margin. What's "a lot more"?
        
         | macintux wrote:
         | A few years ago I was auto-generating HTML to ingest into an
         | older version of Confluence (pretending it was Markdown).
         | Confluence behaved differently (correctly) when I inserted hard
         | line breaks between elements. Took a while to figure that one
         | out.
        
       | zokier wrote:
       | Biggest thing about plain text to me is that it's not really a
       | thing. Markup languages generally are not considered plain text,
       | I would not consider code either as plain text (it's literally
       | called _code_ ). What does that leave? Prose and similar writings
       | maybe. But do you include control codes in plain text? I don't
       | think something like ASCII bell can be thought as plain text. The
       | various whitespace characters are tricky question... they
       | arguably are formatting codes, so if we think plain text as
       | unformatted text, then things like newlines don't really fit the
       | picture; instead we should have some semantic markers like
       | "paragraph separator". On the other hand if we allow plain text
       | to include formatted text, then it opens a whole can of worms on
       | different ways to represent formatting.
        
         | poincaredisk wrote:
         | We probably understand that term differently. Markdown and code
         | is "plain text" in my opinion (just like JSON and csv and ini
         | etc). Basically everything that can be easily edited using a
         | regular text editor is plaintext. Alternatively, everything
         | that can be checked into and diffed using git. I think this is
         | the same understanding the author has.
         | 
         | This rules out special characters and binary file formats, of
         | course.
        
           | zokier wrote:
           | but code and prose are intrinsically very different, code has
           | strictly defined structure and form in a way that free-form
           | text or prose does not. code is generally tree-shaped, text
           | is generally linear. and so on
        
             | zelphirkalt wrote:
             | A lot of code calls out to higher levels from lower levels.
             | Does it still count as tree?
             | 
             | Prose can also have strict structure.
        
       | slmjkdbtl wrote:
       | Question about hard wrapping: If you have a piece of
       | hard-wrapped text, how do you easily add text in the middle? Or do
       | you just have to re-wrap all the following lines by hand?
        
         | SoftTalker wrote:
         | One approach is you commit the change with the added text. This
         | diff will clearly show the change.
         | 
         | Then you rewrap and commit that, with a comment "rewrap" so the
         | reader knows there is no material change.
        
         | bediger4000 wrote:
         | This can be handled by any decent text editor. In vim,
         | highlight the problem area, then hit gq. Everything is
         | rewrapped.
        
       | breck wrote:
       | I love everything about this post. Thank you. What a great way to
       | start the day. (One nit: it would be cool to have a link to the
       | source code for the post).
       | 
       | Here's my user test:
       | https://news.pub/?try=https://www.youtube.com/embed/rx7nv6R5...
        
         | kugurerdem wrote:
         | Hi Breck, it's a nice coincidence to see an author I recently
         | discovered through breckyunits.com, now reviewing one of my
         | essays :)
         | 
         | You never know how paths might cross.
         | 
         | I think the information you shared about the tabs is worth
         | mentioning. I'll reference your video and the tabs info you
         | provided in the addendum.
        
       ___________________________________________________________________
       (page generated 2024-10-13 22:01 UTC)