[HN Gopher] Lessons from Plain Text
___________________________________________________________________
Lessons from Plain Text
Author : kugurerdem
Score : 88 points
Date : 2024-10-10 10:00 UTC (3 days ago)
(HTM) web link (www.rugu.dev)
(TXT) w3m dump (www.rugu.dev)
| bediger4000 wrote:
| Greppability is an interesting idea, and a good one, but I'm
| going to disagree with the recommendation
|
| > Stop hard-wrapping and just use soft-wrapping,
|
| Grep for some pattern in soft-wrapped text and you get a lot of
| extraneous material.
|
| You also can't grep for things "at the beginning of the line",
| which is often an important indicator. When I did a lot of plain
| C programming, I would put function names at the start of a line,
| below their return type to make it easy to grep for a function
| definition, rather than just uses.
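|
| A minimal Python sketch of the same idea, assuming the C layout
| described above (return type on its own line, function name at the
| start of the next). The sample source and the helper names are made
| up for illustration.

```python
import re

# Hypothetical C source: return type on one line, name at column 0 below it.
C_SOURCE = """\
static int
parse_header(const char *buf)
{
    return buf[0] == '#';
}

int
main(void)
{
    return parse_header("#x");
}
"""

def find_definitions(source: str, name: str) -> list[int]:
    """Return 1-based line numbers where `name(` starts a line."""
    pattern = re.compile(rf"^{re.escape(name)}\(", re.MULTILINE)
    return [source.count("\n", 0, m.start()) + 1 for m in pattern.finditer(source)]

def find_all_mentions(source: str, name: str) -> list[int]:
    """Return 1-based line numbers of every line containing `name(`."""
    return [i for i, line in enumerate(source.splitlines(), 1) if f"{name}(" in line]

print(find_definitions(C_SOURCE, "parse_header"))   # the definition only
print(find_all_mentions(C_SOURCE, "parse_header"))  # the definition and the call
```

| The anchored pattern plays the role of `grep '^parse_header('`; it
| only works because line breaks in the file carry meaning.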
|
| Soft-wrapping also limits the use of diffability, a complement to
| grepability. You might correct a single letter in a misspelled
| word in a soft-wrapped paragraph. Do a "git diff" or equivalent
| and you'll get back a huge block of "changed" text. Useless.
| Short, hard wrapped lines make it easy to see diffs.
| kugurerdem wrote:
| Can you detail a bit more what you mean by extraneous material?
| Is it something like "you now _also_ need tools that can do
| soft-wrapping"? Even if that's the case, I think it is easier
| to wrap a text than to unwrap it (programmatically). So, if you
| need hard-wrapping, you can just do it.
|
| Wrapping is as simple as: `fold -s -w 80 input.txt`
|
| In my experience, unwrapping usually turns out to be harder. [1]
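|
| A rough Python sketch of that asymmetry. `textwrap.fill` stands in
| for `fold -s` (the two are not byte-for-byte equivalent), and
| `naive_unwrap` is a hypothetical helper that assumes blank lines
| separate paragraphs and every other newline is a wrap.

```python
import textwrap

def hard_wrap(text: str, width: int = 80) -> str:
    """Wrap each blank-line-separated paragraph to `width` columns."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)

def naive_unwrap(text: str) -> str:
    """Join the lines of each paragraph with single spaces.

    Lists, code, and quoted text break the "every newline is a wrap"
    assumption, which is where unwrapping gets hard.
    """
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)

prose = "one two three four five six seven eight nine ten\n\nsecond paragraph here"
wrapped = hard_wrap(prose, width=20)
assert naive_unwrap(wrapped) == prose  # plain prose round-trips

todo = "Shopping:\n- eggs\n- milk"
print(naive_unwrap(todo))  # the list collapses into a single line
```

| Wrapping needs no knowledge of the content, while unwrapping has to
| guess which newlines were intentional.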
|
| > You also can't grep for things "at the beginning of the
| line", which is often an important indicator. When I did a lot
| of plain C programming, I would put function names at the start
| of a line, below their return type to make it easy to grep for
| a function definition, rather than just uses.
|
| I see what you mean. But I don't think your approach conflicts
| with my recommendation for soft-wrapping. You can still soft-
| wrap regular text files while choosing to separate certain
| lines of code for clarity. What you're doing might not even be
| considered "hard-wrapping" in the typical sense--it's not like
| you're breaking a 240-character line into multiple lines.
| You're simply formatting the definition in a way that suits
| your style, and it's perfectly ok!
|
| For the last one, you can simply use `git diff --word-diff`.
| Also, platforms like GitHub already highlight word-based diffs,
| so it usually is very easy to spot the changes.
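|
| To illustrate what a word-based diff does, here is a small Python
| sketch built on `difflib`. It is not how git implements
| `--word-diff`; it only imitates the `[-old-]{+new+}` style of
| marking, and `word_diff` is a made-up name.

```python
import difflib

def word_diff(old: str, new: str) -> str:
    """Mark removed words as [-word-] and added words as {+word+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

old = "a long soft-wrapped paragraph with one mispelled word in it"
new = "a long soft-wrapped paragraph with one misspelled word in it"
print(word_diff(old, new))
```

| Because the comparison runs over words rather than lines, the length
| of the line containing the typo does not matter.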
|
| [1]. https://news.ycombinator.com/item?id=39227848
| bediger4000 wrote:
| Extraneous material in my example would be potentially the
| rest of a large paragraph if I change one word. The example
| would be the diffs of a "Word" doc: the whole paragraph shows
| as a diff. Folding after the "git diff" would just mean
| visually picking the diff out of a lot of text.
|
| I do a lot of Go programming these days, and there's a
| conventional format for code that ends up with a lot of hard
| wrapped lines, so my C example is just that, an example.
|
| Maybe Markdown would be a better example. When I edit
| markdown, I move around phrases, clauses and sentences. It's
| certainly possible to do this with a gigantic soft wrapped
| chunk of text, but it's much easier with one clause or even
| phrase per hard wrapped (at 74 characters or less) line.
| Grepability and diffability and even running text through sed
| or awk are easier. You're not relying on text coloring.
| Editing with vim is easier, it has commands to move the
| cursor to next word, previous paragraph etc etc.
|
| This is one of those things like tabs or spaces and byte
| order marks. We're unlikely to convince each other.
| MrJohz wrote:
| That doesn't sound like soft or hard wrapping, though, that
| sounds like semantic wrapping, which is a separate concept
| entirely. With semantic wrapping you put each sentence (or
| similar) on a new line, which helps with diffing. But if
| that sentence runs over e.g. 80 characters, you still need
| to decide whether you're going to hard wrap or soft wrap
| that sentence. And in the inverse direction, if you don't
| do semantic wrapping, you'll have similar issues with diffs
| regardless of whether you use hard wraps or soft wraps.
|
| So I think that's a good argument for doing semantic
| wrapping of code and text (I guess semantic wrapping for
| code is just not writing everything in one long line
| separated by semicolons), but once you've put in semantic
| line breaks, you still need to decide how to handle text
| that spans multiple lines.
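|
| A minimal sketch of semantic line breaks in Python, assuming a very
| naive sentence splitter (it breaks after ".", "!" or "?" when a
| capital letter follows, so abbreviations such as "e.g. Foo" would be
| split incorrectly). `semantic_wrap` is a hypothetical helper.

```python
import re

def semantic_wrap(paragraph: str) -> str:
    """Put each sentence on its own line, ignoring existing line breaks."""
    flat = " ".join(paragraph.split())
    # Split on the space that follows sentence-ending punctuation
    # and precedes a capital letter.
    sentences = re.split(r"(?<=[.!?]) (?=[A-Z])", flat)
    return "\n".join(sentences)

text = "Wrapping is a choice. Each sentence\ngets a line. Long ones stay long!"
print(semantic_wrap(text))
```

| Note that a sentence longer than the display width still comes out
| as one long line, so the hard-or-soft question remains for it.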
| a1369209993 wrote:
| > But if that sentence runs over e.g. 80 characters,
| > you still need to decide
| > whether you're going to hard wrap or soft wrap that sentence.
|
| No I don't. Semantic wrapping all the way.
| MrJohz wrote:
| > This is a sentence that includes the word
| > "Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphiokarabomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon"
| > in it.
| > How should it be wrapped semantically?
|
| This is a pathological case to demonstrate how semantic
| wrapping does not by itself solve the "hard vs soft"
| wrapping question. If the answer is that the word should
| remain as a single word, then you are using soft wraps
| (or no wraps at all). If the answer is that the word
| should be split into 80 character chunks, then you're
| using hard wraps.
| a1369209993 wrote:
| > How should it be wrapped semantically?
|
| I have no idea what the semantics of that word are, which
| is information that is required in order to perform
| semantic wrapping. (Inherently, since conveying such
| semantics is one of the major benefits of semantic
| wrapping.)
|
| However, you included embedded control characters (C2 AD
| aka 'SOFT HYPHEN'; below replaced with '-') that encode
| less semantic information than is necessary for _proper_
| semantic wrapping, but not none:
|
| Lopado-temacho-selacho-galeo-kranio-leipsano-drim-hypo-
| trimmato-silphio-karabo-melito-katakechy-meno-kichl-epi-
| kossypho-phatto-perister-alektryon-opte-kephallio-kigklo-
| peleio-lagoio-siraio-baphe-tragano-pterygon.
|
| Web browsers use that information to do poor-quality
| semantic wrapping automatically - _actual_ hard or
| soft[0] wrapping would produce something like:
| Lopadotemachoselachogaleokranioleipsanod-
| rimhypotrimmatosilphiokarabomelitokatake-
| chymenokichlepikossyphophattoperisterale-
| ktryonoptekephalliokigklopeleiolagoiosir-
| aiobaphetraganopterygon.
|
| Which looks like the following from a partly-semantically-
| aware perspective:
|
| Lopado-temacho-selacho-galeo-kranio-leipsano-d[BREAK]rim-
| hypo-trimmato-silphio-karabo-melito-katake[BREAK]chy-
| meno-kichl-epi-kossypho-phatto-perister-ale[BREAK]ktryon-
| opte-kephallio-kigklo-peleio-lagoio-sir[BREAK]aio-baphe-
| tragano-pterygon.
|
| The fact that you included soft hyphens rather concedes
| the point that hard and soft[0] wrapping is incorrect.
|
| 0: Or rather, _non-semantic_, which is what we're
| actually arguing over. Technically, semantic wrapping is
| a subset of hard wrapping, but it's a _specific_ subset
| that isn't what is expressed by just saying "hard
| wrapping". Kind of like how birds aren't what anyone
| means when they just say "dinosaurs".
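|
| The soft-hyphen behaviour described above can be sketched in Python.
| This is an illustrative greedy breaker, not what any browser
| actually implements; `wrap_at_soft_hyphens` is a made-up helper and
| the sample uses only the first few parts of the word.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, encoded as C2 AD in UTF-8

def wrap_at_soft_hyphens(word: str, width: int) -> list[str]:
    """Greedily break `word` only at soft hyphens, within `width` columns.

    A visible '-' is appended to every line except the last. A part
    longer than the width is left on an overlong line rather than split.
    """
    lines, current = [], ""
    for part in word.split(SOFT_HYPHEN):
        # +1 leaves room for the visible hyphen if a break follows.
        if current and len(current) + len(part) + 1 > width:
            lines.append(current + "-")
            current = part
        else:
            current += part
    lines.append(current)
    return lines

word = SOFT_HYPHEN.join(["Lopado", "temacho", "selacho", "galeo", "kranio"])
for line in wrap_at_soft_hyphens(word, width=16):
    print(line)
```

| A fixed-width splitter would instead cut every 16 characters
| regardless of where the parts begin and end.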
| card_zero wrote:
| Huh? Presumably "just use soft-wrapping" only applies to _long
| lines_. It isn't banning you from using linebreaks before
| function definitions. The next section is about linebreaks!
| Earlier on, it talks about lining things up with tabs! It can't
| be advising you to write entire programs on one line.
|
| But anyway, don't you have a code editor with a sidebar with
| function names, that you can click on to go to the definitions?
| Sounds like choosing to navigate via grep is the nature of the
| problem with grepability. And other search tools that aren't
| regex based can search for multi-line text. This isn't about
| plain text, it's about Vim. It's like saying "this farmer's
| field should be constructed differently because it isn't
| skateable".
| defanor wrote:
| Neither does soft wrap play nicely with preformatted texts
| (e.g., ASCII graphics, including diagrams and ornaments such as
| titles centered with spaces, as well as formatted code).
|
| And the support for soft-wrapping in tools varies: it may be
| completely unavailable, or just turned off by default, and
| generally unused in such a case.
|
| I think reflowable text enters the area of markup languages,
| rather than plain text.
| eviks wrote:
| > Soft-wrapping also limits the use of diffability
|
| there are better tools for that that show word-based diff
| instead of a huge block. There aren't such tools that can
| convert your hard-coded linebreaks back.
| Zamiel_Snawley wrote:
| Even with just regular git on the command line, there is an
| option for word-wise diffing, and hacky ways to get
| character-wise diffing.
| bruce511 wrote:
| The biggest problem with this "not the true text" issue is when
| coders encounter unicode.
|
| A lot of coders, especially those who have worked primarily in
| English-speaking countries, treat ASCII and UTF-8 as the same thing,
| because for their text the difference is invisible.
| They can go decades being oblivious to topics like encodings and
| mappings and display.
|
| So it can be surprising to them when they start dealing with
| European characters for the first time. They view the text in one
| place (like an editor, which treats the file as UTF-8) and then in
| another (their program), which treats the text as ASCII.
|
| It's hard to explain to them that "when I look at it" isn't a
| universal truth, it also matters how the "look at it program"
| chooses to interpret, and display, it.
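|
| A short Python illustration of the same bytes being "looked at" in
| different ways. The sample string is arbitrary; Latin-1 is used here
| only as an example of a wrong guess.

```python
text = "naïve café"
raw = text.encode("utf-8")      # the bytes actually stored on disk

print(raw.decode("utf-8"))      # what a UTF-8-aware editor shows
print(raw.decode("latin-1"))    # same bytes, different interpretation

try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("not ASCII:", err.reason)

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8,
# which is why the difference can stay invisible for years.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

| Nothing in the bytes themselves says which reading is the intended
| one; that information has to come from somewhere else.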
| rustybolt wrote:
| In some cases pretending everything is ASCII is the sane thing
| to do. With Unicode, sorting and case conversion are nigh
| impossible to do correctly. While there are algorithms for
| collating codepoints into (extended) grapheme clusters, there
| is still a lot of freedom, so while there are wrong ways to do
| it there is no canonical right way.
| poincaredisk wrote:
| > With Unicode, sorting and case conversion are nigh
| impossible to do correctly
|
| Surely you mean that sorting correctly is impossible without
| Unicode? Otherwise you would have to hardcode the rules of
| sorting strings correctly in my language (and all other
| languages) yourself.
|
| Unless your preferred solution is "close my eyes and pretend
| non-ASCII characters don't exist", then... I'm not a fan.
| samatman wrote:
| Sorting is impossible to do correctly without knowledge of
| the language in which the text is written, because the
| collation rules for symbols differ between languages.
| Unicode, of course, defines those collation rules, and
| UTF-8 sorts lexicographically using the same naive byte
| comparison which works for ASCII.
|
| Case conversion is similar except the default rules do a
| very good job in general. But still, there are a few
| language-specific quirks and, again, you do have to know
| what language is involved to get those right.
|
| I'm agreeing with you, to be clear, just adding that
| Unicode isn't always enough, but it does a decent job if
| you don't know the language in advance, and that it defines
| the correct rules if you do know that.
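|
| A small Python sketch of these points, using only the standard
| library. The word list is arbitrary. Language-aware collation would
| need an installed locale or a library implementing the Unicode
| Collation Algorithm, neither of which is assumed here.

```python
words = ["zebra", "Zulu", "apple", "éclair", "Äpfel"]

# Default str comparison is by code point: uppercase ASCII sorts before
# lowercase ASCII, and accented letters sort after 'z'.
by_code_point = sorted(words)
print(by_code_point)

# Comparing the UTF-8 bytes gives the same order as comparing code points.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_utf8_bytes == by_code_point

# Default case mapping is mostly fine but knows nothing about language.
print("straße".upper())  # the German sharp s expands to two letters
print("I".lower())       # gives 'i'; Turkish expects dotless 'ı' here
```

| The byte-order sort is consistent and cheap, but it is not the order
| a reader of any particular language would expect.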
| teddyh wrote:
| "Sorting is hard in other languages, so I would like to force
| everybody to only use the characters from my language, no
| characters from any other language. This will make it easy
| for me."
| elric wrote:
| The same is true for all aspects of I18N and L10N. From
| keyboard layouts to date formats. I've seen tools that expect
| US qwerty use hard-coded shortcuts that are impossible to type
| on different layouts.
|
| Something assumptions and asses.
| Evidlo wrote:
| In my typical coding style I always start a newline for function
| arguments so the tabs vs spaces argument has never mattered to
| me.
|
| The other way leads to a bunch of random indentation levels all
| over the file which has always looked ugly to me.
| deafpolygon wrote:
| > I don't know whether this is just due to first-mover advantage
| or not but it also looks like more projects use spaces over tabs.
| So what's the point of going against the tide where there does
| not seem to be a very powerful advantage anyway?
|
| Sure, _now_. But, there was a time when I was a young man in
| college (circa 1997) where professors and the industry would push
| Tabs as a standard. Shortly after, the tides changed and we were
| all using spaces.
|
| > Stop hard-wrapping and just use soft-wrapping
|
| Who cares about grep? I mean, aside from the OP and probably many
| on here. Wrapping is a task that should be left up to the viewing
| device/software. It can be made to be _responsive_, which hard-
| wrapping cannot be.
|
| > newline
|
| This really should be a solved issue by now. Both as users and by
| software.
| kugurerdem wrote:
| > Shortly after, the tides changed and we were all using
| spaces.
|
| Very interesting. Thanks for sharing this information! What do
| you think might have caused this though?
|
| > Who cares about grep?
|
| I do care. I find it much easier to work with a codebase that
| has logs and error messages that can be easily searched.
| Similarly, working on a blog with searchable text makes more
| sense to me. Before switching to soft-wrapping, I used hard-
| wrapping, and sometimes I would notice a typo or an issue in
| one of the essays. When I tried to quickly search for a nearby
| word, it wouldn't find it because the text had been hard-
| wrapped. I think it also makes it far easier for outsiders to
| navigate a repo which they are not familiar with.
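|
| The search failure described above is easy to reproduce, and one
| possible workaround is to let any whitespace stand in for a space.
| A minimal Python sketch; `phrase_pattern` is a hypothetical helper
| and the sample text is invented.

```python
import re

def phrase_pattern(phrase: str) -> re.Pattern:
    """Build a regex in which each space in `phrase` matches any run of
    whitespace, including a newline."""
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\s+".join(words))

hard_wrapped = "Before switching to soft-wrapping, I used\nhard wrapping for every essay."

# Plain substring search misses the phrase because of the line break.
print("used hard wrapping" in hard_wrapped)                              # False
# The whitespace-tolerant pattern finds it.
print(bool(phrase_pattern("used hard wrapping").search(hard_wrapped)))   # True
```

| This does not help when a hyphenated word is itself split across
| two lines, and line-oriented tools such as grep need extra options
| to match across a newline at all.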
|
| About the newline, I agree.
| SoftTalker wrote:
| > What do you think might have caused this though?
|
| The rise in web-browser-based code editors, where tab moves
| to the next field on the form instead of inserting a literal
| \t character.
| samatman wrote:
| grep -C 5 does the trick quite nicely I find. rg accepts the
| same flag.
| jiehong wrote:
| Those points really depend on context:
|
| - if it's code, you should be using an automatic code formatter
|   and that's it.
| - if you write prose, sure, soft wrap.
|
| If you grep, \s doesn't care about space vs tabs.
|
| Sadly, elastic tabs never caught on as the default [0].
|
| Maybe we would need something like a "semantic alignment" marker
| instead of using spaces for aligning things. Like beginning of
| function name, beginning of function argument, etc.
|
| [0]: https://nick-gravgaard.com/elastic-tabstops/
| eviks wrote:
| > you should be using an automatic code formatter
|
| unless you want better alignment similar to the example of
| vertical alignment, which automatic code formatters simply
| don't allow for since they're much more simplistic.
|
| > If you grep, \s doesn't care about space vs tabs.
|
| so now instead of regular easy-to-type space you have to do \s
| to search for a phrase?
| thomassmith65 wrote:
| In the end, what truly matters is whether the codebase is
| consistent--either using tabs or spaces throughout
|
| I use tabs for code indentation, but spaces for non-code
| indentation (eg: for ascii diagrams within comments).
|
| Anyone who has converted a lot of code, from different projects,
| from spaces to tabs will have noticed: the vast majority of code
| with spaces contains a few screwups where a line or two in a
| 4-spaced file actually contains 3 spaces.
|
| Why that happens, despite editors automatically converting tabs
| to spaces, is beyond me, but it is a ubiquitous phenomenon. I
| suspect this is the real reason some people, certainly myself,
| prefer tabs.
| GuB-42 wrote:
| I like the "tabs for indentation, spaces for alignment" style,
| but it is not always easy for formatting tools as there is no
| simple way to convert spaces to tabs, the tool has to be aware
| of the syntax somehow, or you need to do some manual tweaks.
|
| Screwups like missing or adding a space can happen easily even
| with auto-indent, a common cause is splitting or merging a
| line, i.e. replacing a space with a newline or vice versa. That
| space character has a tendency to end up where you don't want
| it, or conversely, get eaten up. When using tabs, invisible
| space characters can end up between tabs.
|
| In the end, on collaborative projects, I usually settle on 4
| space indentation, as it is the most common and from my
| experience, the least likely for people to screw up.
| edflsafoiewq wrote:
| Code with tabs inevitably ends up with X spaces where a tab
| should be, which goes unnoticed until viewed with a different
| tab width.
| fire_lake wrote:
| Could easily be fixed with CI
| poincaredisk wrote:
| Just like incorrect indentation using spaces.
|
| I personally run an autoformatter in all CI pipelines, and
| error out if it would change anything. This entirely kills the whole
| issue of wrong indentation/dangling spaces/accidental
| tabs/inconsistent formatting, etc.
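|
| For projects without a formatter, a CI step could at least flag the
| kinds of slip discussed in this subthread. A minimal Python sketch,
| assuming a spaces-only project with a fixed indent width;
| `indentation_problems` is a made-up helper, not part of any tool.

```python
import re

def indentation_problems(source: str, indent: int = 4) -> list[tuple[int, str]]:
    """Return (line number, reason) for suspicious leading whitespace.

    Continuation lines aligned to an opening bracket would be flagged
    too, which is one reason a real formatter is the more robust check.
    """
    problems = []
    for lineno, line in enumerate(source.splitlines(), 1):
        if not line.strip():
            continue  # ignore blank lines
        leading = re.match(r"[ \t]*", line).group()
        if "\t" in leading:
            problems.append((lineno, "tab in indentation"))
        elif len(leading) % indent != 0:
            problems.append((lineno, f"{len(leading)} spaces"))
    return problems

sample = "def f():\n    a = 1\n   b = 2\n\tc = 3\n"
for lineno, reason in indentation_problems(sample):
    print(f"line {lineno}: {reason}")
```

| In a pipeline, the script would exit with a non-zero status whenever
| the returned list is not empty.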
| thomassmith65 wrote:
| That's a valid point, though I think it's evitable (if that's
| a word) because many editors have visible white space, and
| it's easier for the eye to catch 4 dots after a long tab than
| to notice that an indentation is 7 dots instead of 8, etc. It
| also attracts more attention when using arrow keys to
| navigate the offending lines.
| samatman wrote:
| Bingo. And the only solution to this problem, consistent use
| of a formatter, solves the equivalent problem with spaced
| indentation just as well.
|
| I would prefer to have the spacing version of this problem,
| personally, because that way I can always see that there's a
| problem, and can do so without resorting to changing tab
| widths or making invisible characters visible.
| crazygringo wrote:
| > _a few screwups where a line or two in a 4-spaced file
| actually contains 3 spaces. Why this happens... is beyond me_
|
| In my experience it's usually from copy-paste, usually because
| the cursor wasn't at the right position when pasting. The
| cursor not being at the right position because you deleted some
| spaces to reduce the level of indentation before pasting, but
| didn't do the right number. While tab inserts the right number
| of spaces, delete still deletes spaces.
|
| Also occasionally due to a find-replace that accidentally
| included a leading space, which can be hard to see when the
| find/replace boxes are in a proportional font.
| norir wrote:
| I solve this problem at the language level. Instead of using an
| external formatter, indentation is enforced by the compiler and
| requires tabs. Significant indentation is used for multiline
| functions rather than block delimiters. On a given line, spaces
| may be used for alignment purposes but only after the first non
| whitespace token. This is no harder to parse than arbitrary
| whitespace between tokens and guarantees a uniform format for
| any valid program in the language.
|
| I know not everyone will agree with me, but I think defining
| whitespace in a language as essentially [ \t\n] between any
| token is a language design mistake.
| everybodyknows wrote:
| > ascii diagrams within comments
|
| As an aside, what tools do you use to produce the diagrams?
| ericyd wrote:
| On my team this happens due to individual laziness and I don't
| think tabs would solve that problem.
| eviks wrote:
| > This is also partly the reason why I use spaces most of the
| time. If you still end up adjusting the tab width to match
| others' preferences, what's the purpose of using tabs in the
| first place?
|
| Or if you used elastic tabstops, the purpose of using tabs would
| be that this alignment happens automatically on edits instead of
| you having to adjust the number of spaces manually
| keybored wrote:
| The niche for hard-wrapping is straightforward. Sending patches
| via email.
|
| In these types of communities there is no formal markup. So what
| is code and what is text? You can't tell. Some might use "code
| fence". Some might use four-space indents. Some might just dump
| code in between prose. And when you comment on a patch you
| comment directly on the diff.[1]
|
| You can't just let the email reader go to town on the text.
| That's fine for prose but annoying for code where every line
| break is either intentional or machine formatted.
|
| The author mentions the downside of browsing on a mobile device.
| Yeah I sometimes do that. But the primary mode for this kind of
| browsing might just be on a laptop/desktop. Certainly if you plan
| on doing some coding. (not just browsing the email archive for
| discussions that happened eight years ago... not that I would
| ever do that)
|
| [1] Maybe diffs are easy to parse out of a message since each
| line starts with `+`, `-` or a space. After you have peeled away
| the quoting.
| wavemode wrote:
| > Soft Wrapping vs Hard Wrapping
|
| This is actually one of HTML's most underrated features - there
| is no distinction between hard and soft wrapping. Any whitespace,
| of any form and quantity, between any two words is just converted
| to a single space in the rendered output.
|
| Thus the developer, in a code editor, is free to hard wrap and
| indent the text in whatever way makes the most visual sense.
| Meanwhile in the rendered output the actual wrapping that occurs
| (if any) is controlled by the stylesheet.
|
| I wish more programming languages had multiline string syntax
| that could do this (automatically remove all newlines and
| indentation). It turns out to be quite useful in a variety of
| domains.
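|
| In languages without such syntax, the effect can be approximated
| with a small helper. A Python sketch; `collapse_whitespace` is a
| made-up name, and it only roughly mirrors HTML's rules (for example,
| it also collapses non-breaking spaces, which HTML does not).

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs, and newlines to one space
    and trim both ends."""
    return " ".join(text.split())

message = collapse_whitespace("""
    This string is hard-wrapped and indented
    in the source file, but the value is
    a single line.
""")
print(message)
```

| The source stays readable at any indentation level while the value
| handed to the rest of the program has no embedded line breaks.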
| playingalong wrote:
| Useful? Yes.
|
| But then you need some way to provide the exact
| indentation/spacing in some cases. And the easiest is to
| provide them verbatim.
| SoftTalker wrote:
| HTML has the "pre" element that does this.
| jbaber wrote:
| pre does a lot more than respect newlines.
| arp242 wrote:
| "white-space:pre-line" in CSS should make it only break
| on hard enters. There are a bunch of values; see e.g.
| MDN.
|
| Can be used on any element of course, not just <pre>.
| bastawhiz wrote:
| No?
|
| https://searchfox.org/mozilla-
| central/source/layout/style/re...
|
| The Firefox default style sets a fixed width font and
| sets a small margin. What's "a lot more"?
| macintux wrote:
| A few years ago I was auto-generating HTML to ingest into an
| older version of Confluence (pretending it was Markdown).
| Confluence behaved differently (correctly) when I inserted hard
| line breaks between elements. Took a while to figure that one
| out.
| zokier wrote:
| Biggest thing about plain text to me is that it's not really a
| thing. Markup languages generally are not considered plain text,
| I would not consider code either as plain text (it's literally
| called _code_). What does that leave? Prose and similar writings
| maybe. But do you include control codes in plain text? I don't
| think something like ASCII bell can be thought as plain text. The
| various whitespace characters are tricky question... they
| arguably are formatting codes, so if we think plain text as
| unformatted text, then things like newlines don't really fit the
| picture; instead we should have some semantic markers like
| "paragraph separator". On the other hand if we allow plain text
| to include formatted text, then it opens a whole can of worms on
| different ways to represent formatting.
| poincaredisk wrote:
| We probably understand that term differently. Markdown and code
| is "plain text" in my opinion (just like JSON and csv and ini
| etc). Basically everything that can be easily edited using a
| regular text editor is plaintext. Alternatively, everything
| that can be checked into and diffed using git. I think this is
| the same understanding the author has.
|
| This rules out special characters and binary file formats, of
| course.
| zokier wrote:
| but code and prose are intrinsically very different, code has
| strictly defined structure and form in a way that free-form
| text or prose does not. code is generally tree-shaped, text
| is generally linear. and so on
| zelphirkalt wrote:
| A lot of code calls out to higher levels from lower levels.
| Does it still count as tree?
|
| Prose can also have strict structure.
| slmjkdbtl wrote:
| Question about hard wrapping: if you have a piece of hard-wrapped
| text, how do you easily add text in the middle? Or do you just have
| to re-wrap all the following lines by hand?
| SoftTalker wrote:
| One approach is you commit the change with the added text. This
| diff will clearly show the change.
|
| Then you rewrap and commit that, with a comment "rewrap" so the
| reader knows there is no material change.
| bediger4000 wrote:
| This can be handled by any decent text editor. In vim,
| highlight the problem area, then hit gq. Everything is
| rewrapped.
| breck wrote:
| I love everything about this post. Thank you. What a great way to
| start the day. (One nit: it would be cool to have a link to the
| source code for the post).
|
| Here's my user test:
| https://news.pub/?try=https://www.youtube.com/embed/rx7nv6R5...
| kugurerdem wrote:
| Hi Breck, it's a nice coincidence to see an author I recently
| discovered through breckyunits.com, now reviewing one of my
| essays :)
|
| You never know how paths might cross.
|
| I think the information you shared about the tabs is worth
| mentioning. I'll reference your video and the tabs info you
| provided in the addendum.
___________________________________________________________________
(page generated 2024-10-13 22:01 UTC)