[HN Gopher] What if Git worked with programming languages?
       ___________________________________________________________________
        
       What if Git worked with programming languages?
        
       Author : LukeEF
       Score  : 138 points
       Date   : 2021-09-27 13:26 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | iso8859-1 wrote:
       | This is on the http://lamdu.org roadmap.
        
       | kazinator wrote:
       | The blogger does not understand Git, fundamentally.
       | 
       | Git does not work with text. It stores snapshots of
       | artifacts.
       | 
       | The diffs that you see when you use the various commands like git
       | log -p are recovered from the snapshots, when those artifacts
       | happen to be text files.
       | 
       | Git absolutely works with texts when you connect it with external
       | representations and tooling, such as when you "git format-patch"
       | and then "git am" to import that; and the rebasing workflows
       | obviously have textual merging with conflict resolution. Still,
       | that seems like something that could be externalized. A language-
       | specific three-way-diff tool can handle a merge by parsing all
       | three pieces and working with ASTs. It's something that could be
       | developed later, yet still work with your old commits.
       | 
       | There is this: https://git-scm.com/docs/git-mergetool
       | 
       | No idea how well it works.
        
       | olodus wrote:
       | Ever since I learned about Git merge strategies and wrote a very
       | basic one myself, I've been wanting to write one that
       | syntactically understands a bit of the test framework code we use
       | at work. It is super annoying when you copy a test because you
       | want to vary a very specific case and git gets all confused about
       | what code is and isn't the same.
       | 
       | (yeah I know I should break out the copied part but who always
       | has time for that)
        
       | ghoward wrote:
       | I'm actually working on a VCS based on this idea and on tracking
       | changes to binary files based on their structure as well. (It
       | turns out that the same techniques work for both.)
       | 
       | AMA and please give me feedback!
        
       | jakeinspace wrote:
       | I think the only useful way to implement AST-level diff/merge for
       | non-trivial codebases would require the compiler to provide the
       | parsed AST, since per-file ASTs would lack a lot of context. You
       | could also ask the user to provide a separate file or files that
       | describe the code topology, but why bother when the compiler can
       | spit out an AST itself? A diff tool which targets a few of the
       | bigger build systems (CMake, Maven, Gradle) and compilers might
       | work, and could worry about small build environments after
       | gaining momentum.
        
       | kitplummer wrote:
       | I'm just a bit more "generally" curious. Is `git` being the
       | _only_ DVCS a good thing? Not to say that `hg` or `darcs` don't
       | exist, just that the hub on top of git has pushed us in a
       | singular direction.
       | 
       | I would like to see, at least academically, something more.
        
         | tomphoolery wrote:
         | The choice of DVCS tooling is ancillary to the success of
         | GitHub. People learned Git so they could use GitHub, not the
         | other way around. At least, this is the way I remember it.
         | 
         | If someone comes along and builds the best forge software ever,
         | but uses Mercurial instead of Git, I'll bet a lot of people
         | would switch technologies at some point. Until then, I'd say
         | most people use GitHub because GitHub works for them, and they
         | use Git because that's how you interact with GitHub. They don't
         | care about the ivory tower benefits of their particular DVCS
         | tooling, all they care about is easily collaborating with their
         | teammates.
         | 
         | It would definitely be great if you could have a GitHub-like
         | experience using Mercurial or Darcs, but so far I haven't seen
         | anything close to that.
        
           | aayjaychan wrote:
           | Does GitLab count as a GitHub-like experience? There is a
           | fork of GitLab called Heptapod that supports Mercurial.
           | 
           | https://heptapod.net/
        
       | tomxor wrote:
       | > The fact that git works on lines of text [...] we could be
       | looking at the alterations to the abstract syntax tree.
       | 
       | Fundamentally git does not operate on text, it operates on files
       | (content addressed SCM not a ledger of text diffs); diffs are
       | generated upon request between arbitrary Merkle trees. So there
       | is no need to implicate git in such a tool, it can be
       | independent:
       | 
       |     GIT_EXTERNAL_DIFF
       |         When the environment variable GIT_EXTERNAL_DIFF is set,
       |         the program named by it is called to generate diffs, and
       |         Git does not use its builtin diff machinery. For a path
       |         that is added, removed, or modified, GIT_EXTERNAL_DIFF is
       |         called with 7 parameters:
       | 
       |         path old-file old-hex old-mode new-file new-hex new-mode
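
A minimal sketch of such an external diff program, assuming the tracked
files are Python sources. It takes the seven parameters quoted above and
only reports whether the parsed AST changed. The names `same_ast` and
`main` are illustrative and not part of Git.

```python
import ast
import sys


def same_ast(old_source, new_source):
    """Return True when both sources parse to the same Python AST.

    Comments and whitespace are not part of the AST, so edits that touch
    only those compare as equal. Docstrings are part of the AST.
    """
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


def main(argv):
    # Git passes: path old-file old-hex old-mode new-file new-hex new-mode
    path, old_file, _old_hex, _old_mode, new_file, _new_hex, _new_mode = argv
    with open(old_file, encoding="utf-8") as f:
        old_source = f.read()
    with open(new_file, encoding="utf-8") as f:
        new_source = f.read()
    try:
        unchanged = same_ast(old_source, new_source)
    except SyntaxError:
        # Work-in-progress code may not parse; say so instead of failing.
        print(f"{path}: not parseable as Python, AST comparison skipped")
        return 0
    verdict = "formatting/comments only" if unchanged else "AST changed"
    print(f"{path}: {verdict}")
    return 0


# Only run as a diff program when called with exactly seven arguments.
if __name__ == "__main__" and len(sys.argv) == 8:
    sys.exit(main(sys.argv[1:]))
```

Saved as an executable script, it could be tried with something like
`GIT_EXTERNAL_DIFF=/path/to/script git diff`, where the path is a
placeholder.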
        
       | tombert wrote:
       | I would definitely support a Lisp-centric Git.
       | 
       | Whenever I do Clojure, something that can get difficult when
       | working with multiple people is how the
       | parentheses/brackets/braces stack up, especially when everyone
       | seems to have different opinions on how that works. As a result,
       | if you're not careful, when there's a merge conflict you can have
       | a ton of extra parentheses, which can be irritating to debug.
       | 
       | Obviously this is at some level an issue inherent to Lisps (and
       | to be clear, I love Lisps, and these small headaches are worth
       | it), but I think problems like that could be reduced if our
       | source controls were aware of the ASTs.
        
         | timgilbert wrote:
         | Yeah, I've long thought a diff tool that works on s-exprs
         | instead of lines would be invaluable for Lisp programming. It
         | doesn't seem like it would be too hard to write, either,
         | although getting GitHub etc to use it seems like it would be
         | its own challenge...
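
A rough sketch of what an s-expression diff might look like, assuming a
simplified reader that only knows parentheses, brackets, braces, strings
and `;` comments. It compares top-level forms rather than lines. All
function names here are hypothetical.

```python
import difflib
import re

# Brackets, string literals, ';' comments, or bare atoms.
TOKEN = re.compile(r'[()\[\]{}]|"(?:\\.|[^"\\])*"|;[^\n]*|[^\s()\[\]{}";]+')
CLOSERS = {"(": ")", "[": "]", "{": "}"}


def tokenize(text):
    """Split source into tokens, dropping whitespace and comments."""
    return [tok for tok in TOKEN.findall(text) if not tok.startswith(";")]


def parse(tokens):
    """Turn a token list into nested lists, one entry per top-level form."""
    stack = [[]]
    for tok in tokens:
        if tok in CLOSERS:
            stack.append([tok])  # remember which bracket opened the form
        elif tok in CLOSERS.values():
            if len(stack) == 1:
                raise ValueError(f"unbalanced closer {tok!r}")
            form = stack.pop()
            if CLOSERS[form[0]] != tok:
                raise ValueError(f"{form[0]!r} closed by {tok!r}")
            stack[-1].append(form)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed form")
    return stack[0]


def diff_forms(old_text, new_text):
    """List the top-level forms that differ, ignoring layout and comments."""
    old = [repr(form) for form in parse(tokenize(old_text))]
    new = [repr(form) for form in parse(tokenize(new_text))]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Because the parser rejects unbalanced input, the same code would also flag
the stray parentheses that a textual merge can leave behind.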
        
       | CodeIsTheEnd wrote:
       | I don't understand why GitHub hasn't solved the issue of diffs
       | starting with a '}' (or ')' or 'end'). Just slide the diff over
       | while it starts with a closing token! I suppose it's an artifact
       | of the diffing algorithm, but aren't there better diffing
       | algorithms, even built-in within git?
       | 
       | This is by far the most obvious example of "git doesn't
       | understand programming languages", but it also seems like the
       | most straightforward to fix.
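
The "slide the diff over" idea can be sketched as follows. The function
name and the list of closing tokens are assumptions for illustration. An
inserted block can move down one line whenever its first line equals the
line just after it, because the file produced by applying the diff stays
the same.

```python
def slide_insertion(lines, start, end, closers=("}", ")", "end")):
    """Slide the inserted block lines[start:end] down past closing tokens.

    `lines` is the new file; the block is the range a diff marked as added.
    Moving the block down by one is valid when lines[start] == lines[end],
    since removing either range leaves the same old file.
    """
    while (
        start < end < len(lines)
        and lines[start].strip() in closers
        and lines[start] == lines[end]
    ):
        start += 1
        end += 1
    return start, end


new_file = ["void f() {", "}", "", "void g() {", "}"]
# A diff that reported lines 1..3 ("}", "", "void g() {") as added
# now reports lines 2..4 ("", "void g() {", "}") instead.
shifted = slide_insertion(new_file, 1, 4)
```

Git itself has a related option, `--indent-heuristic`, which shifts hunk
boundaries based on indentation rather than on a list of tokens.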
        
         | nemetroid wrote:
         | Git supports a few different diff algorithms. GitHub only seems
         | to support the standard Myers algorithm, though:
         | https://github.com/isaacs/github/issues/455
        
         | mynegation wrote:
         | It is because diff is syntax agnostic. You might be able to get
         | away with this hack in some cases but that complicates the
         | algorithm and will break in some other cases (how about nested
         | brackets? Multiple brackets on one line?). Once you want to
         | handle this properly you need a syntax-aware diff algorithm, and
         | some resources are linked in this discussion.
        
       | lamontcg wrote:
       | This has been posted before
        
       | mangecoeur wrote:
       | Interesting they mentioned Jupyter Notebooks but not NBDime
       | https://github.com/jupyter/nbdime which is a Jupyter plugin
       | specifically to address this problem. Without it, diffing
       | notebooks is not feasible.
        
       | skybrian wrote:
       | If you're interested in this sort of thing you might want to look
       | at Dolt (for sharing databases in a git-like way) and Pijul,
       | which records diffs explicitly, rather than calculating them on
       | the fly.
       | 
       | I wonder if there might be a clever way to encode source code in
       | a Dolt database? Maybe each function should be a record?
        
         | LukeEF wrote:
         | Author is CTO of TerminusDB (https://terminusdb.com/), which is
         | a more graphy version of Dolt! Check it out.
        
       | hardwaregeek wrote:
       | I've wanted this for a while, but I will say there's some
       | caveats. Sometimes I want to commit just as a "it's the end of
       | the day, I want to leave, here's a code dump". I suppose you
       | could have multiple tiers of code saving.
       | 
       | I've also wondered about whether you could do code analysis with
       | time as a dimension. If you can analyze the evolution of the code
       | and pull old implementations, what can you do? Autocomplete is a
       | good example, as it can pull previous patterns you've used. Maybe
       | some way to tell the programmer "hold up, you've made this
       | mistake before, don't do it again"? I'm not sure.
        
         | inetknght wrote:
         | > I want to commit just as a "it's the end of the day, I want
         | to leave, here's a code dump"
         | 
         | 1. git checkout -b eod-$(date -Id)
         | 
         | 2. commit
         | 
         | 3. leave
         | 
         | 4. return
         | 
         | 5. git checkout -
         | 
         | 6. git merge - --no-commit
        
           | hardwaregeek wrote:
           | Yes, but if we're talking about some hypothetical tool that
           | requires a valid AST, there might be a situation where I
           | don't have a valid AST and want to save the code. Similarly,
           | I had a job where we used pre-commit hooks that ran a linting
           | script. I had to override the hook to commit which was
           | slightly annoying at times.
        
       | aardvark179 wrote:
       | I've done quite a lot of work on version management on structured
       | data (in my case this was for a version managed GIS database) and
       | it's not an easy problem, and is likely even harder with
       | something like an AST that is generated from a text file and so
       | does not preserve the identity of nodes. I'm not saying that it's
       | impossible, but it is more work and requires more tooling around
       | it than people think, and it keeps coming up here and other
       | places as a, "really good idea."
        
       | cormacrelf wrote:
       | Counterpoint: a quick google reveals diffsitter:
       | https://github.com/afnanenayet/diffsitter
       | 
       | The output could be a lot more compact, it could do better at
       | adding context (in the same way
       | https://github.com/romgrk/nvim-treesitter-context does, etc), but
       | if you're interested in this it's really within reach, go help
       | out.
       | 
       | I wonder if you can use it for automerge yet.
        
       | Jensson wrote:
       | I don't see how this could ever work on evolving languages,
       | different GIT versions would produce different commits and read
       | commits differently based on the latest C++ standard. This would
       | potentially lead to version control bugs where different GIT
       | versions create different results from the same commit, that is
       | horrible, version control needs to be 100% bug free in that
       | regard.
       | 
       | The only reasonable application would be to use a language AST
       | parser to better identify relevant text diffs, but the commits
       | still need to be stored as text.
        
         | dboreham wrote:
         | This doesn't really make sense, because in order to have those
         | code changes compile correctly, there must be a corresponding
         | commit to CI config that changes the compiler version or
         | compiler switches for the new language version. The "semantic-
         | diff-er" can also be driven by that commit such that it uses
         | the correct language version.
        
           | verdverm wrote:
           | It's non trivial to support multiple versions of the same
           | language on a host system. You have to account for dev
           | machines and workflows as well.
           | 
           | Docker can help with this, but often devs don't want to run a
           | container to build their code. It's a hard habit to change.
           | 
           | Now, consider how difficult it would be to get the differ to
           | understand where to find compiler versions.
        
         | shepherdjerred wrote:
         | Commits could be stored as is, the difference would be that
         | diffs are clearer when presented to a human.
        
         | pkghost wrote:
         | How is this different from any other problem that is already
         | solved by version pegging?
        
       | ozim wrote:
       | I feel this is just an example of "worse is better": the whole
       | proposition is interesting but totally impractical, and I would
       | not like Git to go anywhere near that idea.
        
       | raxxorrax wrote:
       | Theoretically it might work, but I don't think I am too fond of
       | the idea. I used git pull to completely waste my source and it
       | would have been nice for git to have more intelligence here, but
       | in the end I think some of its success lies in its simplicity.
       | 
       | SVN isn't too bad and not too much of a difference to git if you
       | use a central repository anyway. The main neat thing was to just
       | have one hidden folder, not in every subdirectory.
       | 
       | Git would also need the ability to transform from AST to source
       | for every language. A bit unrealistic and there is no benefit to
       | it. Could also do that with Assemblies and some meta info for the
       | decompiler.
        
       | gumby wrote:
       | Shared (concurrent) code editors might work better if their
       | CRDT/OT model worked at that level.
       | 
       | Not that I really want to edit code in a shared environment
       | (editing documents that way is bad enough), but just musing...
        
       | nerdponx wrote:
       | Storing AST instead of source code is one of the goals of the
       | very interesting Unison programming language:
       | https://www.unisonweb.org/
       | 
       | Part of what's nice about Git (and plain text in general) is that
       | it's the lowest common denominator for a lot of things. This is
       | why traditional Unix tools are built oriented around streams of
       | bytes. Text is a low level carrier protocol; you can encode
       | almost anything in it, but you need to agree on some kind of
       | format.
       | 
       | The good part is that you can use very very generic tools on
       | almost arbitrary pieces of data. The bad part is that you might
       | have to do a lot of parsing and re-parsing of the same data, and
       | you have to contend with the dangers of underspecified formats.
       | 
       | Git follows the Unix tradition in this regard. As a result, it is
       | nearly universal in what it can store. You can use it to store
       | pretty much anything, but you are now at the lowest common
       | denominator of support for any particular data format.
       | 
       | Git-for-ASTs will no longer have this universality property, but
       | will gain a lot more power in the covered domain. This is a
       | design tradeoff.
       | 
       | One thing that's nice about Git is that you can specify arbitrary
       | diff drivers with the "attributes" system. So even if the Git
       | database is storing plain text, your diff driver can parse your
       | source code into ASTs and present AST diffs to you when you run
       | `git diff`. Perhaps more impressive, you can configure custom
       | _merge_ drivers, so you can (theoretically) implement semantic
       | merging of ASTs right inside Git.
       | 
       | There are probably some fundamental limitations of this system,
       | because the underlying data is still stored as blobs of bytes.
       | But you can get pretty far as long as you don't mind parsing and
       | re-parsing the same text over and over.
        
         | ssivark wrote:
         | Has this approach been tried? (Unison or otherwise...)
        
       | mumblemumble wrote:
       | I would maybe be interested in Git allowing you to plug in your
       | own diff generators for different file types.
       | 
       | But I would not want Git itself trying to understand the contents
       | of files. That seems to me to be an idea that lives on a
       | misconception of the "things programmers believe about names"
       | variety. Not every file in source control is source code. Not
       | every programming language's grammar maps to an abstract syntax
       | tree. In some files, such as makefiles, the difference between
       | tabs and spaces is semantically significant. Some languages (such
       | as Fortran and Racket) have variable syntax. And so on and so
       | forth.
       | 
       | So I think that we _really_ don't want the source control system
       | itself trying to get too smart about the contents of files. That
       | will inevitably make the source control system less compatible
       | with the various kinds of things you might want to put into
       | source control. And it will also make the source control system a
       | lot more complicated than it would otherwise be, in return for a
       | largely theoretical payoff.
       | 
       | But if we want to delegate the work of generating diffs off to
       | other people, so that Git can allow for syntax or semantics-aware
       | diffing without having to personally wade into that quagmire (and
       | perhaps also allowing language communities to support multiple
       | source control systems, a bit like how it works with LSP), that
       | might be an interesting thing to experiment with.
        
         | ffwacom wrote:
         | > Not every programming language's grammar maps to an abstract
         | syntax tree
         | 
         | Are there some examples of this?
        
           | mumblemumble wrote:
           | Forth. You could certainly define a formal grammar for it and
           | construct an AST, but it would be trivial and not very
           | useful.
        
           | simcop2387 wrote:
           | Perl, this is because you can't actually properly parse Perl
           | without also running Perl code at the same time.
        
             | justinator wrote:
             | https://metacpan.org/pod/PPI
        
         | madmax96 wrote:
         | I disagree. Many engineers want to refactor across a sequence
         | of small PRs, for example. Small PRs are a good thing, because
         | they're easier to understand. But today, Git makes this
         | painful. Also, understanding how the meaning of code changes
         | over time can help reduce bugs.
         | 
         | The solution will have to be pluggable. But I think it is
         | possible, and there are sane things to do (e.g. fall back to
         | vanilla git) when there are missing plugs.
        
         | saurik wrote:
         | > I would maybe be interested in Git allowing you to plug in
         | your own diff generators for different file types.
         | 
         | This is already supported.
        
           | lux wrote:
           | A common example is UnityYAMLMerge for merging the Unity game
           | engine's generated files.
           | 
           | https://docs.unity3d.com/Manual/SmartMerge.html
           | 
           | Configuring it to work with Git and others is a little ways
           | down the page, but would apply the same for other diff tools.
        
           | franga2000 wrote:
           | I looked this up and for anyone wondering, it's called
           | "diff/merge drivers", but there are only a handful of them
           | out there. Some highlights from a few minutes of searching:
           | 
           | - MS Office: https://github.com/lcnittl/DMFO
           | 
           | - SQLite: https://github.com/cannadayr/git-sqlite
           | 
           | - Jupyter notebooks:
           | https://nbdime.readthedocs.io/en/latest/vcs.html#git-integra...
           | 
           | One big caveat of this is that since git doesn't really store
           | just a stack of diffs, despite the fact it presents itself as
           | such to the user, a custom merge driver will not make your
           | .git grow any less than it would normally.
        
             | est31 wrote:
             | > One big caveat of this is that since git doesn't really
             | store just a stack of diffs, despite the fact it presents
             | itself as such to the user, a custom merge driver will not
             | make your .git grow any less than it would normally.
             | 
             | Note that git does support using deltas for storage. But
             | according to docs, custom diff drivers aren't used for
             | those; instead it's an instruction-based format.
             | 
             | https://git-scm.com/docs/pack-format#_deltified_representati...
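
To make "instruction-based" concrete, here is a toy model of a copy/insert
delta. It is not Git's actual binary encoding: the tuple layout and the
name `apply_delta` are assumptions chosen for readability.

```python
def apply_delta(base, instructions):
    """Rebuild a target byte string from a base plus delta instructions.

    Each instruction is either ("copy", offset, size), which reuses a slice
    of the base object, or ("insert", data), which adds literal new bytes.
    """
    out = bytearray()
    for op in instructions:
        if op[0] == "copy":
            _, offset, size = op
            out += base[offset:offset + size]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction {op[0]!r}")
    return bytes(out)


base = b"hello world\n"
delta = [("copy", 0, 6), ("insert", b"there"), ("copy", 11, 1)]
target = apply_delta(base, delta)  # b"hello there\n"
```

Nothing in the delta knows about lines or syntax, which is why a custom
diff driver has no effect on how objects are stored.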
        
       | mabbo wrote:
       | Reading this article, I feel as though the author doesn't deeply
       | understand git.
       | 
       | git works on blobs of data, not files, and not lines of text. It
       | doesn't just happen to also work on binary files- that's all it
       | works on.
       | 
       | Now, if the author is suggesting that git-diff ought to have a
       | language specific mode that parses changed files as ASTs to
       | compare, now I'm interested. Let's do that. I'll help!
       | 
       | But git does not need to change how it works for that to happen.
       | Git does not even need git-diff to exist to serve its main
       | purpose.
        
         | mbauman wrote:
         | You can already choose different `diff` programs to use for
         | particular filetypes. E.g., nbdime for Jupyter notebooks:
         | 
         | https://nbdime.readthedocs.io/en/latest/vcs.html#git-integra...
        
         | hardwaregeek wrote:
         | I dunno I feel like you're focusing on a detail that's not
         | particularly relevant. The author's main thrust is precisely
         | what you described about parsing changed files as ASTs.
        
           | nemetroid wrote:
           | It isn't relevant to the author's vision of content-aware
           | diffing, but it _is_ relevant to the author's complaints
           | about how Git's (alleged) text-based-ness makes Git awkward
           | to use with Jupyter notebooks. Has the author tried searching
           | the web for "git diff jupyter"?
        
         | munificent wrote:
         | The author is likely using "git" to mean "the entire typical
         | git user experience that git users spend time looking at".
         | 
         | And, from that perspective, Git-the-UX definitely does work on
         | line-oriented files.
        
         | screye wrote:
         | The git extension on VSCode is already pretty good at doing
         | diffs on jupyter notebooks.
         | 
         | I distinctly remember this not being a core feature of stock
         | git and needing Jupytext to enable version control on
         | notebooks. So, I feel like this sort of language specific stuff
         | is already happening, but not in any unified product.
        
         | munk-a wrote:
         | There's also a historical angle here that's important to
         | inspect - Git was designed to specifically be content agnostic.
         | There are some predecessors in the SCM space (like VSS) that
         | are specifically language aware and allow the checking out of
         | line ranges (pinning them so that no one else will make a
         | conflicting change specifically) and even entire functions -
         | these systems can cause a lot of grief while failing to protect
         | the logic they're specifically trying to protect. As the warts
         | on SVN got more and more visible I think the general assumption
         | was that the replacement SCM would come out of this code aware
         | space - but it didn't and in retrospect we all dodged a huge
         | bullet when that happened.
         | 
         | I absolutely adore tooling around git that makes diffs more
         | visible - one thing I absolutely gush over is anything that can
         | detect and highlight function reordering... however, the core
         | process of merging and rebasing and all that jazz - I don't
         | think we're going to find anything automated that I'll ever
         | trust when I'm not working on a ridiculously clean codebase -
         | minor changes can have echo effects and when two people are
         | coding in the same general area they need to be aware of what
         | the other person is trying to do.
        
         | tux3 wrote:
         | Note that git does work with diffs a lot.
         | 
         | Rebases and cherry-picks work by applying diffs, not by copying
         | blobs. Auto-merging also needs to look at file content as text,
         | you can't auto-merge a binary file with git.
         | 
         | It's an often repeated fact that if you look inside Git, it
         | doesn't work with diffs, it works with blobs. But if you look
         | closer, it's often diffs again!
        
           | arghwhat wrote:
           | With cherry-picks (and thus rebase), you ask git to turn a
           | commit into a patch, so it does just that.
           | 
           | I would mostly consider auto merges (which I guess are bolted
           | on) as the main area where git itself uses diffs during
           | resolution and even then only as a suggested resolution (you
           | get warned and need to confirm it when validating the merge).
           | 
           | So no, it's blobs all the way down. Darcs and Pijul are patch
           | based though.
        
             | cryptonector wrote:
             | Merges, rebases, cherry-picks, are all the same kind of
             | thing. A merge is essentially a rebase that squashes all
             | the commits being picked.
        
             | tux3 wrote:
             | It's true that git is blob based, as opposed to patch
             | based, but it's not the full picture! In practice, git
             | stores a lot more diffs (or rather, deltas) than it stores
             | loose blobs. (And you probably know this already, but I
             | feel it's still worth making explicit)
             | 
             | This is necessary, because when a repo accumulates commits,
             | it becomes a lot more efficient to store most of the
             | objects as deltas instead of separate blobs. If Git didn't
             | do this, it would have a lot of copies, and they would take
             | a lot of space.
             | 
             | So the fundamental model of git is truly based on blobs in
             | theory, but in practice many or most git commands will
             | operate on packfiles, and if you look in your .git object
             | store, most likely you will have a few big packfiles
             | containing most objects, and then a much smaller collection
             | of loose blobs.
             | 
             | All those diffs are what the "resolving deltas" progress
             | indicator that people see when they do a big clone, fetch,
             | or checkout is about =)
        
               | ori_b wrote:
               | > In practice, git stores a lot more diffs (or rather,
               | deltas) than it stores loose blobs.
               | 
               | The diffs it stores are not the diffs you see in git
               | diff.
               | 
               | They're rolling checksum based chunks. The data that the
               | delta is computed against is picked with a heuristic
               | ("sort by name and date, try the top 10, and use the
               | smallest result"). And, in practice, the heuristic diffs
               | the older files against the newer ones, rather than
               | diffing in chronological order, so that getting recent
               | data doesn't involve a lot of delta application.
               | 
               | The git deltification is better thought of as a
               | compression method than as diffing.
        
         | dboreham wrote:
         | Pretty sure OP does understand, and is proposing what you
         | deduced.
         | 
         | Incorporate some semantic understanding of the version
         | controlled data into the VCS. Currently this work is
         | subcontracted to humans.
        
           | mabbo wrote:
           | Maybe I'm misunderstanding. It's just lines like this:
           | 
           | > The text-orientated design of git reflects...
           | 
           | > The current version of git is also able to find differences
           | in binary files.
           | 
           | > if we were storing information as ASTs, rather than lines
           | of text
           | 
           | These all, to me, show a gap in the author's understanding of
           | how git works. And that's okay- git is often easier to use
           | than it is to understand.
           | 
           | But if they had a better understanding, they could make their
           | point far better. And without understanding, they won't be
           | able to implement this idea.
        
       | alkonaut wrote:
       | I'd give anything just to get a few basic merge modes. For
       | example "this file can treat two one line additions as
       | unordered".
       | 
       | So any shared append-only file (a change log, an enumeration,...)
       | doesn't automatically conflict.
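
For append-only files, the "unordered additions" mode can be sketched as a
custom merge driver. This assumes Git calls the driver with the ancestor,
current and other versions (the `%O`, `%A` and `%B` placeholders) and
expects the result to be written over the current file. The function names
are hypothetical, and deletions are deliberately ignored.

```python
def union_merge(base, ours, theirs):
    """Keep every line of `ours`, then append what `theirs` added.

    A line counts as added by `theirs` when it is absent from `base`.
    Lines already present in `ours` are not duplicated.
    """
    base_lines = set(base)
    seen = set(ours)
    merged = list(ours)
    for line in theirs:
        if line not in base_lines and line not in seen:
            merged.append(line)
            seen.add(line)
    return merged


def run_driver(base_path, ours_path, theirs_path):
    """Merge three files on disk and overwrite the 'ours' file."""
    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()

    merged = union_merge(
        read_lines(base_path), read_lines(ours_path), read_lines(theirs_path)
    )
    with open(ours_path, "w", encoding="utf-8") as f:
        f.write("".join(line + "\n" for line in merged))
    return 0  # exit status 0 tells Git the merge was clean


merged_example = union_merge(["v1"], ["v1", "fix A"], ["v1", "fix B"])
```

Git also ships a built-in `union` merge driver that can be selected per
path in `.gitattributes`, which may already cover the change-log case.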
       | 
       | Syntax aware diffing would be great too, but I'd take something
       | much simpler. For syntax aware stuff I'd love something that
       | could tell semantic changes from noise.
        
       | maweki wrote:
       | Working on the AST is quite an interesting idea, until your
       | comments aren't in the AST and you want to commit a syntax error
       | of work in progress.
       | 
       | Not to mention changing ASTs (while maintaining concrete syntax)
       | in different versions of the language.
        
       | ufo wrote:
       | I'm trying to remember the citation, but I remember seeing a
       | presentation once from someone who studied this and they said
       | that the thing that worked best was a hybrid approach: use
       | structured diff at the top level of the program (modules /
       | methods) but use line-based for statements and expressions.
       | According to them, the structured diff can give unintuitive
       | results if applied at the lowest syntactic levels.
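
The hybrid approach described above might look roughly like this for
Python sources: match top-level definitions structurally by name, then
fall back to a line diff inside each changed definition. This is a sketch
of the idea, not the method from the presentation, and the function names
are made up.

```python
import ast
import difflib


def top_level_defs(source):
    """Map each top-level function or class name to its source text."""
    defs = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defs[node.name] = ast.get_source_segment(source, node)
    return defs


def hybrid_diff(old_source, new_source):
    """Structured diff at the top level, line-based diff within each item."""
    old_defs = top_level_defs(old_source)
    new_defs = top_level_defs(new_source)
    report = []
    for name in sorted(old_defs.keys() | new_defs.keys()):
        if name not in new_defs:
            report.append(f"removed {name}")
        elif name not in old_defs:
            report.append(f"added {name}")
        elif old_defs[name] != new_defs[name]:
            report.append(f"changed {name}")
            report.extend(
                difflib.unified_diff(
                    old_defs[name].splitlines(),
                    new_defs[name].splitlines(),
                    lineterm="",
                    n=1,
                )
            )
    return report


old = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new = "def b():\n    return 2\n\ndef a():\n    return 10\n"
report = hybrid_diff(old, new)
```

In this example the two functions swap places and only `a` changes, so the
report mentions `a` alone. A purely line-based diff would show the
reordering as removed and added lines.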
        
       | Karellen wrote:
       | `git` generally doesn't work with lines of text. Mostly it works
       | with opaque file blobs and directory trees.
       | 
       | `git diff` and `git merge` work with lines of text _by default_ -
       | but they don't have to. You can supply your own `diff` and
       | `merge` tools with the `difftool.*` and `mergetool.*` config
       | options, try them out with `git-difftool` and `git-mergetool`
       | commands, and set the default with the `diff.tool` and `merge.tool`
       | config options.
       | 
       | If someone wanted to create AST-based diff and merge tools for a
       | given language, they could be plugged right into the existing
       | `git` infrastructure and it would work with them absolutely fine.
        
         | kapep wrote:
         | > If someone wanted to create AST-based diff and merge tools
         | for a given language, they could be plugged right into the
         | existing `git` infrastructure and it would work with them
         | absolutely fine.
         | 
         | There's a lot of tooling in the Eclipse modelling ecosystem which
         | could be easily used for this. Storing XML-based models in git
         | is no problem and there's tooling for diffing and merging
         | models via a GUI or programmatically. Combined with the fact
         | that xtext DSLs use EMF models to represent ASTs, it wouldn't
         | be too hard to glue together an AST-based diff/merge tool for
         | an xtext DSL.
        
         | bspammer wrote:
         | This feature is useful in so many different places. I use it to
         | diff small encrypted files in my repo - just add `gpg -d` as a
         | diff configuration and now I can use git log, diff etc in a
         | meaningful way with binary files.
         | 
         | I've heard of people using it with pdfs as well - a pdf to html
         | converter lets you get a good idea of what changed in the
         | document.
        
         | colonwqbang wrote:
         | Yes, I think this article is coming at it from the wrong end.
         | Git is hardly the problem here, nor is it going to provide the
         | solution.
         | 
         | The problem seems to be that we are lacking the format and the
         | toolchain to manipulate it, and that is not the fault of git.
         | 
         | What is the state of the art in this area? Does somebody know
         | of a viable format and toolchain, or any interesting projects
         | looking to build them?
        
         | tyleo wrote:
         | I believe that semantic merge does something like this:
         | https://www.semanticmerge.com/
        
         | dTal wrote:
         | What if generating a diff is nontrivial? Say you rename an
         | identifier. That might be a single command in an IDE. A
         | sufficiently high-level "diff" format could easily capture that
         | intent. But working backwards from hundreds of touched lines
         | across many files to deduce that single semantic edit is not
         | trivial. Git assumes that arbitrary diffs can be deduced from
         | "before" and "after" files, but this isn't the case - it may be
         | that you'd rather generate the new file from the diff!
        
         | rileymat2 wrote:
         | > `git` generally doesn't work with lines of text. Mostly it
         | works with opaque file blobs and directory trees.
         | 
         | I am not sure this is true.
         | 
         | In the past it gave me problems with line ending normalization
         | between windows/mac/linux, in and out. In those cases it
         | definitely had a lines of text view of things.
        
           | Ajedi32 wrote:
           | It _is_ generally true, but yes; automatic line ending
           | conversion is an exception. You can turn it off with `git
           | config --global core.autocrlf false`, though be aware that
           | can cause issues if you have developers on different
           | operating systems creating and committing files with
           | different line endings.
        
           | ajanuary wrote:
           | Git is delegating to the diff to work out how to merge the
           | blobs. It's the diff that is having trouble with the line
           | endings.
        
             | rileymat2 wrote:
             | No, in the past, git would _change_ line endings.
             | 
             | A check in on one machine and a check out on another
             | machine would give different files.
        
         | indentit wrote:
         | I guess this is what diffsitter[1] is for.
         | 
         | [1]: https://news.ycombinator.com/item?id=27875333
        
         | kmeisthax wrote:
         | Indeed. The Composer merge driver is critical for being able to
         | work with modern PHP frameworks without tearing your hair out
         | on every merge.
         | 
         | Merge drivers are Git's most powerful and least known feature,
         | and I really wish they were more common.
        
       | auscompgeek wrote:
       | Note that you can specify a custom merge driver for different
       | file types using a combination of gitattributes and git-config:
       | https://git-scm.com/book/en/v2/Customizing-Git-Git-Attribute...
        
       | Smaug123 wrote:
       | I'm surprised they didn't mention Unison
       | (https://www.unisonweb.org/), whose big idea is an immutable
       | content-addressable store of ASTs. I really hope it changes
       | everything.
        
         | renox wrote:
         | Except that Unison created its own language which makes pretty
         | sure that they are doomed to fail.. I don't know if there is a
         | technical reason for the new language or if it's NIH syndrome.
        
       | atonalfreerider wrote:
       | Self-promote: Primitive does AST diffing and represents the
       | changes graphically
       | 
       | primitive.io
        
       | jrm4 wrote:
       | What if Programming Languages worked with Lines of Text?
        
       | afavour wrote:
       | I do kind of love the idea of Git using ASTs instead of source
       | code. It makes a ton of sense.
       | 
       | Even just in the immediate term I wish I could make Git(hub)
       | tabs/2 spaces/4 spaces/whatever agnostic. Seems crazy to me that
       | in 2021 we still have to make opinionated choices across orgs
       | about what to use... why can't we pull the code down, view it in
       | whatever setup we want, then commit a normalized version?
       | 
       |  _[whispers] this is actually something tabs allow you to do
       | natively by setting custom tab widths in text editors but I've
       | given up trying to sell people on tabs at this point and just
       | want to be able to do my own thing_
        
         | Anon_troll wrote:
         | The whitespace and formatting are not significant to the
         | compiler, but they can provide a lot of information to the
         | reader of the code.
         | 
         | You can often see where the writer put the most effort and
         | thought by just seeing how they wrote it. This can help
         | analyzing a codebase considerably.
         | 
         | If everything is normalized, you lose those valuable cues.
        
         | geofft wrote:
         | One of the practical issues here is, if your code fails to
         | compile in CI with an error like
         | /home/ci/src/foo.c:123:45: error: use of undeclared identifier
         | 'a'
         | 
         | or                   /home/ci/src/bar.py:50: syntax error in
         | type comment
         | 
         | or crashes in production with an error like
         | java.lang.NullPointerException             at
         | com.example.Baz.doThings(Baz.java:1337)
         | 
         | you really want to be able to find line 123 column 45, line 50,
         | or line 1337 in your editor, and have that be the _same_ line
         | as what your CI compiled and deployed.
         | 
         | On its own, tabs vs. spaces only affects columns, and you can
         | probably figure things out without columns (although it's a
         | shame to lose it). But different tab sizes affect how long your
         | lines are, and line wrapping is a thing that people care about
         | at least as much as tabs vs. spaces (people with different size
         | monitor or fonts will easily see too-long or too-short lines on
         | their display; if your spaces are equivalent to the tab stop,
         | the distinction is literally invisible). And once you start
         | rewrapping lines, everyone's line numbers are different.
         | 
         | I think it's possible to solve this by using some sort of AST-
         | based index into the file and teaching IDEs to let you seek
         | based on that, but it's suddenly a more complex problem.
        
           | ratww wrote:
           | This is already a very common problem with a solution:
           | transpiled JS already needs source maps to display errors
           | correctly.
        
             | geofft wrote:
             | No, I don't think that's the same problem / the same
             | solution. A source map translates between a layout checked
             | into the code and a format generated at build time. I'm
             | talking about translating between a layout in a developer's
             | local workspace and the layout checked into the code.
             | 
             | Since the developer can choose whatever formatting options
             | they want, there isn't a single source map that can be
             | referenced in the compiled version of the code, in
             | backtraces, etc. So the transformation cannot be done at the
             | point the error is displayed (compiler output or backtrace
             | output), it has to be done in the context of the
             | developer's local workspace.
             | 
             | I think source maps could probably be inspiration for
             | solving this problem, but I don't think they would work
             | directly - and even if they did, the real problem here is
             | not designing a solution, it's getting everyone's IDEs to
             | work properly with it. Source maps work largely because the
             | major browsers know how to deal with source maps in JS.
             | You'd have to extend this to all the other ecosystems, at
             | the very least.
        
         | pbiggar wrote:
         | fwiw, this is what we do in Dark [1]. We store (serialized)
         | ASTs, then we pretty print them in the editor. This
         | converts the AST into tokens that you see on your screen,
         | complete with configurable* indentation, line-length, etc. Code
         | would be displayed according to your config* and the same code
         | displayed differently to a different developer looking at the
         | same code.
         | 
         | [1] https://darklang.com
         | 
         | * I haven't actually enabled users to configure this, but it's
         | just some variables called 'indent' and `lineLength` in the
         | code
        
         | enriquto wrote:
         | _[whispers] don't give up! There's quite a bunch of us. Our
         | day will come! Long live glorious tabs!_
        
           | silon42 wrote:
           | I'm fine with using tabs, but my tab width will be set to
           | 8... be sure to obey line length limits with that in mind.
        
             | jcelerier wrote:
             | anything beyond 2 is heresy, and some days I'm tempted to
             | go down to 1
        
               | enriquto wrote:
               | Heretic! From the book of Linus [0], chapter one:
               | 
               | > Tabs are 8 characters, and thus indentations are also 8
               | characters. There are heretic movements that try to make
               | indentations 4 (or even 2!) characters deep, and that is
               | akin to trying to define the value of PI to be 3.
               | 
               | [0]
               | https://www.kernel.org/doc/html/latest/process/coding-
               | style....
        
               | klyrs wrote:
               | Cursed April fools update: tabs are now π spaces wide.
        
               | jcelerier wrote:
               | of course PI isn't 3, it's 1 (from a distance)
        
               | a1369209993 wrote:
               | Only if you're a cosmologist.[0]
               | 
               | 0: http://xkcd.com/2205/
        
               | giomasce wrote:
               | Math trivia: there are cases in which it is sensible, in
               | sufficiently advanced mathematics, to define pi as 3 (or
               | whatever other number).
               | 
               | I don't use tabs, but I'd say that the biggest
               | advantage of using tabs is that everybody can configure
               | their own editor to make them as large as they wish.
        
               | mellavora wrote:
               | Three shall be the number of the counting and the number
               | of the counting shall be three. Four shalt thou not
               | count, neither shalt thou count two, excepting that thou
               | then proceedeth to three. Five is right out.
        
               | [deleted]
        
             | encryptluks2 wrote:
             | There are no line limits. That is what word wrapping is
             | for.
        
         | convolvatron wrote:
         | having presentation be flexible and different from the
         | underlying model is a great idea for code
         | 
         | but admit it, tabs are fragile and a pretty weak implementation
        
           | wutbrodo wrote:
           | > admit it, tabs are fragile and a pretty weak implementation
           | 
           | Could you elaborate? I don't have a personal opinion here and
           | have only worked in orgs that require spaces, but I'm not
           | familiar with the criticisms of tabs.
        
             | convolvatron wrote:
             | it only works for initial indentation, so people that like
             | columnar layouts are kinda screwed. auto-tabbing tools will
             | take n-spaces and turn them into a tab, which screws up
             | stuff.
             | 
             | lets just take the whole idea one step further and either
             | use tools that reformat based on agreed upon styles
             | (meaning a developer could reasonably take the source,
             | project it into their preferred style and project it back
             | out again).
             | 
             | or store the canonical version as structured data in a
             | database and always project it into some text for viewing.
             | 
             | broader adoption of formatters has drastically reduced the
             | number of pointless and emotional formatting arguments I've
             | gotten into. lets push that further.
        
             | gregmac wrote:
             | For me the problem happens as soon as tabs are used for
             | alignment, instead of just indent. The benefit of tabs is
             | custom tabstop. If anyone does anything that undermines
             | that benefit, you might as well use spaces to avoid all the
             | problems caused.
             | 
             | Consider the following code:
             | 
             |     if (x)
             |     {
             |         SomeMethod(paramater1,
             |                    paramater2,
             |                    parameter3);
             |     }
             | 
             | If done "properly", it is:
             | 
             |     if (x)
             |     {
             |     <tab>SomeMethod(paramater1,
             |     <tab><spaces...>paramater2,
             |     <tab><spaces...>parameter3);
             |     }
             | 
             | What I often see, that totally breaks the entire point of
             | tabs:
             | 
             |     if (x)
             |     {
             |     <tab>SomeMethod(paramater1,
             |     <tab><tab><tab><space><space>paramater2,
             |     <tab><tab><tab><space><space>parameter3);
             |     }
             | 
             | The same thing happens if you are trying to align
             | table-style code:
             | 
             |     var badMixedTypeArrayExample = [
             |         [ "some",          true,        128,  x                 ],
             |         [ "long strings",  true,          8,  someLongVariable  ],
             |         [ "and",           false,     16384,  x                 ],
             |         [ "short",         true,   12345678,  anotherVariable   ],
             |     ];
             | 
             | If tabs are used between fields, it will look like a hot
             | mess to anyone with a different tabstop than the author.
        
               | zkldi wrote:
               |     if (x)
               |     {
               |         SomeMethod(paramater1,
               |                    paramater2,
               |                    parameter3);
               |     }
               | 
               | You simply should not write this code. It's unclear, and
               | performs nonsense indentation. You could do:
               | 
               |     if (x)
               |     {
               |         SomeMethod(
               |             paramater1,
               |             paramater2,
               |             parameter3
               |         );
               |     }
               | 
               | If you need your function to use line breaks.
        
               | gregmac wrote:
               | I totally agree; I personally hate this style of code!
               | However, people still write it (in the same way they
               | screw up tabs+spaces), and in some code bases it's "the
               | style" they use.
               | 
               | I've also seen a lot of SQL and LINQ (C#) written in this
               | way, as well as things like:
               | 
               |     var longString = "Line 1\n" +
               |                      "Line 2\n" +
               |                      "Line 3";
        
               | zkldi wrote:
               | Personally, I'd go as far as to say that `alignment` is an
               | anti-pattern.
               | 
               | Setting up an automatic formatter and using tabs is
               | personally the best of all worlds. Space-like alignments
               | like
               | 
               |     var someReallyLongVar = 5;
               |     var x                 = 10;
               | 
               | Are the worst!
        
               | cool_scatter wrote:
               | Which is the reason for the very common stance "tabs for
               | indentation, spaces for alignment".
        
               | nybble41 wrote:
               | Which is easy to say, but hard to make everyone do
               | correctly. First you need to ensure that everyone uses an
               | editor with a "visible whitespace" option, and turns it
               | on, so they can see whether they have the right
               | whitespace. Then you get to spend precious programming
               | time turning one kind of whitespace into another since
               | most editors will get it wrong when they auto-indent.
               | 
               | Either use spaces everywhere so you have total control
               | over the layout or forego alignment (other than block
               | indentation). Mixing tabs and spaces is a path to
               | madness.
        
               | PaulDavisThe1st wrote:
               | This is part of the reason why editors for programmers
               | and editors for general text editing are not the right
               | thing.
               | 
               | I have F11 in emacs bound to whitespace-cleanup, which
               | takes care of it all for me. And supertabs mode in
               | general works just the way it should with tabs-
               | indent/spaces-align.
               | 
               | Then there's also clang-format, possibly used as a
               | post-receive hook in git (and some other VCS) which makes
               | irrelevant what the programmer's editor did, mostly.
        
             | jcranmer wrote:
             | The main supposed advantage of tabs is that everyone can
             | set their own custom preferred tab-width and be done with
             | it, but this advantage doesn't actually play out in
             | practice:
             | 
             | * There's usually a maximum line length restriction as
             | well, so you need to know what the tab-width is to figure
             | out if a line needs to be broken into multiple lines.
             | 
             | * There are also cases where you need exact-column
             | alignment, even across multiple indent widths. A simple
             | example is as follows:
             | 
             |     module whatever {
             |         fn long_function_name_whatever(arg1: type1,
             |                                        arg2: type2,
             |                                        arg3: type3,
             |                                        etc:   etc4,
             |                                        do:     you,
             |                                        get:    my,
             |                                        point:  now);
             |     }
             | 
             | So in practice, tab width for a project is actually fixed
             | to a particular value. And then you discover that wrong-
             | tab-width code becomes quite annoying to read. I hate
             | reading GNU style guide code, which uses 8-space-tabs but
             | indent-width of 4, because the indenting is unreadable
             | unless I mess with the tab spacing for an individual file
             | I'm reading.
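The interaction between tab width and a maximum line length can be checked directly. A minimal sketch, assuming a hypothetical 80-column limit and a made-up sample line; `display_width` is an illustrative helper built on Python's standard `str.expandtabs`:

```python
def display_width(line, tab_width):
    """Number of columns the line occupies when tabs are tab_width wide."""
    return len(line.expandtabs(tab_width))


LIMIT = 80  # hypothetical project-wide line length limit

# One line of code, indented with three tabs.
line = "\t\t\tresult = compute(alpha, beta, gamma)  # a fairly long trailing comment"

for tab_width in (2, 4, 8):
    width = display_width(line, tab_width)
    verdict = "fits" if width <= LIMIT else "too long"
    print(f"tab width {tab_width}: {width} columns, {verdict}")
```

The same bytes pass or fail the limit depending on the reader's tab setting, which is why a line length rule ends up pinning a project to one tab width.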
        
               | Asraelite wrote:
               | Alignment can just be done with spaces. This can then be
               | enforced by a style checker.
               | 
               | But the maximum line length problem is real. I would be
               | 100% for tabs if it wasn't for this issue and imo it's
               | the only real criticism you can make that doesn't have a
               | good solution.
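A style checker for "tabs for indentation, spaces for alignment" could look roughly like this. It is a heuristic sketch, not a real linter: `check_indentation` is a made-up name, and the second rule (tab depth may grow by at most one level per line) is an assumption that will not suit every code style.

```python
def check_indentation(source):
    """Return (line_number, message) pairs for suspicious leading whitespace."""
    problems = []
    previous_tabs = 0
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no indentation information
        leading = line[:len(line) - len(line.lstrip(" \t"))]
        tabs = len(leading) - len(leading.lstrip("\t"))
        if "\t" in leading[tabs:]:
            # A tab appears after a space: tabs are being used to align.
            problems.append((number, "tab after space in leading whitespace"))
        elif tabs > previous_tabs + 1:
            # Depth jumped by several tabs at once: probably alignment via tabs.
            problems.append((number, "indent jumps by more than one tab"))
        else:
            previous_tabs = tabs  # only trust lines that passed both checks
    return problems


good = "if (x)\n{\n\tSomeMethod(a,\n\t           b);\n}\n"
bad = "if (x)\n{\n\tSomeMethod(a,\n\t\t\t  b);\n}\n"
print(check_indentation(good))
print(check_indentation(bad))
```

The first sample indents with one tab and aligns the continuation with spaces, so it passes. The second aligns with extra tabs and is flagged on the continuation line.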
        
               | smolder wrote:
               | The good solution to the line length problem is to not be
               | strict about them. My line length rule is usually "stay
               | roughly within 100 spaces, 120 is too long." If you are
               | seriously undermined by lines being too long, then your
               | text editor choice/setup might be worth revisiting.
        
               | zerocrates wrote:
               | The alignment in the comment above doesn't work with
               | tabs: your initial line is going to be tab-indented,
               | which means if you want those next lines to align with it
               | you don't have any options for it to work.
               | 
               | Now, I tend to find it's better to just avoid that kind
               | of alignment in your code style completely (just push the
               | first arg to a new line so you're not space-ing
               | everything out a mile to match the function call open
               | paren) but if that's your style then you can't really do
               | it with actually variable tab widths.
        
         | thefreeman wrote:
         | If you append `?w=1` to the diff view URL on a pull request it
         | makes it whitespace agnostic just FYI
        
         | williamdclt wrote:
         | It's not that you're going too far, it's that you're not going
         | far enough!
         | 
         | It's not a Git question, it's a programming language question.
         | There's no reason source code needs to be stored as plain
         | text[1]! Editors show it as text, we edit it as text, but why
         | wouldn't it be _stored_ as an AST? Not only does formatting
         | become an editor concern, but code could even be edited as a
         | tree, as a graph, as whatever you want
         | 
         | [1] - well, actually there's plenty of reasons: chiefly because
         | plaintext is very interoperable
        
           | jerf wrote:
           | "but why wouldn't it be _stored_ as an AST?"
           | 
           | It profoundly _is_. You can't store "an AST". You can only
           | store a serialization of it. The official language grammar is
           | a serialization of the AST custom crafted for that language.
           | It is as much an "AST" as any other serialization would be;
           | all such alternative representations would all produce
           | isomorphic memory representations if parsed from a proper
           | library.
           | 
           | At a high level it may sound useful to try to then provide a
           | cross-language AST representation, but it's one of those
           | things that sounds great at a high level but as soon as you
           | actually tried to implement it for, say, Python and C++,
           | you'd rapidly discover that in practice there's not as much
           | opportunity for "generic AST operations" as you may think.
           | 
           | The problem isn't that it isn't "stored as an AST" but that
           | $YOUR_LANGUAGE apparently doesn't have good libraries or
           | mechanisms for getting at it. Go, for instance, ships with
           | the relevant bits of the compiler exposed, and as a result
           | there are tons of tools that operate on Go code as ASTs and
           | not textually, because it's readily available and supported
           | by the core language team. I use this only as an example I
           | know personally, there are other languages that have similar
           | sorts of support as well.
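Python is another language that exposes its parser in the standard library, through the `ast` module. A small sketch of what that makes possible: comparing two files by structure instead of by text. The helper `same_ast` and the sample snippets are illustrative only.

```python
import ast


def same_ast(source_a, source_b):
    """True if both sources parse to the same tree, ignoring layout and comments."""
    # ast.dump omits line/column attributes by default, so formatting does not matter.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


compact = "def area(w,h): return w*h\n"
spaced = "def area(w, h):\n    # width times height\n    return w * h\n"
changed = "def area(w, h):\n    return w + h\n"

print(same_ast(compact, spaced))   # True: only whitespace and a comment differ
print(same_ast(compact, changed))  # False: the operator changed
```

The first comparison also shows the flip side raised elsewhere in this thread: the comment is simply absent from the tree, so a tool working purely on this AST would lose it.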
        
             | vlovich123 wrote:
             | I feel like you're picking a strawman here. The AST
             | serialization everyone is implying is one where you don't
             | need to token/lex but can just load it directly &
             | manipulate it (i.e. implying the on-disk version is a valid
             | AST or one whose validity can be trivially validated
             | without needing to have the entire language syntax &
             | grammar). First, that makes the compiler _much_ faster
             | because tokenization/lexing is moved to the "save" phase
             | which happens infrequently at human scale vs the
             | compile/processing phase which happens in an automated
             | fashion where the overhead can be notable. Additionally, if
             | you mmap the AST from disk into memory, you can use finer-
             | grained caching to memoize expensive analysis that happens
             | for faster compiles of code that's only changed slightly
             | (e.g. changing whitespace/comments wouldn't recompile
             | anything).
             | 
             | More importantly for advocates, it avoids needing to ship
             | the deserialization library and makes tooling simpler.
             | That's really why the idea of a simple AST format is so
             | attractive. Compiler frontends are typically very
             | tightly coupled to the underlying middle & back end.
             | There's some work in some languages to decouple this (e.g.
             | LSPs & Idea's failable parsing approach), but the efforts
             | are still very immature & it's still not clear to me that
             | it's worth it (see the last paragraph).
             | 
             | The main underlying challenge with making sure the on-disk
             | contents is well-formed according to the syntax rules is
             | that frequently you want to pause work at an intermediate
             | stage. This means you either have to make sure that
             | whatever state the user saves is a valid AST via editor
             | tricks (although I think this also typically means you have
             | to design the language around it), you reject saves, every
             | tooling library has to be capable of parsing malformed
             | ASTs, or you save a dirty transformation to apply to the
             | last known saved version so you can have the user resume
             | editing but otherwise tooling uses the "last known good"
             | version. That's the real challenge with having a serialized
             | version that's amenable to 3p tooling for interop.
             | 
             | Finally, all the "serialize the AST" solutions ignore the
             | problem of wanting to grep the codebase. This means you
             | need to change out several decades of line-oriented
             | manipulation tools in favor of new ones that are AST-based
             | & likely more complicated to write/maintain as compared
             | with one-line regular expressions. At least I've yet to see
             | any AST manipulation libraries that aren't drastically
             | different from existing text manipulation tools if clang-
             | tidy and Rust macros are any indication about what good
             | solutions to the problem look like today.
             | 
             | I think eventually we'll get AST serialization, but I think
             | it will be packaged into an entirely new language (like
             | Rust did with ownership) that also considers the tooling
             | aspect end-to-end rather than as a retrofit into existing
             | languages. Once that's successful, then I think we'll see
             | retrofits because the space will have been better explored
             | & other languages will benefit from the R&D into what a
             | successful path would look like.
        
               | seiferteric wrote:
               | > This means you need to change out several decades of
               | line-oriented manipulation tools in favor of new ones
               | that are AST-based
               | 
               | I wonder if a generic binary->text tool/library could
               | solve this. Grep could check the file mime type then call
               | the tool to convert from the binary format to the text
               | format if available. I could see this being useful for a
               | lot of binary formats.
        
               | res0nat0r wrote:
               | I think everyone may be interested in:
               | https://github.com/afnanenayet/diffsitter
               | 
               | Github having an option to have their PR GUI use an AST
               | diff like this could be a fun and useful option.
        
               | habitue wrote:
               | > that makes the compiler much faster because
               | tokenization/lexing is moved to the "save" phase which
               | happens infrequently at human scale
               | 
               | For dynamic languages like Ruby or Python, storing a pre-
               | parsed representation makes it a little faster. But for
               | compiled languages, lexing and parsing tend to be swamped
               | by the codegen step
        
               | vlovich123 wrote:
               | If you think about it more broadly where you can memoize
               | expensive results of AST -> code gen transformation or
               | AST -> AST simplification, then this will help
               | significantly for codegen, especially for incremental
               | builds but also clean builds if you have your CI cluster
               | sharing build cache information with your local devs.
               | 
               | Also, for a language like Rust, I'm not sure that there
               | isn't a significant amount of time spent validating
               | ownership & doing type inference. These are the kinds of
               | analysis you could save into the AST & thus save a
               | significant amount of build time when talking about large
               | projects. I agree for smaller projects a lot of these
               | optimizations are probably unimportant.
        
             | dotancohen wrote:
             | And this is why I love Python. It forces people to use the
             | same coding standard in some regards, and it forces people
             | to indent properly.
             | 
             | I really don't care anymore what that standard might be
             | (well, ok, I do prefer tabs) but I do care that it be
             | consistent. And I do DEMAND that proper nested indentation
             | be respected. Source code is meant to be human readable.
        
               | cratermoon wrote:
               | Python is not unique nor innovative in that respect. Even
               | FORTRAN and COBOL programs from the early days had very
               | strict rules about indentation and blocks.
               | 
               | The thing is, there's no reason we have to _store_ the
               | code like text. Even the punch card got this right: the
               | program wasn't stored as text, it was stored as physical
               | holes in paper. A very experienced programmer could often
               | look at a card with just the holes and have a rough idea
               | what it encoded, provided they knew if it was EBCDIC or
               | ASCII or whatever, but the computer didn't care. The
               | printed representation across the top line of the card
               | was just that: a representation.
        
               | dotancohen wrote:
               | I didn't claim that Python is unique nor innovative.
               | However, FORTRAN and COBOL are not modern languages in
               | the sense that one can reasonably expect a large
               | selection of first- and third- party libraries for most
               | common situations, and their availability on e.g. servers
               | is far more limited, thus learning them just for
               | scripting is not as good a choice as is Python.
        
               | thaumasiotes wrote:
               | > Even the punch card got this right: the program wasn't
               | stored as text, it was stored as physical holes in paper.
               | 
               | We still do that, storing executable programs as
               | executable binary rather than text. What else could you
               | do?
        
               | dotancohen wrote:
               | In the context of this thread, where we are discussing
               | Git as a medium for storing the form of the programs in a
               | format that is meant to be read and maintained by humans,
               | we do store the programs in text.
        
               | cratermoon wrote:
               | We store them as serialized text. We could store them as
               | nodes in an AST; we could store them as OLE/CFBF
               | structures like older versions of Microsoft Word, or do
               | what Ted Nelson suggested decades ago: T. H. Nelson,
               | "Complex information processing: a file structure for the
               | complex, the changing and the indeterminate," in
               | Proceedings of the 1965 20th national conference, New
               | York, NY, USA, Aug. 1965, pp. 84-100. doi:
               | 10.1145/800197.806036.
        
               | thaumasiotes wrote:
               | We do not store executables as serialized text, unless
               | they are meant to be executed by an interpreter.
        
           | cratermoon wrote:
           | > There's no reason source code need to be stored as plain
           | text
           | 
           | The same can be said for lots of different documents, and
           | it's been true for programs like Word for a long time. See
           | also [1]T. H. Nelson, "Complex information processing: a file
           | structure for the complex, the changing and the
           | indeterminate," in Proceedings of the 1965 20th national
           | conference, New York, NY, USA, Aug. 1965, pp. 84-100. doi:
           | 10.1145/800197.806036.
        
           | fwip wrote:
           | You might be interested in https://unisonweb.org/
        
           | Kinrany wrote:
           | Storing _everything_ in plain text is better. But there's no
           | reason source code needs to be _edited_ as plain text.
        
             | ssivark wrote:
             | Serializing structured data into ASCII streams makes it
             | very hard to then deserialize and re-structure.
             | 
             | Plain text might be the lowest common denominator for
             | Unix/shell tools, but we can do far far better in how
             | structured data is exchanged, which would make it much
             | easier to programmatically manipulate & process.
        
               | [deleted]
        
         | BiteCode_dev wrote:
         | Yes, but only if it falls back to text diff as soon as there is
         | the smallest doubt it can't provide a good AST diff.
        
         | fstrthnscnd wrote:
         | Tabs do work as long as they aren't fixed width (I don't know
         | what you mean by "custom").
         | 
         | For instance, in many languages, one will sometimes have to
         | split a function call to many lines, and in most languages
         | function names aren't of fixed length, thus in order to get a
         | correct alignment for parameters, the tab width at that point
         | will have to match the function name length.
         | 
         |     #include <stdio.h>
         | 
         |     int main(int argc, char* argv[])
         |     {
         |         printf("%s %d %s %s\n",
         |                __FILE__,
         |                __LINE__,
         |                __DATE__,
         |                __TIME__);
         | 
         |         return 0;
         |     }
         | 
         | I agree with your idea of storing a normalized version of the
         | code in the repo: it wouldn't then matter whether that version
         | contains characters to align the code properly, it would just
         | be inserted by the editor/linter as needed. The difficulty is
         | that sometimes linting isn't enough, and some manual formatting
         | is needed. Or perhaps the formatting rules are underspecified?
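         | 
         | A rough sketch of what "normalized" could mean, using Python's
         | own ast module (ast.unparse needs Python 3.9+). The helper names
         | normalize and same_code are made up for illustration. Note that
         | comments and any manual alignment are lost in the round trip,
         | which is exactly the difficulty mentioned above.

```python
import ast


def normalize(source: str) -> str:
    """Return a canonical rendering of Python source.

    Parsing and unparsing discards the author's layout, so two files that
    differ only in whitespace or line breaks normalize to the same text.
    Comments are lost too, a limitation a real tool would have to address.
    """
    return ast.unparse(ast.parse(source))


def same_code(a: str, b: str) -> bool:
    """True when two sources differ only in formatting."""
    return normalize(a) == normalize(b)
```

         | With this, same_code("call(a,\n     b)\n", "call(a, b)\n") holds,
         | so a repo storing normalize() output would not see that change.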
         | 
         | Another issue with AST diffing is when languages allow some
         | form of syntactic sugar as preprocessing: the compiler might
         | just see the simplified tree, not the one with the "sugary"
         | forms. A tool capable of parsing such languages should also be
         | able to handle these extensions.
        
           | Asraelite wrote:
           | > the tab width at that point will have to match the function
           | name length.
           | 
           | This is a non-issue. Use tabs for indentation and spaces for
           | alignment.
        
             | njharman wrote:
             | That is the kind of solution that ends up with you now
             | having 2 problems. New problem(s): having tabs and
             | spaces, having to think when to use them, having to
             | train/document everyone in usage, having to debate that
             | usage, having to correct code and chastise people who get
             | usage wrong.
             | 
             | Use an automatic code formatter with minimal options.
             | Automate either running code formatter on commit or denying
             | commits that change when code formatter is run on them.
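             | 
             | A sketch of the "deny commits that change when formatted"
             | variant as a Python pre-commit hook. format_source here is a
             | toy stand-in (it only strips trailing whitespace); a real
             | hook would call the project's formatter instead. The function
             | names are assumptions, the git commands are standard ones.

```python
import subprocess
import sys


def format_source(text: str) -> str:
    """Stand-in formatter (an assumption): strips trailing whitespace and
    ensures a single final newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines) + "\n" if lines else ""


def unformatted(files: dict) -> list:
    """Return the names of files whose content changes when formatted."""
    return sorted(
        name for name, text in files.items() if format_source(text) != text
    )


def staged_files() -> dict:
    """Read the staged (index) version of each added or modified file.

    Assumes every staged file is UTF-8 text.
    """
    names = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {
        name: subprocess.run(
            ["git", "show", f":{name}"],
            capture_output=True, text=True, check=True,
        ).stdout
        for name in names
    }


def main() -> int:
    bad = unformatted(staged_files())
    for name in bad:
        print(f"not formatted: {name}", file=sys.stderr)
    return 1 if bad else 0  # a non-zero exit makes git abort the commit
```

             | Saved as .git/hooks/pre-commit with a final
             | sys.exit(main()) line, this would reject the commit and
             | leave the fixing to the formatter, so nobody has to debate
             | or hand-correct anything.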
        
               | Asraelite wrote:
               | Absolutely, I wouldn't dream of doing any kind of fancy
               | alignment by hand, only with an auto-formatter.
               | 
               | If I had to break arguments onto multiple lines without
               | an auto-formatter I would just keep it simple and use
               | another level of indentation instead of aligning them
               | with the function name.
        
             | Latty wrote:
             | Better yet, just never do alignment.
             | 
             | Obviously readability is subjective, but personally I find
             | alignment is never valuable outside of tables of data, and
             | I'd argue generally having tables of data embedded in your
             | code isn't ideal.
             | 
             |     #include <stdio.h>
             | 
             |     int main(int argc, char* argv[]) {
             |         printf(
             |             "%s %d %s %s\n",
             |             __FILE__,
             |             __LINE__,
             |             __DATE__,
             |             __TIME__
             |         );
             | 
             |         return 0;
             |     }
             | 
             | Reads better to me and avoids the issue entirely. It also
             | plays more nicely with traditional diff tools anyway.
        
         | thrwyoilarticle wrote:
         | You can also write git hooks to turn their spaces into your
         | tabs & vice versa.
        
           | OJFord wrote:
           | That's not a good solution - every commit with an author
           | (well technically committer I suppose) whose opinion differs
           | from the last will be horrendous.
        
             | thrwyoilarticle wrote:
             | There won't be any difference. OP will run the script when
             | they checkout, work with their tabs, then run the script
             | when they commit. Spaces in, spaces out.
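             | 
             | Roughly what that pair of scripts could look like, as a
             | Python sketch. It assumes the repo version is indented with
             | spaces only and that one indent level is 4 spaces; the
             | function names follow Git's clean/smudge terminology.

```python
def smudge(text: str, width: int = 4) -> str:
    """Checkout direction: turn each run of `width` leading spaces into a tab.

    Leftover spaces (fewer than `width`) are kept after the tabs.
    """
    out = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" ")
        n = len(line) - len(body)  # number of leading spaces
        out.append("\t" * (n // width) + " " * (n % width) + body)
    return "".join(out)


def clean(text: str, width: int = 4) -> str:
    """Commit direction: turn leading tabs back into `width` spaces each."""
    out = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip("\t")
        n = len(line) - len(body)  # number of leading tabs
        out.append(" " * (n * width) + body)
    return "".join(out)
```

             | Git's clean/smudge filters (the filter.<name>.clean and
             | filter.<name>.smudge settings plus a filter=<name> entry in
             | .gitattributes) are one way to run such a pair on commit and
             | checkout, so that clean(smudge(x)) == x for what is stored.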
        
               | OJFord wrote:
               | Oh ok, sure. But then that's just a weak version of
               | what's being requested - an entirely neutral, more
               | agnostic, _abstract_ format that stores the meaning
               | without any formatting at all.
        
       | vxNsr wrote:
       | Is it just me or is he describing an IDE with source control?
        
       | bialpio wrote:
       | This made me think of Unison: https://www.unisonweb.org/
       | 
       | Discussion: https://news.ycombinator.com/item?id=27652677
        
       | [deleted]
        
       | jpitz wrote:
       | Didn't the VisualAge IDEs do this with their built-in version
       | control? This was 20 years ago, and I seem to remember that the
       | version control was at the method level, not file level.
        
       | ClassAndBurn wrote:
       | Git is designed to require human oversight. This is usually a
       | feature, but in recent years has become a bug with things like
       | GitOps.
       | 
       | It's important to remember that Git is a terrible database
       | because of its lack of semantic structure. All conflicts require
       | a human who does have the context. This is why almost no one
       | builds a system that uses Git as a two way interface. And when
       | they do, it's via GitHub Pull Requests (which go to humans) and
       | not Git itself.
       | 
       | In all, this makes it a wonderful general purpose shared
       | filesystem. And that's about it.
        
       | cies wrote:
       | > Structure editors haven't really taken off yet despite several
       | historical and contemporary attempts.
       | 
       | This is a nice contemporary one:
       | 
       | https://github.com/projectional-haskell/structured-haskell-m...
       | 
       | Lisps also have all kinds of options available in Emacs, but it
       | is more special to see this outside of the land of s-expressions.
        
       | jcrites wrote:
       | I've had loosely similar ideas before. The basic idea is to make
       | the compiler tool chain aware of diffs, and help scrutinize and
       | implement them. Refactoring suggestions could be included with
       | the diff.
       | 
       | For example, say you're depending on a module and it renamed a
       | class/method/trait/macro/constant/whatever. Or a synchronous
       | method has become async, or vice versa.
       | 
       | The diff could include programmatic instructions for consumers to
       | apply to their code bases switching them over to the new method.
       | This could be as simple as semantically changing the name used,
       | or in the case of changing sync to async it could add `await` in
       | the appropriate spot.
       | 
       | There's no limit to how complex the rewrite rules could be. You
       | could totally reorganize the parameters to a function and ship
       | that refactoring, or even add a parameter along with the default
       | code necessary to provide it.
       | 
       | Unlike the author I don't think code in Git will likely ever move
       | beyond source, plus the refactoring instructions needed to update
       | a change from a dependency -- perhaps a macro-like syntax.
       | 
       | Too much text manipulation is required from source control for me
       | to conceive of it being anything but human programming text in a
       | future I can imagine. Machines can already parse it; there
       | doesn't seem to be a compelling reason to store some other kind
       | of structure.
       | 
       | Refactoring wouldn't need to be any special Git extension, just a
       | file accompanying the commit with instructions for the language
       | tool chain.
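       | 
       | To make that concrete, here is a small Python sketch of what such
       | an accompanying file and its consumer-side tool might look like.
       | Everything in it is hypothetical: the RULES format, the
       | "rename_call" operation and the fetch/fetch_sync names are
       | invented for illustration. It only renames direct calls and does
       | no scope analysis, which a real tool chain would need.

```python
import ast

# Hypothetical rule file a dependency could ship alongside a breaking commit.
RULES = [
    {"op": "rename_call", "old": "fetch", "new": "fetch_sync"},
]


class RenameCalls(ast.NodeTransformer):
    """Renames direct calls to `old` so they call `new` instead."""

    def __init__(self, old: str, new: str):
        self.old = old
        self.new = new

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # handle nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        return node


def apply_rules(source: str, rules) -> str:
    """Apply each shipped rule to the consumer's source and re-emit it."""
    tree = ast.parse(source)
    for rule in rules:
        if rule["op"] == "rename_call":
            tree = RenameCalls(rule["old"], rule["new"]).visit(tree)
        else:
            raise ValueError(f"unknown rule: {rule['op']}")
    # ast.unparse (Python 3.9+) drops comments and original formatting.
    return ast.unparse(tree)
```

       | Running apply_rules over a code base and reviewing the resulting
       | diff is the "apply all and review" mode; an interactive mode
       | would show each rewritten call site before accepting it.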
       | 
       | Your IDE or CLI could walk you through interactively everywhere
       | it's getting applied, or you could apply all and review the
       | result in your app that consumes the module.
       | 
       | This would also open up security risks from accepting diffs from
       | dependencies and applying their refactorings, but unless modules
       | are sandboxed quite well that's a risk you take with updating
       | dependencies anyway. And you can always scrutinize the
       | refactoring-diff manually after it runs before accepting it.
       | 
       | The industry would probably standardize onto the notion that a
       | change that requires running automated refactoring from a
       | dependency across your codebase is a major version change; in
       | other words a breaking or backwards incompatible change, just one
       | that's much easier to upgrade to.
       | 
       | Languages with macros or other programmatic transformers would be
       | well suited to this concept I think.
       | 
       | Maybe Rust macros could be enhanced for the purpose to pattern
       | match over an existing codebase somehow: not just the macro
       | invocation point, but anywhere, e.g., a given trait or function
       | is used; and then the output of the macro would not feed into the
       | next stage of the compiler step but would instead result in
       | rewriting code on disk to produce a diff that you examine and
       | apply to your code.
       | 
       | A capability like this would make it much easier to manage large
       | aggregate code bases consisting of many dependencies. OSS package
       | maintainers or infrastructure providers at companies could ship
       | nominally backwards-incompatible changes that are still actually
       | compatible when you run the macro transformer that updates the
       | code that uses them.
       | 
       | For a simple example, imagine that `foo()` was previously a
       | function and the implementation chooses to add some optional
       | parameters or those with default values and change it into a
       | macro `foo!()`. The accompanying transformer would semantically
       | identify references to `foo()` and make the necessary updates.
       | You could rename global constants or traits or other code
       | elements this way.
       | 
       | Consider the way in which Google Guava has had to evolve over
       | time. A number of its features have become part of the Java
       | language, and thus the classes deprecated and removed gradually.
       | With a compiler facility like what I am describing, users of
       | Guava could run the transformer to migrate older code bases that
       | use methods like Guava's `Preconditions.checkNotNull(Object,
       | message)` to use Java's now-standard `Objects.requireNonNull(T
       | obj, String message)`. Because the maintainers of Guava wanted to
       | keep it modern, current, free of redundancy, and designed in the
       | best way that they knew how, they made a number of breaking
       | changes for which the project lead later apologized [1]. Most
       | changes wouldn't have been painful if accompanied by automated
       | refactoring.
       | 
       | You could allow the transformer to produce code that still needs
       | work from humans to finalize and compile. In that case it could
       | change as much as it's able and leave instructions at each call
       | site.
       | 
       | At my company we have bots that submit proposed code changes to
       | our codebase that need to be taken over as author by a human,
       | reviewed, sometimes lightly edited, and shipped, and they work
       | quite well. One bot finds unused code and submits diffs to remove
       | it if it's been in the code repository long enough. Another
       | detects when launch experiments have been at 100% all on one
       | treatment for a long period of time (meaning the feature has
       | launched) and submits code removing the experiment check. The
       | latter sometimes requires removing surrounding code that would
       | subsequently become unused from removing the experiment check.
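       | 
       | I can't share our bots, but the experiment-removal idea can be
       | sketched in a few lines of Python. This is an illustration only:
       | the experiment_enabled("name") check is an invented convention,
       | and the sketch does not do the follow-up cleanup of code that
       | becomes unused.

```python
import ast


class RemoveLaunchedExperiment(ast.NodeTransformer):
    """Replaces `if experiment_enabled("<name>"):` with its body."""

    def __init__(self, name: str):
        self.name = name

    def _is_check(self, test: ast.expr) -> bool:
        # Matches only the exact hypothetical form experiment_enabled("<name>").
        return (
            isinstance(test, ast.Call)
            and isinstance(test.func, ast.Name)
            and test.func.id == "experiment_enabled"
            and len(test.args) == 1
            and isinstance(test.args[0], ast.Constant)
            and test.args[0].value == self.name
        )

    def visit_If(self, node: ast.If):
        self.generic_visit(node)  # rewrite nested checks first
        if self._is_check(node.test):
            # Returning a list splices the statements in place of the `if`;
            # the `else` branch is dropped because it can no longer run.
            return node.body
        return node


def remove_experiment(source: str, name: str) -> str:
    """Return `source` with the launched experiment's check inlined."""
    tree = RemoveLaunchedExperiment(name).visit(ast.parse(source))
    return ast.unparse(tree)  # Python 3.9+; comments are not preserved
```

       | The output of remove_experiment would then go out as a proposed
       | diff for a human to review, the same way our bots work.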
       | 
       | These have provided meaningful value in helping keep the codebase
       | tidy and I look forward to more automation like this in the
       | future, including diff-aware compilers and refactoring tools.
       | 
       | [1]
       | https://www.reddit.com/r/java/comments/mr03mi/comment/guk848...
        
       ___________________________________________________________________
       (page generated 2021-09-27 23:01 UTC)