[HN Gopher] What if Git worked with programming languages?
___________________________________________________________________
What if Git worked with programming languages?
Author : LukeEF
Score : 138 points
Date : 2021-09-27 13:26 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| iso8859-1 wrote:
| This is on the http://lamdu.org roadmap.
| kazinator wrote:
| The blogger does not understand Git, fundamentally.
|
| Git does not work with text. It stores snapshots of
| artifacts.
|
| The diffs that you see when you use the various commands like git
| log -p are recovered from the snapshots, when those artifacts
| happen to be text files.
|
| Git absolutely works with texts when you connect it with external
| representations and tooling, such as when you "git format-patch"
| and then "git am" to import that; and the rebasing workflows
| obviously have textual merging with conflict resolution. Still,
| that seems like something that could be externalized. A language-
| specific three-way-diff tool can handle a merge by parsing all
| three pieces and working with ASTs. It's something that could be
| developed later, yet still work with your old commits.
|
| There is this: https://git-scm.com/docs/git-mergetool
|
| No idea how well it works.
| [deleted]
| olodus wrote:
| Ever since I learned about Git merge strategies and wrote a very
| basic one myself, I've been wanting to write one that syntactically
| understands a bit of the test framework code we use at work. It
| is super annoying when you copy a test because you want to vary a
| very specific case and git gets all confused about what code is
| and isn't the same.
|
| (yeah I know I should break out the copied part but who always
| has time for that)
| ghoward wrote:
| I'm actually working on a VCS based on this idea and on tracking
| changes to binary files based on their structure as well. (It
| turns out that the same techniques work for both.)
|
| AMA and please give me feedback!
| jakeinspace wrote:
| I think the only useful way to implement AST-level diff/merge for
| non-trivial codebases would require the compiler to provide the
| parsed AST, since per-file ASTs would lack a lot of context. You
| could also ask the user to provide a separate file or files that
| describe the code topology, but why bother when the compiler can
| spit out an AST itself? A diff tool which targets a few of the
| bigger build systems (CMake, Maven, Gradle) and compilers might
| work, and could worry about small build environments after
| gaining momentum.
| kitplummer wrote:
| I'm just a bit more "generally" curious. Is `git` being the
| _only_ DVCS a good thing? Not to say that `hg` or `darcs` don't
| exist, just that the hub on top of git has pushed us in a
| singular direction.
|
| I would like to see, at least academically, something more.
| tomphoolery wrote:
| The choice of DVCS tooling is ancillary to the success of
| GitHub. People learned Git so they could use GitHub, not the
| other way around. At least, this is the way I remember it.
|
| If someone comes along and builds the best forge software ever,
| but uses Mercurial instead of Git, I'll bet a lot of people
| would switch technologies at some point. Until then, I'd say
| most people use GitHub because GitHub works for them, and they
| use Git because that's how you interact with GitHub. They don't
| care about the ivory tower benefits of their particular DVCS
| tooling, all they care about is easily collaborating with their
| teammates.
|
| It would definitely be great if you could have a GitHub-like
| experience using Mercurial or Darcs, but so far I haven't seen
| anything close to that.
| aayjaychan wrote:
| Does GitLab count as a GitHub-like experience? There is a
| fork of GitLab called Heptapod that supports Mercurial.
|
| https://heptapod.net/
| tomxor wrote:
| > The fact that git works on lines of text [...] we could be
| looking at the alterations to the abstract syntax tree.
|
| Fundamentally git does not operate on text, it operates on files
| (content addressed SCM not a ledger of text diffs); diffs are
| generated upon request between arbitrary Merkle trees. So there
| is no need to implicate git in such a tool, it can be
| independent:
|
| GIT_EXTERNAL_DIFF
|
| When the environment variable GIT_EXTERNAL_DIFF is set, the
| program named by it is called to generate diffs, and Git does
| not use its builtin diff machinery. For a path that is added,
| removed, or modified, GIT_EXTERNAL_DIFF is called with 7
| parameters:
|
| path old-file old-hex old-mode new-file new-hex new-mode
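|
| A minimal sketch of a program that could be named by
| GIT_EXTERNAL_DIFF, assuming the files are Python source and
| Python 3.9+ is available. It consumes the seven parameters quoted
| above and diffs indented AST dumps instead of raw lines. The
| function names and the fallback behaviour are illustrative
| choices, not anything Git prescribes.

```python
#!/usr/bin/env python3
"""Sketch: an external diff program that compares Python files by AST."""
import ast
import difflib
import sys


def ast_lines(source):
    """Return an indented AST dump, split into lines."""
    # Comments and layout are not part of the AST, so edits that only
    # touch them disappear from this view. indent= needs Python 3.9+.
    return ast.dump(ast.parse(source), indent=2).splitlines()


def ast_diff(old_source, new_source, path="file.py"):
    """Unified diff of the two AST dumps; empty if the trees are equal."""
    return list(
        difflib.unified_diff(
            ast_lines(old_source),
            ast_lines(new_source),
            fromfile="a/" + path,
            tofile="b/" + path,
            lineterm="",
        )
    )


def read_text(filename):
    with open(filename, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def main(argv):
    # Git passes: path old-file old-hex old-mode new-file new-hex new-mode
    path, old_file, _hex1, _mode1, new_file, _hex2, _mode2 = argv[1:8]
    old_source, new_source = read_text(old_file), read_text(new_file)
    try:
        lines = ast_diff(old_source, new_source, path)
    except (SyntaxError, ValueError):
        # Not parseable as Python: fall back to an ordinary line diff.
        lines = list(
            difflib.unified_diff(
                old_source.splitlines(),
                new_source.splitlines(),
                fromfile="a/" + path,
                tofile="b/" + path,
                lineterm="",
            )
        )
    if lines:
        print("\n".join(lines))
    return 0


if __name__ == "__main__" and len(sys.argv) >= 8:
    sys.exit(main(sys.argv))
```

| Saved as an executable script (say astdiff.py, a hypothetical
| name), it would be tried with
| `GIT_EXTERNAL_DIFF=/path/to/astdiff.py git diff`. The stored
| commits are untouched; only the presentation changes.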
| tombert wrote:
| I would definitely support a Lisp-centric Git.
|
| Whenever I do Clojure, something that can get difficult when
| working with multiple people is how the
| parentheses/brackets/braces stack up, especially when everyone
| seems to have different opinions on how that works. As a result,
| if you're not careful, when there's a merge conflict you can have
| a ton of extra parentheses, which can be irritating to debug.
|
| Obviously this is at some level an issue inherent to Lisps (and
| to be clear, I love Lisps, and these small headaches are worth
| it), but I think problems like that could be reduced if our
| source controls were aware of the ASTs.
| timgilbert wrote:
| Yeah, I've long thought a diff tool that works on s-exprs
| instead of lines would be invaluable for Lisp programming. It
| doesn't seem like it would be too hard to write, either,
| although getting GitHub etc to use it seems like it would be
| its own challenge...
| CodeIsTheEnd wrote:
| I don't understand why GitHub hasn't solved the issue of diffs
| starting with a '}' (or ')' or 'end'). Just slide the diff over
| while it starts with a closing token! I suppose it's an artifact
| of the diffing algorithm, but aren't there better diffing
| algorithms, even built-in within git?
|
| This is by far the most obvious example of "git doesn't
| understand programming languages", but it also seems like the
| most straightforward to fix.
| nemetroid wrote:
| Git supports a few different diff algorithms. GitHub only seems
| to support the standard Myers algorithm, though:
| https://github.com/isaacs/github/issues/455
| mynegation wrote:
| It is because diff is syntax agnostic. You might be able to get
| away with this hack in some cases, but it complicates the
| algorithm and will break in other cases (how about nested
| brackets? Multiple brackets on one line?). Once you want to
| handle this properly you need a syntax-aware diff algorithm;
| some resources are linked in this discussion.
| lamontcg wrote:
| This has been posted before
| mangecoeur wrote:
| Interesting they mentioned Jupyter Notebooks but not NBDime
| https://github.com/jupyter/nbdime which is a Jupyter plugin
| specifically to address this problem. Without it, diffing
| notebooks is not feasible.
| skybrian wrote:
| If you're interested in this sort of thing you might want to look
| at Dolt (for sharing databases in a git-like way) and Pijul,
| which records diffs explicitly, rather than calculating them on
| the fly.
|
| I wonder if there might be a clever way to encode source code in
| a Dolt database? Maybe each function should be a record?
| LukeEF wrote:
| Author is CTO of TerminusDB (https://terminusdb.com/), which is
| a more graphy version of Dolt! Check it out.
| hardwaregeek wrote:
| I've wanted this for a while, but I will say there's some
| caveats. Sometimes I want to commit just as a "it's the end of
| the day, I want to leave, here's a code dump". I suppose you
| could have multiple tiers of code saving.
|
| I've also wondered about whether you could do code analysis with
| time as a dimension. If you can analyze the evolution of the code
| and pull old implementations, what can you do? Autocomplete is a
| good example, as it can pull previous patterns you've used. Maybe
| some way to tell the programmer "hold up, you've made this
| mistake before, don't do it again"? I'm not sure.
| inetknght wrote:
| > I want to commit just as a "it's the end of the day, I want
| to leave, here's a code dump"
|
| 1. git checkout -b eod-$(date -Id)
|
| 2. commit
|
| 3. leave
|
| 4. return
|
| 5. git checkout -
|
| 6. git merge - --no-commit
| hardwaregeek wrote:
| Yes, but if we're talking about some hypothetical tool that
| requires a valid AST, there might be a situation where I
| don't have a valid AST and want to save the code. Similarly,
| I had a job where we used pre-commit hooks that ran a linting
| script. I had to override the hook to commit which was
| slightly annoying at times.
| aardvark179 wrote:
| I've done quite a lot of work on version management on structured
| data (in my case this was for a version managed GIS database) and
| it's not an easy problem, and is likely even harder with
| something like an AST that is generated from a text file and so
| does not preserve the identity of nodes. I'm not saying that it's
| impossible, but it is more work and requires more tooling around
| it than people think, and it keeps coming up here and other
| places as a, "really good idea."
| cormacrelf wrote:
| Counterpoint: a quick google reveals diffsitter:
| https://github.com/afnanenayet/diffsitter
|
| The output could be a lot more compact, it could do better at
| adding context (in the same way https://github.com/romgrk/nvim-
| treesitter-context does, etc), but if you're interested in this
| it's really within reach, go help out.
|
| I wonder if you can use it for automerge yet.
| Jensson wrote:
| I don't see how this could ever work on evolving languages,
| different GIT versions would produce different commits and read
| commits differently based on the latest C++ standard. This would
| potentially lead to version control bugs where different GIT
| versions create different results from the same commit, that is
| horrible, version control needs to be 100% bug free in that
| regard.
|
| The only reasonable application would be to use a language AST
| parser to better identify relevant text diffs, but the commits
| still need to be stored as text.
| dboreham wrote:
| This doesn't really make sense, because in order to have those
| code changes compile correctly, there must be a corresponding
| commit to CI config that changes the compiler version or
| compiler switches for the new language version. The "semantic-
| diff-er" can also be driven by that commit such that it uses
| the correct language version.
| verdverm wrote:
| It's non trivial to support multiple versions of the same
| language on a host system. You have to account for dev
| machines and workflows as well.
|
| Docker can help with this, but often devs don't want to run a
| container to build their code. It's a hard habit to change.
|
| Now, consider how difficult it would be to get the differ to
| understand where to find compiler versions.
| shepherdjerred wrote:
| Commits could be stored as is, the difference would be that
| diffs are clearer when presented to a human.
| pkghost wrote:
| How is this different from any other problem that is already
| solved by version pegging?
| ozim wrote:
| I feel this is just an example of "worse is better": the whole
| proposition is interesting but totally impractical, and I would
| not like for GIT to go anywhere near that idea.
| raxxorrax wrote:
| Theoretically it might work, but I don't think I am too fond of
| the idea. I used git pull to completely waste my source and it
| would have been nice for git to have more intelligence here, but
| in the end I think some of its success lies in its simplicity.
|
| SVN isn't too bad and not too much of a difference to git if you
| use a central repository anyway. The main neat thing was to just
| have one hidden folder, not in every subdirectory.
|
| Git would also need the ability to transform from AST to source
| for every language. A bit unrealistic and there is no benefit to
| it. Could also do that with Assemblies and some meta info for the
| decompiler.
| [deleted]
| gumby wrote:
| Shared (concurrent) code editors might work better if their
| CRDT/OT model worked at that level.
|
| Not that I really want to edit code in a shared environment
| (editing documents that way is bad enough), but just musing...
| nerdponx wrote:
| Storing AST instead of source code is one of the goals of the
| very interesting Unison programming language:
| https://www.unisonweb.org/
|
| Part of what's nice about Git (and plain text in general) is that
| it's the lowest common denominator for a lot of things. This is
| why traditional Unix tools are built oriented around streams of
| bytes. Text is a low level carrier protocol; you can encode
| almost anything in it, but you need to agree on some kind of
| format.
|
| The good part is that you can use very very generic tools on
| almost arbitrary pieces of data. The bad part is that you might
| have to do a lot of parsing and re-parsing of the same data, and
| you have to contend with the dangers of underspecified formats.
|
| Git follows the Unix tradition in this regard. As a result, it is
| nearly universal in what it can store. You can use it to store
| pretty much anything, but you are now at the lowest common
| denominator of support for any particular data format.
|
| Git-for-ASTs will no longer have this universality property, but
| will gain a lot more power in the covered domain. This is a
| design tradeoff.
|
| One thing that's nice about Git is that you can specify arbitrary
| diff drivers with the "attributes" system. So even if the Git
| database is storing plain text, your diff driver can parse your
| source code into ASTs and present AST diffs to you when you run
| `git diff`. Perhaps more impressive, you can configure custom
| _merge_ drivers, so you can (theoretically) implement semantic
| merging of ASTs right inside Git.
|
| There are probably some fundamental limitations of this system,
| because the underlying data is still stored as blobs of bytes.
| But you can get pretty far as long as you don't mind parsing and
| re-parsing the same text over and over.
| ssivark wrote:
| Has this approach been tried? (Unison or otherwise...)
| mumblemumble wrote:
| I would maybe be interested in Git allowing you to plug in your
| own diff generators for different file types.
|
| But I would not want Git itself trying to understand the contents
| of files. That seems to me to be an idea that lives on a
| misconception of the "things programmers believe about names"
| variety. Not every file in source control is source code. Not
| every programming language's grammar maps to an abstract syntax
| tree. In some files, such as makefiles, the difference between
| tabs and spaces is semantically significant. Some languages (such
| as Fortran and Racket) have variable syntax. And so on and so
| forth.
|
| So I think that we _really_ don't want the source control system
| itself trying to get too smart about the contents of files. That
| will inevitably make the source control system less compatible
| with the various kinds of things you might want to put into
| source control. And it will also make the source control system a
| lot more complicated than it would otherwise be, in return for a
| largely theoretical payoff.
|
| But if we want to delegate the work of generating diffs off to
| other people, so that Git can allow for syntax or semantics-aware
| diffing without having to personally wade into that quagmire (and
| perhaps also allowing language communities to support multiple
| source control systems, a bit like how it works with LSP), that
| might be an interesting thing to experiment with.
| ffwacom wrote:
| > Not every programming language's grammar maps to an abstract
| syntax tree
|
| Are there some examples of this?
| mumblemumble wrote:
| Forth. You could certainly define a formal grammar for it and
| construct an AST, but it would be trivial and not very
| useful.
| simcop2387 wrote:
| Perl, this is because you can't actually properly parse Perl
| without also running Perl code at the same time.
| justinator wrote:
| https://metacpan.org/pod/PPI
| madmax96 wrote:
| I disagree. Many engineers want to refactor across a sequence
| of small PRs, for example. Small PRs are a good thing, because
| they're easier to understand. But today, Git makes this
| painful. Also, understanding how the meaning of code changes
| over time can help reduce bugs.
|
| The solution will have to be pluggable. But I think it is
| possible, and there are sane things to do (e.g. fall back to
| vanilla git) when there are missing plugs.
| saurik wrote:
| > I would maybe be interested in Git allowing you to plug in
| your own diff generators for different file types.
|
| This is already supported.
| lux wrote:
| A common example is UnityYAMLMerge for merging the Unity game
| engine's generated files.
|
| https://docs.unity3d.com/Manual/SmartMerge.html
|
| Configuring it to work with Git and others is a little ways
| down the page, but would apply the same for other diff tools.
| [deleted]
| franga2000 wrote:
| I looked this up and for anyone wondering, it's called
| "diff/merge drivers", but there are only a handful of them
| out there. Some highlights from a few minutes of searching:
|
| - MS Office: https://github.com/lcnittl/DMFO
| - SQLite: https://github.com/cannadayr/git-sqlite
| - Jupyter notebooks:
|   https://nbdime.readthedocs.io/en/latest/vcs.html#git-integra...
|
| One big caveat of this is that since git doesn't really store
| just a stack of diffs, despite the fact it presents itself as
| such to the user, a custom merge driver will not make your
| .git grow any less than it would normally.
| est31 wrote:
| > One big caveat of this is that since git doesn't really
| store just a stack of diffs, despite the fact it presents
| itself as such to the user, a custom merge driver will not
| make your .git grow any less than it would normally.
|
| Note that git does support using deltas for storage. But
| according to docs, custom diff drivers aren't used for
| those, instead it's an instruction-based format.
|
| https://git-scm.com/docs/pack-
| format#_deltified_representati...
| mabbo wrote:
| Reading this article, I feel as though the author doesn't deeply
| understand git.
|
| git works on blobs of data, not files, and not lines of text. It
| doesn't just happen to also work on binary files- that's all it
| works on.
|
| Now, if the author is suggesting that git-diff ought to have a
| language specific mode that parses changed files as ASTs to
| compare, now I'm interested. Let's do that. I'll help!
|
| But git does not need to change how it works for that to happen.
| Git does not even need git-diff to exist to serve its main
| purpose.
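|
| To make the "blobs, not lines" point concrete, here is a small
| sketch of how a blob's object ID is derived, assuming a
| repository that uses the default SHA-1 object format. The
| function name is illustrative. Nothing in the computation looks
| at line breaks or encodings.

```python
import hashlib


def git_blob_id(content):
    """Object ID Git assigns to a blob with these exact bytes (SHA-1 repos)."""
    # The hashed data is a short header, a NUL byte, then the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Text and binary content go through exactly the same path.
print(git_blob_id(b"x = 1\n"))
print(git_blob_id(bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])))
```

| The results can be compared against `git hash-object <file>`.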
| mbauman wrote:
| You can already choose different `diff` programs to use for
| particular filetypes. E.g., nbdime for Jupyter notebooks:
|
| https://nbdime.readthedocs.io/en/latest/vcs.html#git-integra...
| hardwaregeek wrote:
| I dunno I feel like you're focusing on a detail that's not
| particularly relevant. The author's main thrust is precisely
| what you described about parsing changed files as ASTs.
| nemetroid wrote:
| It isn't relevant to the author's vision of content-aware
| diffing, but it _is_ relevant to the author's complaints
| about how Git's (alleged) text-based-ness makes Git awkward
| to use with Jupyter notebooks. Has the author tried searching
| the web for "git diff jupyter"?
| munificent wrote:
| The author is likely using "git" to mean "the entire typical
| git user experience that git users spend time looking at".
|
| And, from that perspective, Git-the-UX definitely does work on
| line-oriented files.
| screye wrote:
| The git extension on VSCode is already pretty good at doing
| diffs on jupyter notebooks.
|
| I distinctly remember this not being a core feature of stock
| git and needing Jupytext to enable version control on
| notebooks. So, I feel like this sort of language specific stuff
| is already happening, but not in any unified product.
| munk-a wrote:
| There's also a historical angle here that's important to
| inspect - Git was designed to specifically be content agnostic.
| There are some predecessors in the SCM space (like VSS) that
| are specifically language aware and allow the checking out of
| line ranges (pinning them so that no one else will make a
| conflicting change specifically) and even entire functions -
| these systems can cause a lot of grief while failing to protect
| the logic they're specifically trying to protect. As the warts
| on SVN got more and more visible I think the general assumption
| was that the replacement SCM would come out of this code aware
| space - but it didn't and in retrospect we all dodged a huge
| bullet when that happened.
|
| I absolutely adore tooling around git that makes diffs more
| visible - one thing I absolutely gush over is anything that can
| detect and highlight function reordering... however, the core
| process of merging and rebasing and all that jazz - I don't
| think we're going to find anything automated that I'll ever
| trust when I'm not working on a ridiculously clean codebase -
| minor changes can have echo effects and when two people are
| coding in the same general area they need to be aware of what
| the other person is trying to do.
| tux3 wrote:
| Note that git does work with diffs a lot.
|
| Rebases and cherry-picks work by applying diffs, not by copying
| blobs. Auto-merging also needs to look at file content as text,
| you can't auto-merge a binary file with git.
|
| It's an often repeated fact that if you look inside Git, it
| doesn't work with diffs, it works with blobs. But if you look
| closer, it's often diffs again!
| arghwhat wrote:
| With cherry-picks (and thus rebase), you ask git to turn a
| commit into a patch, so it does just that.
|
| I would mostly consider auto merges (which I guess are bolted
| on) as the main area where git itself uses diffs during
| resolution and even then only as a suggested resolution (you
| get warned and need to confirm it when validating the merge).
|
| So no, it's blobs all the way down. Darcs and Pijul are patch
| based though.
| cryptonector wrote:
| Merges, rebases, cherry-picks, are all the same kind of
| thing. A merge is essentially a rebase that squashes all
| the commits being picked.
| tux3 wrote:
| It's true that git is blob based, as opposed to patch
| based, but it's not the full picture! In practice, git
| stores a lot more diffs (or rather, deltas) than it stores
| loose blobs. (And you probably know this already, but I
| feel it's still worth making explicit)
|
| This is necessary, because when a repo accumulates commits,
| it becomes a lot more efficient to store most of the
| objects as deltas instead of separate blobs. If Git didn't
| do this, it would have a lot of copies, and they would take
| a lot of space.
|
| So the fundamental model of git is truly based on blobs in
| theory, but in practice many or most git commands will
| operate on packfiles, and if you look in your .git object
| store, most likely you will have a few big packfiles
| containing most objects, and then a much smaller collection
| of loose blobs.
|
| All those diffs are what the "resolving deltas" progress
| indicator that people see when they do a big clone, fetch,
| or checkout is about =)
| ori_b wrote:
| > In practice, git stores a lot more diffs (or rather,
| deltas) than it stores loose blobs.
|
| The diffs it stores are not the diffs you see in git
| diff.
|
| They're rolling checksum based chunks. The data that the
| delta is computed against is picked with a heuristic
| ("sort by name and date, try the top 10, and use the
| smallest result"). And, in practice, the heuristic diffs
| the older files against the newer ones, rather than
| diffing in chronological order, so that getting recent
| data doesn't involve a lot of delta application.
|
| The git deltification is better thought of as a
| compression method than as diffing.
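|
| A toy model of a copy/insert delta may help show why this is
| closer to compression than to `git diff` output. This is a
| simplified sketch: the instruction representation below is made
| up for illustration and is not Git's binary pack encoding, and
| difflib stands in for Git's own matching heuristics.

```python
import difflib
from typing import List, Tuple, Union

# Either (offset, length) meaning "copy from the base object",
# or a bytes literal meaning "insert these bytes as-is".
Instruction = Union[Tuple[int, int], bytes]


def make_delta(base: bytes, target: bytes) -> List[Instruction]:
    """Describe target as copies out of base plus literal inserts."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    delta: List[Instruction] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append((i1, i2 - i1))
        elif tag in ("replace", "insert"):
            delta.append(target[j1:j2])
        # "delete" needs no instruction: those base bytes are never copied.
    return delta


def apply_delta(base: bytes, delta: List[Instruction]) -> bytes:
    """Rebuild the target object from the base and the instructions."""
    out = bytearray()
    for instruction in delta:
        if isinstance(instruction, bytes):
            out += instruction
        else:
            offset, length = instruction
            out += base[offset:offset + length]
    return bytes(out)


base = b"def f():\n    return 1\n"
target = b"def f():\n    return 2\n"
delta = make_delta(base, target)
print(delta)
print(apply_delta(base, delta) == target)
```

| Note that the instructions are byte ranges with no notion of
| lines, and the base can be any similar object, not necessarily
| the previous version of the same file.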
| dboreham wrote:
| Pretty sure OP does understand, and is proposing what you
| deduced.
|
| Incorporate some semantic understanding of the version
| controlled data into the VCS. Currently this work is
| subcontracted to humans.
| mabbo wrote:
| Maybe I'm misunderstanding. It's just lines like this:
|
| > The text-orientated design of git reflects...
|
| > The current version of git is also able to find differences
| in binary files.
|
| > if we were storing information as ASTs, rather than lines
| of text
|
| These all, to me, show a gap in the author's understanding of
| how git works. And that's okay- git is often easier to use
| than it is to understand.
|
| But if they had a better understanding, they could make their
| point far better. And without understanding, they won't be
| able to implement this idea.
| alkonaut wrote:
| I'd give anything just to get a few basic merge modes. For
| example "this file can treat two one line additions as
| unordered".
|
| So any shared append-only file (a change log, an enumeration,...)
| doesn't automatically conflict.
|
| Syntax aware diffing would be great too, but I'd take something
| much simpler. For syntax aware stuff I'd love something that
| could tell semantic changes from noise.
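|
| For the append-only case, Git ships a built-in `union` merge
| driver that can be selected per path in .gitattributes (for
| example `CHANGELOG.md merge=union`). A custom driver can go
| further. The sketch below assumes a driver registered as
| `python3 union_merge.py %O %A %B`; the script name and the
| "added lines are unordered" policy are illustrative assumptions.
| Git expects the driver to write the result to the %A file and
| exit 0 when the merge is clean.

```python
#!/usr/bin/env python3
"""Sketch: a merge driver that treats added lines as unordered."""
import sys


def union_merge(base, ours, theirs):
    """Keep our lines, then append lines only the other side added."""
    seen = set(base) | set(ours)
    merged = list(ours)
    for line in theirs:
        if line not in seen:
            merged.append(line)
            seen.add(line)
    # Limitation: lines deleted on the other side are not removed, and
    # identical lines are treated as one entry.
    return merged


def read_lines(filename):
    with open(filename, encoding="utf-8") as handle:
        return handle.read().splitlines()


def main(argv):
    # %O = common ancestor, %A = our version, %B = their version
    base_path, ours_path, theirs_path = argv[1:4]
    merged = union_merge(
        read_lines(base_path), read_lines(ours_path), read_lines(theirs_path)
    )
    with open(ours_path, "w", encoding="utf-8") as handle:
        handle.write("".join(line + "\n" for line in merged))
    return 0  # 0 tells Git the merge was clean


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

| It would be wired up with a `[merge "appendonly"]` section whose
| `driver` key holds the command above, plus a .gitattributes line
| such as `CHANGELOG.md merge=appendonly` ("appendonly" being a
| made-up driver name).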
| maweki wrote:
| Working on the AST is quite an interesting idea, until your
| comments aren't in the AST and you want to commit a syntax error
| of work in progress.
|
| Not to mention changing ASTs (while maintaining concrete syntax)
| in different versions of the language.
| ufo wrote:
| I'm trying to remember the citation, but I remember seeing a
| presentation once from someone who studied this and they said
| that the thing that worked best was a hybrid approach: use
| structured diff at the top level of the program (modules /
| methods) but use line-based for statements and expressions.
| According to them, the structured diff can give unintuitive
| results if applied at the lowest syntactic levels.
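|
| A rough sketch of that hybrid idea, assuming Python source and
| Python 3.8+: match top-level functions and classes structurally
| by name, then fall back to an ordinary line diff inside each one
| that changed. The function names are illustrative, and this is
| not the approach from the presentation mentioned above, just one
| way to read it.

```python
import ast
import difflib


def top_level_defs(source):
    """Map each top-level function or class name to its source text."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in ast.parse(source).body
        if isinstance(node, kinds)
    }


def hybrid_diff(old_source, new_source):
    """Structural matching at the top level, line diffs below it."""
    old_defs = top_level_defs(old_source)
    new_defs = top_level_defs(new_source)
    report = {}
    for name in sorted(old_defs.keys() | new_defs.keys()):
        if name not in new_defs:
            report[name] = ["removed"]
        elif name not in old_defs:
            report[name] = ["added"]
        elif old_defs[name] != new_defs[name]:
            report[name] = list(
                difflib.unified_diff(
                    old_defs[name].splitlines(),
                    new_defs[name].splitlines(),
                    lineterm="",
                    n=0,
                )
            )
    return report


old = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new = "def b():\n    return 2\n\ndef a():\n    return 10\n\ndef c():\n    pass\n"
for name, lines in hybrid_diff(old, new).items():
    print(name, lines)
```

| In the example, `b` was only moved, so it does not show up at
| all, while the edit to `a` is reported as a plain line change.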
| Karellen wrote:
| `git` generally doesn't work with lines of text. Mostly it works
| with opaque file blobs and directory trees.
|
| `git diff` and `git merge` work with lines of text _by default_ -
| but they don't have to. You can supply your own `diff` and
| `merge` tools with the `difftool.*` and `mergetool.*` config
| options, try them out with `git-difftool` and `git-mergetool`
| commands, and set the default with the `diff.tool` and `merge.tool`
| config options.
|
| If someone wanted to create AST-based diff and merge tools for a
| given language, they could be plugged right into the existing
| `git` infrastructure and it would work with them absolutely fine.
| kapep wrote:
| > If someone wanted to create AST-based diff and merge tools
| for a given language, they could be plugged right into the
| existing `git` infrastructure and it would work with them
| absolutely fine.
|
| There's a lot of tooling in the Eclipse modelling ecosystem which
| could be easily used for this. Storing XML-based models in git
| is no problem and there's tooling for diffing and merging
| models via a GUI or programmatically. Combined with the fact
| that xtext DSLs use EMF models to represent ASTs, it wouldn't
| be too hard to glue together an AST-based diff/merge tool for
| an xtext DSL.
| bspammer wrote:
| This feature is useful in so many different places. I use it to
| diff small encrypted files in my repo - just add `gpg -d` as a
| diff configuration and now I can use git log, diff etc in a
| meaningful way with binary files.
|
| I've heard of people using it with pdfs as well - a pdf to html
| converter lets you get a good idea of what changed in the
| document.
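|
| The mechanism behind this is Git's `textconv` setting: a diff
| driver names a command that receives a file path and prints a
| text rendering, and Git diffs those renderings. As a sketch of
| the same trick for another format discussed in this thread, the
| script below renders a Jupyter notebook as its cell sources
| only. It assumes the version 4 notebook JSON layout ("cells",
| "cell_type", "source"); the script and driver names are made up.

```python
#!/usr/bin/env python3
"""Sketch: a textconv filter that prints only a notebook's cell sources."""
import json
import sys


def notebook_to_text(raw):
    """Render notebook JSON as plain text, dropping outputs and metadata."""
    notebook = json.loads(raw)
    chunks = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):
            # The format allows source to be stored as a list of lines.
            source = "".join(source)
        kind = cell.get("cell_type", "unknown")
        chunks.append(f"# --- cell {index} ({kind}) ---\n{source}\n")
    return "".join(chunks)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Git invokes the textconv command with the file path as its argument.
    with open(sys.argv[1], encoding="utf-8") as handle:
        sys.stdout.write(notebook_to_text(handle.read()))
```

| Hooking it up would take a .gitattributes line like
| `*.ipynb diff=nbsource` and
| `git config diff.nbsource.textconv "python3 nb_textconv.py"`.
| This only affects what `git diff` and `git log -p` display;
| nbdime, mentioned elsewhere in the thread, is the fuller tool.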
| colonwqbang wrote:
| Yes, I think this article is coming at it from the wrong end.
| Git is hardly the problem here, nor is it going to provide the
| solution.
|
| The problem seems to be that we are lacking the format and the
| toolchain to manipulate it, and that is not the fault of git.
|
| What is the state of the art in this area? Does somebody know
| of a viable format and toolchain, or any interesting projects
| looking to build them?
| tyleo wrote:
| I believe that semantic merge does something like this:
| https://www.semanticmerge.com/
| dTal wrote:
| What if generating a diff is nontrivial? Say you rename an
| identifier. That might be a single command in an IDE. A
| sufficiently high-level "diff" format could easily capture that
| intent. But working backwards from hundreds of touched lines
| across many files to deduce that single semantic edit is not
| trivial. Git assumes that arbitrary diffs can be deduced from
| "before" and "after" files, but this isn't the case - it may be
| that you'd rather generate the new file from the diff!
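|
| As a sketch of "generate the new file from the diff", the edit
| can be stored as a small record and replayed on the AST. This
| assumes Python source and Python 3.9+ (for `ast.unparse`). The
| edit format and names are hypothetical, and the rename is naive:
| it ignores scoping, attributes, and function names.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename plain variable and parameter names (no scope analysis)."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also rewrite names inside annotations
        if node.arg == self.old:
            node.arg = self.new
        return node


def apply_edit(source, edit):
    """Replay one semantic edit, e.g. ("rename", "w", "width")."""
    kind, old, new = edit
    if kind != "rename":
        raise ValueError("unsupported edit kind: " + kind)
    tree = RenameVariable(old, new).visit(ast.parse(source))
    # unparse regenerates source text; comments and layout are lost.
    return ast.unparse(tree)


before = "def area(w, h):\n    return w * h  # width times height\n"
print(apply_edit(before, ("rename", "w", "width")))
```

| The lost comment in the output is the same problem raised
| elsewhere in the thread: a pure AST does not carry everything a
| source file does.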
| rileymat2 wrote:
| > `git` generally doesn't work with lines of text. Mostly it
| works with opaque file blobs and directory trees.
|
| I am not sure this is true.
|
| In the past it gave me problems with line ending normalization
| between windows/mac/linux, in and out. In those cases it
| definitely had a lines of text view of things.
| Ajedi32 wrote:
| It _is_ generally true, but yes; automatic line ending
| conversion is an exception. You can turn it off with `git
| config --global core.autocrlf false`, though be aware that
| can cause issues if you have developers on different
| operating systems creating and committing files with
| different line endings.
| ajanuary wrote:
| Git is delegating to the diff to work out how to merge the
| blobs. It's the diff that is having trouble with the line
| endings.
| rileymat2 wrote:
| No, in the past, git would _change_ line endings.
|
| A check in on one machine and a check out on another
| machine would give different files.
| indentit wrote:
| I guess this is what diffsitter[1] is for.
|
| [1]: https://news.ycombinator.com/item?id=27875333
| kmeisthax wrote:
| Indeed. The Composer merge driver is critical for being able to
| work with modern PHP frameworks without tearing your hair out
| on every merge.
|
| Merge drivers are Git's most powerful and least known feature,
| and I really wish they were more common.
| auscompgeek wrote:
| Note that you can specify a custom merge driver for different
| file types using a combination of gitattributes and git-config:
| https://git-scm.com/book/en/v2/Customizing-Git-Git-Attribute...
| Smaug123 wrote:
| I'm surprised they didn't mention Unison
| (https://www.unisonweb.org/), whose big idea is an immutable
| content-addressable store of ASTs. I really hope it changes
| everything.
| renox wrote:
| Except that Unison created its own language which makes pretty
| sure that they are doomed to fail.. I don't know if there is a
| technical reason for the new language or if it's NIH syndrome.
| atonalfreerider wrote:
| Self-promote: Primitive does AST diffing and represents the
| changes graphically
|
| primitive.io
| jrm4 wrote:
| What if Programming Languages worked with Lines of Text?
| afavour wrote:
| I do kind of love the idea of Git using ASTs instead of source
| code. It makes a ton of sense.
|
| Even just in the immediate term I wish I could make Git(hub)
| tabs/2 spaces/4 spaces/whatever agnostic. Seems crazy to me that
| in 2021 we still have to make opinionated choices across orgs
| about what to use... why can't we pull the code down, view it in
| whatever setup we want, then commit a normalized version?
|
| _[whispers] this is actually something tabs allow you to do
| natively by setting custom tab widths in text editors but I've
| given up trying to sell people on tabs at this point and just
| want to be able to do my own thing_
| Anon_troll wrote:
| The whitespace and formatting are not significant to the
| compiler, but they can provide a lot of information to the
| reader of the code.
|
| You can often see where the writer put the most effort and
| thought by just seeing how they wrote it. This can help
| analyzing a codebase considerably.
|
| If everything is normalized, you lose those valuable cues.
| geofft wrote:
| One of the practical issues here is, if your code fails to
| compile in CI with an error like
|
|     /home/ci/src/foo.c:123:45: error: use of undeclared identifier 'a'
|
| or
|
|     /home/ci/src/bar.py:50: syntax error in type comment
|
| or crashes in production with an error like
|
|     java.lang.NullPointerException
|         at com.example.Baz.doThings(Baz.java:1337)
|
| you really want to be able to find line 123 column 45, line 50,
| or line 1337 in your editor, and have that be the _same_ line
| as what your CI compiled and deployed.
|
| On its own, tabs vs. spaces only affects columns, and you can
| probably figure things out without columns (although it's a
| shame to lose it). But different tab sizes affect how long your
| lines are, and line wrapping is a thing that people care about
| at least as much as tabs vs. spaces (people with different size
| monitor or fonts will easily see too-long or too-short lines on
| their display; if your spaces are equivalent to the tab stop,
| the distinction is literally invisible). And once you start
| rewrapping lines, everyone's line numbers are different.
|
| I think it's possible to solve this by using some sort of AST-
| based index into the file and teaching IDEs to let you seek
| based on that, but it's suddenly a more complex problem.
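A simpler stand-in for an AST-based index is to address a position by how many non-whitespace characters precede it, which is unaffected by indentation or re-wrapping. The sketch below translates a line number between the checked-in layout and a local layout on that basis. It assumes the two texts differ only in whitespace between tokens, and it would break if whitespace inside string literals changed. All function names are illustrative.

```python
def line_to_offset(text, line_no):
    """Count the non-whitespace characters before 1-based line `line_no`."""
    lines = text.splitlines()
    return sum(len("".join(line.split())) for line in lines[:line_no - 1])


def offset_to_line(text, offset):
    """Return the 1-based line holding the non-whitespace char at `offset`."""
    seen = 0
    for number, line in enumerate(text.splitlines(), start=1):
        seen += len("".join(line.split()))
        if seen > offset:
            return number
    raise ValueError("offset is past the end of the text")


def translate_line(from_text, to_text, line_no):
    """Map a line number in one layout to the matching line in another."""
    return offset_to_line(to_text, line_to_offset(from_text, line_no))


# Layout that CI compiled (wrapped arguments, spaces for indentation).
checked_in = "int f(int a,\n      int b)\n{\n    return a + b;\n}\n"
# The same code as a developer views it locally (one-line signature, tabs).
local = "int f(int a, int b)\n{\n\treturn a + b;\n}\n"

# An error reported at checked-in line 4 is on local line 3.
print(translate_line(checked_in, local, 4))
```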
| ratww wrote:
| This is already a very common problem with a solution:
| transpiled JS already needs source maps to display errors
| correctly.
| geofft wrote:
| No, I don't think that's the same problem / the same
| solution. A source map translates between a layout checked
| into the code and a format generated at build time. I'm
| talking about translating between a layout in a developer's
| local workspace and the layout checked into the code.
|
| Since the developer can choose whatever formatting options
| they want, there isn't a single source map that can be
| referenced in the compiled version of the code, in
| backtraces, etc. So the transformation cannot be done at the
| point the error is displayed (compiler output or backtrace
| output); it has to be done in the context of the
| developer's local workspace.
|
| I think source maps could probably be inspiration for
| solving this problem, but I don't think they would work
| directly - and even if they did, the real problem here is
| not designing a solution, it's getting everyone's IDEs to
| work properly with it. Source maps work largely because the
| major browsers know how to deal with source maps in JS.
| You'd have to extend this to all the other ecosystems, at
| the very least.
| pbiggar wrote:
| fwiw, this is what we do in Dark [1]. We store (serialized)
| ASTs, then we pretty print them in the editor. This
| converts the AST into tokens that you see on your screen,
| complete with configurable* indentation, line-length, etc. Code
| would be displayed according to your config* and the same code
| displayed differently to a different developer looking at the
| same code.
|
| [1] https://darklang.com
|
| * I haven't actually enabled users to configure this, but it's
| just some variables called `indent` and `lineLength` in the
| code
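As a rough illustration of storing a serialized tree and printing it per viewer, here is a toy sketch. The JSON node shape (`kind`, `test`, `body`, `code`) is invented for this example and has nothing to do with Dark's actual format. Only the `indent` setting mirrors the variable mentioned above.

```python
import json

# What would sit in storage: a serialized tree with no layout information.
STORED = json.dumps({
    "kind": "if", "test": "x > 0", "body": [
        {"kind": "stmt", "code": "log(x)"},
        {"kind": "if", "test": "x > 9", "body": [
            {"kind": "stmt", "code": "warn(x)"},
        ]},
    ],
})


def render(node, indent=2, depth=0):
    """Pretty-print a tiny statement tree using the viewer's indent width."""
    pad = " " * (indent * depth)
    if node["kind"] == "if":
        lines = [f"{pad}if {node['test']}:"]
        for child in node["body"]:
            lines.extend(render(child, indent, depth + 1))
        return lines
    return [f"{pad}{node['code']}"]  # leaf statement, kept as plain text


tree = json.loads(STORED)
print("\n".join(render(tree, indent=2)))  # one developer's view
print("\n".join(render(tree, indent=8)))  # another developer's view
```

Both views come from the same stored bytes, so neither developer's preference ever shows up in a diff.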
| enriquto wrote:
| _[whispers] don't give up! There's quite a bunch of us. Our
| day will come! Long live glorious tabs!_
| silon42 wrote:
| I'm fine with using tabs, but my tab width will be set to
| 8... be sure to obey line length limits with that in mind.
| jcelerier wrote:
| anything beyond 2 is heresy, and some days I'm tempted to
| go down to 1
| enriquto wrote:
| Heretic! From the book of Linus [0], chapter one:
|
| > Tabs are 8 characters, and thus indentations are also 8
| characters. There are heretic movements that try to make
| indentations 4 (or even 2!) characters deep, and that is
| akin to trying to define the value of PI to be 3.
|
| [0]
| https://www.kernel.org/doc/html/latest/process/coding-
| style....
| klyrs wrote:
| Cursed April fools update: tabs are now π spaces wide.
| jcelerier wrote:
| of course PI isn't 3, it's 1 (from a distance)
| a1369209993 wrote:
| Only if you're a cosmologist.[0]
|
| 0: http://xkcd.com/2205/
| giomasce wrote:
| Math trivia: there are cases in which it is sensible, in
| sufficiently advanced mathematics, to define pi as 3 (or
| whatever other number).
|
| I don't use tabs, but I'd say that the biggest
| advantage of using tabs is that everybody can configure
| their own editor to make them as large as they wish.
| mellavora wrote:
| Three shall be the number of the counting and the number
| of the counting shall be three. Four shalt thou not
| count, neither shalt thou count two, excepting that thou
| then proceedeth to three. Five is right out.
| [deleted]
| encryptluks2 wrote:
| There are no line limits. That is what word wrapping is
| for.
| convolvatron wrote:
| having presentation be flexible and different from the
| underlying model is a great idea for code
|
| but admit it, tabs are fragile and a pretty weak implementation
| wutbrodo wrote:
| > admit it, tabs are fragile and a pretty weak implementation
|
| Could you elaborate? I don't have a personal opinion here and
| have only worked in orgs that require spaces, but I'm not
| familiar with the criticisms of tabs.
| convolvatron wrote:
| it only works for initial indentation, so people that like
| columnar layouts are kinda screwed. auto-tabbing tools will
| take n-spaces and turn them into a tab, which screws up
| stuff.
|
| let's just take the whole idea one step further and either
| use tools that reformat based on agreed upon styles
| (meaning a developer could reasonably take the source,
| project it into their preferred style and project it back
| out again)
|
| or store the canonical version as structured data in a
| database and always project it into some text for viewing.
|
| broader adoption of formatters has drastically reduced the
| number of pointless and emotional formatting arguments I've
| gotten into. let's push that further.
| gregmac wrote:
| For me the problem happens as soon as tabs are used for
| alignment, instead of just indent. The benefit of tabs is
| custom tabstop. If anyone does anything that undermines
| that benefit, you might as well use spaces to avoid all the
| problems caused.
|
| Consider the following code:
|
|     if (x) {
|         SomeMethod(parameter1,
|                    parameter2,
|                    parameter3);
|     }
|
| If done "properly", it is:
|
|     if (x) {
|     <tab>SomeMethod(parameter1,
|     <tab><spaces...>parameter2,
|     <tab><spaces...>parameter3);
|     }
|
| What I often see, that totally breaks the entire point of
| tabs:
|
|     if (x) {
|     <tab>SomeMethod(parameter1,
|     <tab><tab><tab><space><space>parameter2,
|     <tab><tab><tab><space><space>parameter3);
|     }
|
| The same thing happens if you are trying to align
| table-style code:
|
|     var badMixedTypeArrayExample = [
|         [ "some",         true,  128,      x                ],
|         [ "long strings", true,  8,        someLongVariable ],
|         [ "and",          false, 16384,    x                ],
|         [ "short",        true,  12345678, anotherVariable  ],
|     ];
|
| If tabs are used between fields, it will look like a hot
| mess to anyone with a different tabstop than the author.
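The effect is easy to reproduce with `str.expandtabs`, which renders tabs at a chosen width. In this sketch `GOOD` uses one tab for indentation plus spaces for alignment, while `BAD` pads the continuation line with three tabs and three spaces, which lines up only for an author whose tabstop is 4. That tabstop of 4 is an assumption made for the example.

```python
# Tab for the indent level, then 11 spaces to line up under "SomeMethod(".
GOOD = ["\tSomeMethod(parameter1,",
        "\t" + " " * 11 + "parameter2,"]

# Tabs used for alignment too: correct only when a tab is 4 columns wide.
BAD = ["\tSomeMethod(parameter1,",
       "\t\t\t" + " " * 3 + "parameter2,"]


def column_of(line, text, tabstop):
    """Column at which `text` appears once tabs are rendered at `tabstop`."""
    return line.expandtabs(tabstop).index(text)


def aligned(lines, tabstop):
    """True if both arguments start in the same column at this tabstop."""
    first, second = lines
    return (column_of(first, "parameter1", tabstop)
            == column_of(second, "parameter2", tabstop))


for tabstop in (2, 4, 8):
    print(tabstop, aligned(GOOD, tabstop), aligned(BAD, tabstop))
```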
| zkldi wrote:
|     if (x) {
|         SomeMethod(parameter1,
|                    parameter2,
|                    parameter3);
|     }
|
| You simply should not write this code. It's unclear, and
| performs nonsense indentation. You could do:
|
|     if (x) {
|         SomeMethod(
|             parameter1,
|             parameter2,
|             parameter3
|         );
|     }
|
| If you need your function to use line breaks.
| gregmac wrote:
| I totally agree; I personally hate this style of code!
| However, people still write it (in the same way they
| screw up tabs+spaces), and in some code bases it's "the
| style" they use.
|
| I've also seen a lot of SQL and LINQ (C#) written in this
| way, as well as things like:
|
|     var longString = "Line 1\n" +
|                      "Line 2\n" +
|                      "Line 3";
| zkldi wrote:
| Personally, I'd go as far to say that `alignment` is an
| anti-pattern.
|
| Setting up an automatic formatter and using tabs is
| personally the best of all worlds. Space-like alignments
| like
|
|     var someReallyLongVar = 5;
|     var x                 = 10;
|
| are the worst!
| cool_scatter wrote:
| Which is the reason for the very common stance "tabs for
| indentation, spaces for alignment".
| nybble41 wrote:
| Which is easy to say, but hard to make everyone do
| correctly. First you need to ensure that everyone uses an
| editor with a "visible whitespace" option, and turns it
| on, so they can see whether they have the right
| whitespace. Then you get to spend precious programming
| time turning one kind of whitespace into another since
| most editors will get it wrong when they auto-indent.
|
| Either use spaces everywhere so you have total control
| over the layout or forego alignment (other than block
| indentation). Mixing tabs and spaces is a path to
| madness.
| PaulDavisThe1st wrote:
| This is part of the reason why editors for programmers
| and editors for general text editing are not the right
| thing.
|
| I have F11 in emacs bound to whitespace-cleanup, which
| takes care of it all for me. And supertabs mode in
| general works just the way it should with tabs-
| indent/spaces-align.
|
| Then there's also clang-format, possibly used as a post-receive
| hook in git (and some other VCS), which makes irrelevant what
| the programmer's editor did, mostly.
| jcranmer wrote:
| The main supposed advantage of tabs is that everyone can
| set their own custom preferred tab-width and be done with
| it, but this advantage doesn't actually play out in
| practice:
|
| * There's usually a maximum line length restriction as
| well, so you need to know what the tab-width is to figure
| out if a line needs to be broken into multiple lines.
|
| * There are also cases where you need exact-column
| alignment, even across multiple indent widths. A simple
| example is as follows:
|
|     module whatever {
|         fn long_function_name_whatever(arg1: type1,
|                                        arg2: type2,
|                                        arg3: type3,
|                                        etc: etc4,
|                                        do: you,
|                                        get: my,
|                                        point: now);
|     }
|
| So in practice, tab width for a project is actually fixed
| to a particular value. And then you discover that wrong-
| tab-width code becomes quite annoying to read. I hate
| reading GNU style guide code, which uses 8-space-tabs but
| indent-width of 4, because the indenting is unreadable
| unless I mess with the tab spacing for an individual file
| I'm reading.
| Asraelite wrote:
| Alignment can just be done with spaces. This can then be
| enforced by a style checker.
|
| But the maximum line length problem is real. I would be
| 100% for tabs if it wasn't for this issue and imo it's
| the only real criticism you can make that doesn't have a
| good solution.
| smolder wrote:
| The good solution to the line length problem is to not be
| strict about them. My line length rule is usually "stay
| roughly within 100 spaces, 120 is too long." If you are
| seriously undermined by lines being too long, then your
| text editor choice/setup might be worth revisiting.
| zerocrates wrote:
| The alignment in the comment above doesn't work with
| tabs: your initial line is going to be tab-indented,
| which means if you want those next lines to align with it
| you don't have any options for it to work.
|
| Now, I tend to find it's better to just avoid that kind
| of alignment in your code style completely (just push the
| first arg to a new line so you're not space-ing
| everything out a mile to match the function call open
| paren) but if that's your style then you can't really do
| it with actually variable tab widths.
| thefreeman wrote:
| If you append `?w=1` to the diff view URL on a pull request it
| makes it whitespace agnostic just FYI
| williamdclt wrote:
| It's not that you're going too far, it's that you're not going
| far enough!
|
| It's not a Git question, it's a programming language question.
| There's no reason source code needs to be stored as plain
| text[1]! Editors show it as text, we edit it as text, but why
| wouldn't it be _stored_ as an AST? Not only does formatting
| become an editor concern, but code could even be edited as a
| tree, as a graph, as whatever you want.
|
| [1] - well, actually there's plenty of reasons: chiefly because
| plaintext is very interoperable
| jerf wrote:
| "but why wouldn't it be _stored_ as an AST?"
|
| It profoundly _is_. You can't store "an AST". You can only
| store a serialization of it. The official language grammar is
| a serialization of the AST custom crafted for that language.
| It is as much an "AST" as any other serialization would be;
| all such alternative representations would all produce
| isomorphic memory representations if parsed from a proper
| library.
|
| At a high level it may sound useful to try to then provide a
| cross-language AST representation, but it's one of those
| things that sounds great at a high level but as soon as you
| actually tried to implement it for, say, Python and C++,
| you'd rapidly discover that in practice there's not as much
| opportunity for "generic AST operations" as you may think.
|
| The problem isn't that it isn't "stored as an AST" but that
| $YOUR_LANGUAGE apparently doesn't have good libraries or
| mechanisms for getting at it. Go, for instance, ships with
| the relevant bits of the compiler exposed, and as a result
| there are tons of tools that operate on Go code as ASTs and
| not textually, because it's readily available and supported
| by the core language team. I use this only as an example I
| know personally, there are other languages that have similar
| sorts of support as well.
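Python is another language that ships its parser in the standard library, so the same kind of tooling is a few lines away. As a minimal sketch, the function below finds calls to a given function by walking the tree, which skips mentions in comments and string literals that a textual search would report. The sample code and the name `connect` are made up.

```python
import ast


def find_calls(source, name):
    """Return the 1-based line numbers where function `name` is called."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name):
            hits.append(node.lineno)
    return sorted(hits)


code = '''\
# connect() is mentioned in this comment
msg = "call connect() later"
conn = connect(host)
retry(lambda: connect(backup))
'''

# Only the two real call sites are reported, not the comment or the string.
print(find_calls(code, "connect"))
```

This only matches plain names. Method calls such as `client.connect()` would need an extra check for `ast.Attribute`.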
| vlovich123 wrote:
| I feel like you're picking a strawman here. The AST
| serialization everyone is implying is one where you don't
| need to tokenize/lex but can just load it directly &
| manipulate it (i.e. implying the on-disk version is a valid
| AST or one whose validity can be trivially validated
| without needing to have the entire language syntax &
| grammar). First, that makes the compiler _much_ faster
| because tokenization/lexing is moved to the "save" phase
| which happens infrequently at human scale vs the
| compile/processing phase which happens in an automated
| fashion where the overhead can be notable. Additionally, if
| you mmap the AST from disk into memory, you can use finer-
| grained caching to memoize expensive analysis that happens
| for faster compiles of code that's only changed slightly
| (e.g. changing whitespace/comments wouldn't recompile
| anything).
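The caching half of this argument can be approximated even when the source stays as text: key the cache on a hash of the parsed tree instead of the file bytes. This is a minimal sketch using Python's `ast` module as the stand-in parser, which is an assumption made for illustration and not the on-disk AST format described above.

```python
import ast
import hashlib


def ast_fingerprint(source):
    """Hash the parsed tree, so layout and comments do not affect the key."""
    tree = ast.parse(source)
    dumped = ast.dump(tree, include_attributes=False)  # omit line/column info
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


original = "def total(xs):\n    return sum(xs)\n"
reformatted = "def total( xs ):\n\n    # add them up\n    return sum( xs )\n"
edited = "def total(xs):\n    return sum(xs) + 1\n"

# Same key after a whitespace/comment change: a cached result could be reused.
print(ast_fingerprint(original) == ast_fingerprint(reformatted))
# Different key after a real change: the cached result must be rebuilt.
print(ast_fingerprint(original) == ast_fingerprint(edited))
```

Docstrings are part of the tree, so editing one changes the key even though the generated code may not change.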
|
| More importantly for advocates, it avoids needing to ship
| the deserialization library and makes tooling simpler.
| That's really why the idea of a simple AST format is so
| attractive. Compiler frontends are typically very
| tightly coupled to the underlying middle & back end.
| There's some work in some languages to decouple this (e.g.
| LSPs & Idea's failable parsing approach), but the efforts
| are still very immature & it's still not clear to me that
| it's worth it (see the last paragraph).
|
| The main underlying challenge with making sure the on-disk
| contents is well-formed according to the syntax rules is
| that frequently you want to pause work at an intermediate
| stage. This means you either have to make sure that
| whatever state the user saves is a valid AST via editor
| tricks (although I think this also typically means you have
| to design the language around it), you reject saves, every
| tooling library has to be capable of parsing malformed
| ASTs, or you save a dirty transformation to apply to the
| last known saved version so you can have the user resume
| editing but otherwise tooling uses the "last known good"
| version. That's the real challenge with having a serialized
| version that's amenable to 3p tooling for interop.
|
| Finally, all the "serialize the AST" solutions ignore the
| problem of wanting to grep the codebase. This means you
| need to change out several decades of line-oriented
| manipulation tools in favor of new ones that are AST-based
| & likely more complicated to write/maintain as compared
| with one-line regular expressions. At least I've yet to see
| any AST manipulation libraries that aren't drastically
| different from existing text manipulation tools if clang-
| tidy and Rust macros are any indication about what good
| solutions to the problem look like today.
|
| I think eventually we'll get AST serialization, but I think
| it will be packaged into an entirely new language (like
| Rust did with ownership) that also considers the tooling
| aspect end-to-end rather than as a retrofit into existing
| languages. Once that's successful, then I think we'll see
| retrofits because the space will have been better explored
| & other languages will benefit from the R&D into what a
| successful path would look like.
| seiferteric wrote:
| > This means you need to change out several decades of
| line-oriented manipulation tools in favor of new ones
| that are AST-based
|
| I wonder if a generic binary->text tool/library could
| solve this. Grep could check the file mime type then call
| the tool to convert from the binary format to the text
| format if available. I could see this being useful for a
| lot of binary formats.
| res0nat0r wrote:
| I think everyone may be interested in:
| https://github.com/afnanenayet/diffsitter
|
| Github having an option to have their PR GUI use an AST
| diff like this could be a fun and useful option.
| habitue wrote:
| > that makes the compiler much faster because
| tokenization/lexing is moved to the "save" phase which
| happens infrequently at human scale
|
| For dynamic languages like Ruby or Python, storing a pre-
| parsed representation makes it a little faster. But for
| compiled languages, lexing and parsing tend to be swamped
| by the codegen step
| vlovich123 wrote:
| If you think about it more broadly where you can memoize
| expensive results of AST -> code gen transformation or
| AST -> AST simplification, then this will help
| significantly for codegen, especially for incremental
| builds but also clean builds if you have your CI cluster
| sharing build cache information with your local devs.
|
| Also, for a language like Rust, I'm not sure that there
| isn't a significant amount of time spent validating
| ownership & doing type inference. These are the kinds of
| analysis you could save into the AST & thus save a
| significant amount of build time when talking about large
| projects. I agree for smaller projects a lot of these
| optimizations are probably unimportant.
| dotancohen wrote:
| And this is why I love Python. It forces people to use the
| same coding standard in some regards, and it forces people
| to indent properly.
|
| I really don't care anymore what that standard might be
| (well, ok, I do prefer tabs) but I do care that it be
| consistent. And I do DEMAND that proper nested indentation
| be respected. Source code is meant to be human readable.
| cratermoon wrote:
| Python is not unique nor innovative in that respect. Even
| FORTRAN and COBOL programs from the early days had very
| strict rules about indentation and blocks.
|
| The thing is, there's no reason we have to _store_ the
| code like text. Even the punch card got this right: the
| program wasn't stored as text, it was stored as physical
| holes in paper. A very experienced programmer could often
| look at a card with just the holes and have a rough idea
| what it encoded, provided they knew if it was EBCDIC or
| ASCII or whatever, but the computer didn't care. The
| printed representation across the top line of the card
| was just that: a representation.
| dotancohen wrote:
| I didn't claim that Python is unique nor innovative.
| However, FORTRAN and COBOL are not modern languages in
| the sense that one can reasonably expect a large
| selection of first- and third- party libraries for most
| common situations, and their availability on e.g. servers
| is far more limited, thus learning them just for
| scripting is not as good a choice as is Python.
| thaumasiotes wrote:
| > Even the punch card got this right: the program wasn't
| stored as text, it was stored as physical holes in paper.
|
| We still do that, storing executable programs as
| executable binary rather than text. What else could you
| do?
| dotancohen wrote:
| In the context of this thread, where we are discussing
| Git as a medium for storing the form of the programs in a
| format that is meant to be read and maintained by humans,
| we do store the programs in text.
| cratermoon wrote:
| We store them as serialized text. We could store them as
| nodes in an AST; we could store them as OLE/CFBF
| structures like older versions of Microsoft Word, or do
| what Ted Nelson suggested decades ago: T. H. Nelson,
| "Complex information processing: a file structure for the
| complex, the changing and the indeterminate," in
| Proceedings of the 1965 20th national conference, New
| York, NY, USA, Aug. 1965, pp. 84-100. doi:
| 10.1145/800197.806036.
| thaumasiotes wrote:
| We do not store executables as serialized text, unless
| they are meant to be executed by an interpreter.
| cratermoon wrote:
| > There's no reason source code need to be stored as plain
| text
|
| The same can be said for lots of different documents, and
| it's been true for programs like Word for a long time. See
| also T. H. Nelson, "Complex information processing: a file
| structure for the complex, the changing and the
| indeterminate," in Proceedings of the 1965 20th national
| conference, New York, NY, USA, Aug. 1965, pp. 84-100. doi:
| 10.1145/800197.806036.
| fwip wrote:
| You might be interested in https://unisonweb.org/
| Kinrany wrote:
| Storing _everything_ in plain text is better. But there's no
| reason source code needs to be _edited_ as plain text.
| ssivark wrote:
| Serializing structured data into ASCII streams makes it
| very hard to then deserialize and re-structure.
|
| Plain text might be the lowest common denominator for
| Unix/shell tools, but we can do far far better in how
| structured data is exchanged, which would make it much
| easier to programmatically manipulate & process.
| [deleted]
| BiteCode_dev wrote:
| Yes, but only if it falls back to text diff as soon as there is
| the smallest doubt it can't provide a good AST diff.
| fstrthnscnd wrote:
| Tabs do work as long as they aren't fixed width (I don't know
| what you mean by "custom").
|
| For instance, in many languages, one will sometimes have to
| split a function call to many lines, and in most languages
| function names aren't of fixed length, thus in order to get a
| correct alignment for parameters, the tab width at that point
| will have to match the function name length.
|
|     #include <stdio.h>
|
|     int main(int argc, char* argv[])
|     {
|         printf("%s %d %s %s\n",
|                __FILE__, __LINE__,
|                __DATE__, __TIME__);
|         return 0;
|     }
|
| I agree with your idea of storing a normalized version of the
| code in the repo: it wouldn't then matter whether that version
| contains characters to align the code properly, it would just
| be inserted by the editor/linter as needed. The difficulty is
| that sometimes linting isn't enough, and some manual formatting
| is needed. Or perhaps the formatting rules are under specified?
|
| Another issue with AST diffing is when languages allow some
| form of syntactic sugar as preprocessing: the compiler might
| just see the simplified tree, not the one with the "sugary"
| forms. A tool capable of parsing such languages should also be
| able to handle these extensions.
| Asraelite wrote:
| > the tab width at that point will have to match the function
| name length.
|
| This is a non-issue. Use tabs for indentation and spaces for
| alignment.
| njharman wrote:
| That is the kind of problem solution that ends up with you
| now having 2 problems. New problem(s); having tabs and
| spaces, having to think when to use them, having to
| train/document everyone in usage, having to debate that
| usage, having to correct code and chastise people who get
| usage wrong.
|
| Use an automatic code formatter with minimal options.
| Automate either running code formatter on commit or denying
| commits that change when code formatter is run on them.
| Asraelite wrote:
| Absolutely, I wouldn't dream of doing any kind of fancy
| alignment by hand, only with an auto-formatter.
|
| If I had to break arguments onto multiple lines without
| an auto-formatter I would just keep it simple and use
| another level of indentation instead of aligning them
| with the function name.
| Latty wrote:
| Better yet, just never do alignment.
|
| Obviously readability is subjective, but personally I find
| alignment is never valuable outside of tables of data, and
| I'd argue generally having tables of data embedded in your
| code isn't ideal.
|
|     #include <stdio.h>
|
|     int main(int argc, char* argv[]) {
|         printf(
|             "%s %d %s %s\n",
|             __FILE__,
|             __LINE__,
|             __DATE__,
|             __TIME__
|         );
|         return 0;
|     }
|
| Reads better to me and avoids the issue entirely. It also
| plays more nicely with traditional diff tools anyway.
| thrwyoilarticle wrote:
| You can also write git hooks to turn their spaces into your
| tabs & vice versa.
| OJFord wrote:
| That's not a good solution - every commit with an author
| (well technically committer I suppose) whose opinion differs
| to the last will be horrendous.
| thrwyoilarticle wrote:
| There won't be any difference. OP will run the script when
| they checkout, work with their tabs, then run the script
| when they commit. Spaces in, spaces out.
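The "spaces in, spaces out" round trip can be sketched as a pair of text transforms. How they are wired up is left open here: hooks as suggested above, or Git's clean/smudge filter attributes, would both be candidates. The indent width of 4 is an assumption about the repository's style.

```python
import re


def spaces_to_tabs(text, width=4):
    """Checkout direction: turn each full group of leading spaces into a tab."""
    def convert(match):
        levels, rest = divmod(len(match.group(0)), width)
        return "\t" * levels + " " * rest  # leftover spaces stay as alignment
    return re.sub(r"^ +", convert, text, flags=re.MULTILINE)


def tabs_to_spaces(text, width=4):
    """Commit direction: expand leading tabs back to the repository's spaces."""
    return re.sub(r"^\t+", lambda m: " " * (width * len(m.group(0))),
                  text, flags=re.MULTILINE)


in_repo = "def f(x):\n    if x:\n        return 1\n    return 0\n"
in_editor = spaces_to_tabs(in_repo)

# What gets committed is byte-identical to what was checked out.
print(tabs_to_spaces(in_editor) == in_repo)
```

The round trip is exact only when the checked-in file indents with spaces alone. Files that already mix tabs and spaces would be rewritten on commit.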
| OJFord wrote:
| Oh ok, sure. But then that's just a weak version of
| what's being requested - an entirely neutral more
| agnostic, _abstract_ format that stores the meaning
| without any formatting at all.
| vxNsr wrote:
| Is it just me or is he describing an IDE with source control?
| bialpio wrote:
| This made me think of Unison: https://www.unisonweb.org/
|
| Discussion: https://news.ycombinator.com/item?id=27652677
| [deleted]
| jpitz wrote:
| Didn't the VisualAge IDEs do this with their built-in version
| control? This was 20 years ago, and I seem to remember that the
| version control was at the method level, not file level.
| ClassAndBurn wrote:
| Git is designed to require human oversight. This is usually a
| feature, but in recent years has become a bug with things like
| GitOps.
|
| It's important to remember that Git is a terrible database
| because of its lack of semantic structure. All conflicts require
| a human who has the context. This is why almost no one
| builds a system that uses Git as a two way interface. And when
| they do, it's via GitHub Pull Requests (which go to humans) and
| not Git itself.
|
| In all, this makes it a wonderful general purpose shared
| filesystem. And that's about it.
| cies wrote:
| > Structure editors haven't really taken off yet despite several
| historical and contemporary attempts.
|
| This is a nice contemporary one:
|
| https://github.com/projectional-haskell/structured-haskell-m...
|
| Lisps also have all kinds of options available in Emacs, but it
| is more special to see this outside of the land of s-expressions.
| jcrites wrote:
| I've had loosely similar ideas before. The basic idea is to make
| the compiler tool chain aware of diffs, and help scrutinize and
| implement them. Refactoring suggestions could be included with
| the diff.
|
| For example, say you're depending on a module and it renamed a
| class/method/trait/macro/constant/whatever. Or a synchronous
| method has become async, or vice versa.
|
| The diff could include programmatic instructions for consumers to
| apply to their code bases switching them over to the new method.
| This could be as simple as semantically changing the name used,
| or in the case of changing sync to async it could add `await` in
| the appropriate spot.
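A rename rule of this kind could look roughly like the following sketch, written against Python's `ast` module (Python 3.9+ for `ast.unparse`). The names `RenameCall`, `apply_rule`, `fetch_all` and `fetch_everything` are hypothetical. The output is re-printed by `ast.unparse`, so original formatting and comments are not preserved, which a production tool would have to handle.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite calls to function `old` so that they call `new` instead."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)  # handle calls nested in the arguments first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            renamed = ast.Name(id=self.new, ctx=ast.Load())
            node.func = ast.copy_location(renamed, node.func)
        return node


def apply_rule(source, old, new):
    """Apply one shipped rename rule to a consumer's source text."""
    tree = RenameCall(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


consumer = 'total = fetch_all(db) + len("fetch_all")\n'
# The call is renamed; the string that merely contains the name is not.
print(apply_rule(consumer, "fetch_all", "fetch_everything"))
```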
|
| There's no limit to how complex the rewrite rules could be. You
| could totally reorganize the parameters to a function and ship
| that refactoring, or even add a parameter along with the default
| code necessary to provide it.
|
| Unlike the author I don't think code in Git will likely ever move
| beyond source, plus the refactoring instructions needed to update
| a change from a dependency -- perhaps a macro-like syntax.
|
| Too much text manipulation is required from source control for me
| to conceive of it being anything but human programming text in a
| future I can imagine. Machines can already parse it; there
| doesn't seem to be a compelling reason to store some other kind
| of structure.
|
| Refactoring wouldn't need to be any special Git extension, just a
| file accompanying the commit with instructions for the language
| tool chain.
|
| Your IDE or CLI could walk you through interactively everywhere
| it's getting applied, or you could apply all and review the
| result in your app that consumes the module.
|
| This would also open up security risks from accepting diffs from
| dependencies and applying their refactorings, but unless modules
| are sandboxed quite well that's a risk you take with updating
| dependencies anyway. And you can always scrutinize the
| refactoring-diff manually after it runs before accepting it.
|
| The industry would probably standardize onto the notion that a
| change that requires running automated refactoring from a
| dependency across your codebase is a major version change; in
| other words a breaking or backwards incompatible change, just one
| that's much easier to upgrade to.
|
| Languages with macros or other programmatic transformers would be
| well suited to this concept I think.
|
| Maybe Rust macros could be enhanced for the purpose to pattern
| match over an existing codebase somehow: not just the macro
| invocation point, but anywhere, e.g., a given trait or function
| is used; and then the output of the macro would not feed into the
| next stage of the compiler step but would instead result in
| rewriting code on disk to produce a diff that you examine and
| apply to your code.
|
| A capability like this would make it much easier to manage large
| aggregate code bases consisting of many dependencies. OSS package
| maintainers or infrastructure providers at companies could ship
| nominally backwards-incompatible changes that are still actually
| compatible when you run the macro transformer that updates the
| code that uses them.
|
| For a simple example, imagine that `foo()` was previously a
| function and the implementation chooses to add some optional
| parameters or those with default values and change it into a
| macro `foo!()`. The accompanying transformer would semantically
| identify references to `foo()` and make the necessary updates.
| You could rename global constants or traits or other code
| elements this way.
|
| Consider the way in which Google Guava has had to evolve over
| time. A number of its features have become part of the Java
| language, and thus the classes deprecated and removed gradually.
| With a compiler facility like what I am describing, users of
| Guava could run the transformer to migrate older code bases that
| use methods like Guava's `Preconditions.checkNotNull(Object,
| message)` to use Java's now-standard `Objects.requireNonNull(T
| obj, String message)`. Because the maintainers of Guava wanted to
| keep it modern, current, avoid redundancy, and designed in the
| best way that they knew how, they made a number of breaking
| changes for which the project lead later apologized [1]. Most
| changes wouldn't have been painful if accompanied by automated
| refactoring.
|
| You could allow the transformer to produce code that still needs
| work from humans to finalize and compile. In that case it could
| change as much as it's able and leave instructions at each call
| site.
|
| At my company we have bots that submit proposed code changes to
| our codebase that need to be taken over as author by a human,
| reviewed, sometimes lightly edited, and shipped, and they work
| quite well. One bot finds unused code and submits diffs to remove
| it if it's been in the code repository long enough. Another
| detects when launch experiments have been at 100% all on one
| treatment for a long period of time (meaning the feature has
| launched) and submits code removing the experiment check. The
| latter sometimes requires removing surrounding code that would
| subsequently become unused from removing the experiment check.
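The experiment-cleanup bot can be imagined along these lines. This is only a guess at the shape of such a tool, not a description of the bots mentioned above. The check function `experiment_enabled("<name>")` is a hypothetical API, and the sketch uses Python's `ast` module (3.9+), so comments and formatting are not preserved.

```python
import ast


class InlineLaunchedExperiment(ast.NodeTransformer):
    """Replace `if experiment_enabled("<name>"): ...` with its body."""

    def __init__(self, name):
        self.name = name

    def _is_check(self, test):
        return (isinstance(test, ast.Call)
                and isinstance(test.func, ast.Name)
                and test.func.id == "experiment_enabled"
                and len(test.args) == 1
                and isinstance(test.args[0], ast.Constant)
                and test.args[0].value == self.name)

    def visit_If(self, node):
        self.generic_visit(node)  # clean up nested checks first
        if self._is_check(node.test):
            return node.body      # a list splices the statements in place
        return node               # other conditions are left alone


def remove_experiment(source, name):
    """Drop the check for a fully launched experiment, keeping the new path."""
    tree = InlineLaunchedExperiment(name).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


before = (
    'if experiment_enabled("new_checkout"):\n'
    "    show_new()\n"
    "else:\n"
    "    show_old()\n"
    "log()\n"
)
print(remove_experiment(before, "new_checkout"))
```

The `else` branch is discarded, so helpers like `show_old` may become unused afterwards. That is the follow-up cleanup the comment describes.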
|
| These have provided meaningful value in helping keep the codebase
| tidy and I look forward to more automation like this in the
| future, including diff-aware compilers and refactoring tools.
|
| [1]
| https://www.reddit.com/r/java/comments/mr03mi/comment/guk848...
___________________________________________________________________
(page generated 2021-09-27 23:01 UTC)