[HN Gopher] Scaling Git's garbage collection
___________________________________________________________________
Scaling Git's garbage collection
Author : todsacerdoti
Score : 52 points
Date : 2022-09-13 16:02 UTC (6 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| forrestthewoods wrote:
| > At GitHub, we store a lot of Git data: more than 18.6 petabytes
| of it, to be precise.
|
| That actually seems kinda small.
|
| Git's lack of good support for large files means there's probably
| an exabyte of data that, imho, should be in source control but
| isn't.
| kccqzy wrote:
| That's indeed small. I'd guess that Google probably stores 4
| orders of magnitude more data than GitHub.
|
| (I was in fact asked a long time ago in an interview to
| estimate how much disk was needed to store Google's search
| index.)
| sulam wrote:
| Glad it was a long time ago. Those kinds of questions are
| awful.
| isatty wrote:
| Agreed that it isn't ideal, but about "awful" specifically
| - I'm not too sure. I would never ask such a question but I
| would assume the intent is just to find out how you think
| and not to get you to spit out a number. Would it be fun if
| the interviewer worked together with you to approximate it?
| ajb wrote:
| You can't actually put the Android source in GitHub because of
| the 4GB per repo size limit. Niche problem, but shows the scale
| of things.
| kortex wrote:
| It would be amazing if Github/lab provided a backing store for
| www.dvc.org . I've been using it to great effect, but I have to
| rely on a separate AWS integration for storing the large objects
| in S3.
| sc68cal wrote:
| I wish they had not gone with uint32_t for storing mtimes, since
| they now have to deal with the 2038 problem at some point in the
| future.
|
| I am surprised they didn't directly use time_t, so that they
| wouldn't have to deal with this (since some platforms have
| already moved to a 64-bit time_t).
| kevingadd wrote:
| Wouldn't that mean if a platform changed time_t formats it
| would invalidate all their stored files?
| grogers wrote:
| Well, if they use an unsigned 32-bit value they at least extended
| it to Y2106 :-)
|
| For this use case it's not really an issue, though. FTA it
| sounded like they always write the mtime as "now", and it's
| unlikely a repo would go 68 years without a GC, which is what it
| would take for wraparound to become a problem.
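| A quick sanity check of those cutoffs (a minimal sketch, assuming
| the timestamps are plain seconds since the Unix epoch; it needs a
| platform with a 64-bit time_t to run):
|
|         #include <stdint.h>
|         #include <stdio.h>
|         #include <time.h>
|
|         int main(void) {
|             /* Largest values a signed and an unsigned 32-bit
|                seconds-since-epoch counter can hold. */
|             time_t signed_max   = (time_t)INT32_MAX;   /* 2^31 - 1 */
|             time_t unsigned_max = (time_t)UINT32_MAX;  /* 2^32 - 1 */
|             char buf[64];
|
|             strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC",
|                      gmtime(&signed_max));
|             printf("int32_t overflows at: %s\n", buf);  /* 2038-01-19 */
|
|             strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC",
|                      gmtime(&unsigned_max));
|             printf("uint32_t wraps at:    %s\n", buf);  /* 2106-02-07 */
|             return 0;
|         }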
| est31 wrote:
| For on-disk formats, time_t would probably not be a good choice,
| but indeed, they have a time_t-to-uint32_t conversion going on
| that is not even saturating; it just cuts the high bits off:
|
| https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
|
| https://github.com/git/git/blob/e188ec3a735ae52a0d0d3c22f9df...
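|
| For context, the difference between a plain cast like that and a
| saturating conversion looks roughly like this (a sketch only; the
| function names are made up, not Git's):
|
|         #include <stdint.h>
|         #include <time.h>
|
|         /* Truncating, like a bare cast: the high bits are silently
|            dropped, so an out-of-range time_t wraps to a small value. */
|         static uint32_t mtime_truncate(time_t t) {
|             return (uint32_t)t;
|         }
|
|         /* Saturating alternative: out-of-range values are clamped
|            to the limits of the 32-bit field instead of wrapping. */
|         static uint32_t mtime_saturate(time_t t) {
|             if (t < 0)
|                 return 0;
|             if ((uint64_t)t > UINT32_MAX)
|                 return UINT32_MAX;
|             return (uint32_t)t;
|         }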
| cesarb wrote:
| > I wish they had not gone with uint32_t for storing mtimes,
| since they now have to deal with the 2038 problem, sometime in
| the future.
|
| Since uint32_t is _unsigned_, wouldn't it be the Y2106 problem
| instead?
|
| > I am surprised they didn't directly use time_t, so that they
| wouldn't have to deal with this (since some platforms have
| already gone to 64 bit time_t)
|
| You mentioned the problem yourself without noticing: _some_
| platforms have gone to 64-bit time_t, but others haven't. This
| is a file format, which can be shared by multiple platforms, so
| it cannot use types which change size depending on the
| platform.
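|
| To illustrate the portability point (hypothetical helpers, not
| Git's actual serialization code): a file format pins the field
| down to exactly four big-endian bytes, so every platform writes
| and reads the same bytes no matter how wide its own time_t is.
|
|         #include <stdint.h>
|
|         /* Write a 32-bit timestamp as four big-endian bytes. */
|         static void put_be32(unsigned char out[4], uint32_t v) {
|             out[0] = (unsigned char)(v >> 24);
|             out[1] = (unsigned char)(v >> 16);
|             out[2] = (unsigned char)(v >> 8);
|             out[3] = (unsigned char)(v >> 0);
|         }
|
|         /* Read it back; the result is identical on every host. */
|         static uint32_t get_be32(const unsigned char in[4]) {
|             return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
|                    ((uint32_t)in[2] <<  8) |  (uint32_t)in[3];
|         }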
___________________________________________________________________
(page generated 2022-09-13 23:00 UTC)