[HN Gopher] Getting 10TB of GitHub logs and extracting details o...
       ___________________________________________________________________
        
       Getting 10TB of GitHub logs and extracting details of all users and
       repositories
        
       Author : zaric
       Score  : 83 points
       Date   : 2023-06-15 14:37 UTC (8 hours ago)
        
 (HTM) web link (trickest.com)
 (TXT) w3m dump (trickest.com)
        
       | didntcheck wrote:
       | I'm a big fan of neat statistical analyses, but when I look at
       | the archive site https://www.gharchive.org/ , the overwhelming
       | feeling I have is "creeped out". Taking periodic snapshots of
       | repositories and their issues and wikis sounds good, but do we
       | really need a log of every time someone watches an issue, and
       | every commit message being irrevocably set on public record? That
       | level of details on _individual_ activity seems to have little
       | value outside of cyberstalking
       | 
       | I did have a look at the bottom of the page where prominent uses
       | are listed, but nothing stands out as actually useful tbqh
        
         | TheBrokenRail wrote:
         | If you put information online publicly, you should always be
         | working under the assumption that it will be immediately
         | archived by someone. Whether that is the Internet Archive for
         | websites, a Discord bot archiving edits and deletions,
         | Pushshift (formerly) for Reddit, or just some private group
         | operating a web scraper.
         | 
         | At least in this situation the archived data is public.
        
         | mxmlnkn wrote:
         | This kind of information has become even more easily accessible
         | for everyone with the introduction of the Activity tab. I
         | really wish that force-pushed commits and deleted branches
         | would be permanently lost instead of being permanently stored
         | like on a blockchain.
         | 
         | https://github.blog/changelog/2023-05-31-view-repository-pus...
        
         | ResearchCode wrote:
         | Since people can't behave and come up with "business models"
         | like this, sites will have to become subscription-only.
        
         | lesquivemeau wrote:
         | Is this even GDPR compliant ?
        
           | kiririn wrote:
           | GitHub is already a GDPR joke. They won't delete anything
           | less than an entire account
        
           | pcdevils wrote:
           | Based in Serbia but operating the servers in Frankfurt. With
           | the cavalier attitude to scraping and linking people's
           | identifiable data without any sort of opt-in I had assumed it
           | would be a US company.
        
         | zx8080 wrote:
         | > That level of details on individual activity seems to have
         | little value outside of cyberstalking
         | 
         | A selling point for the business level is that they can build
         | whatever OKR they want on top of those details, to make any
         | employee dance however the business can imagine. Any amount of TB
         | seems OK as long as it helps sell the "Business" tier of GitHub.
        
         | ziml77 wrote:
         | I have a GitHub account under my real name, but recently I've
         | started using GitHub under a couple of other names instead.
         | There's so much stuff you do in public on GitHub that I want to
         | avoid people doing exactly this kind of analysis on.
         | 
         | I wish using multiple identities was at least some level of
         | foolproof though. I have to be careful to configure my local
         | copies of repos to use the correct username, masked email, and
         | PGP signing key. It would be super handy if git had a global
         | config of multiple identities and if I could have it prompt me
         | to select one either when I clone a repo or the first time I
         | push to it.
        
           | zhfliz wrote:
           | note that this is technically against their TOS if not using
           | paid accounts:
           | 
           | > One person or legal entity may maintain no more than one
           | free Account (if you choose to control a machine account as
           | well, that's fine, but it can only be used for running a
           | machine)
           | 
           | https://docs.github.com/en/site-policy/github-terms/github-t...
        
             | bigbillheck wrote:
             | They'll have to catch me first.
        
             | kfrzcode wrote:
             | internet would be better with total anonymity.
        
               | xur17 wrote:
               | That's one of my favorite parts of reddit. Accounts are
               | just pseudonyms, and you can generate as many as you
               | want. I personally generate a new one every few months,
               | which helps keep too much identifying data from building
               | up over time.
        
               | kfrzcode wrote:
               | and they're all identifiably attached to your metadata
               | package(s)
        
               | drdaeman wrote:
               | Which, at the very least, isn't public to everyone. So,
               | not perfect, but better than single public account.
        
               | awesome_dude wrote:
               | There's an almost constant debate on anonymity on the
               | internet, the argument has existed for at least the last
               | 30 or so years, and will likely continue for the lifetime
               | of the internet.
               | 
               | In the "use your real name" corner - people argue that
               | using an anonymous handle means that there is no
               | credibility, the lack of consequence means that people
               | can be /more/ disingenuous and provide bad faith
               | comments.
               | 
               | In the "be anonymous" corner - people argue that what you
               | say or do on the net now shouldn't affect you in 25 years
               | time (there are multiple examples of politicians, or
               | public figures who have said unwise things on the
               | internet, only to have their careers ripped apart decades
               | later)
               | 
               | For me (obviously) I have chosen the anonymous handle -
               | because in the past I have had employers contacted
               | because I hold different political views to some nasty
               | people, I've had my house marked by them, my life was
               | threatened, and some veiled threats about my children.
               | (libertarians are _very_ anti free speech when the speech
               | is anti their house of cards)
        
               | Firmwarrior wrote:
               | Well, to be fair, some of the most vile hateful stuff
               | I've ever seen in my life was posted on Facebook-powered
               | comment sections under news articles, next to real names
               | and pictures of smiling grandparents holding their
               | grandchildren
        
               | WalterBright wrote:
               | [flagged]
        
               | awesome_dude wrote:
               | > It's sad that that occurred. A major part of
               | libertarianism is free speech. But I'm sure you know that
               | left-wing governments prefer to send any dissenters to
               | the gulag.
               | 
               | Whataboutism?
               | 
               | Honestly if all you really have is "but those other bad
               | people are bad too" you have nothing at all.
        
           | jxramos wrote:
           | Private repos are still off limits for this sort of analysis
           | correct?
        
             | _a_a_a_ wrote:
             | Microsoft owns github so draw your own conclusions on
             | whether that's true and how long it will last if it is.
        
               | ahauxuueei wrote:
               | What a blast from the past. Trust no one, am I right?
               | 
               | What's up Mulder! This is deep throat speaking
        
               | _a_a_a_ wrote:
               | You are wrong. Don't be cynical, but equally don't be
               | naive. MS hasn't changed at core.
        
             | ziml77 wrote:
             | They should be, but those aren't a solution for anything
             | you want to share, collaborate on, or even just ask
             | questions about.
        
             | [deleted]
        
           | 0x0 wrote:
           | You can have includeIf sections in your .gitconfig that
           | apply only to things within a certain directory. So if you
           | create top level directories in your $HOME for your
           | various identities, all you need to do is to make sure you
           | are cloning into and working within the appropriate directory
           | hierarchy.
        
             | mplewis9z wrote:
             | I do this to separate personal and company repositories on
             | the same machine, and it works flawlessly. An example
             | config looks like:
             | 
             | ~/.gitconfig:
             | 
             | ```
             | [includeIf "gitdir:~/Personal/"]
             |     path = ~/personal.gitconfig
             | ```
             | 
             | ~/personal.gitconfig:
             | 
             | ```
             | [user]
             |     email = {personal email address}
             | ```
             | 
             | And you can have arbitrary numbers of "profiles" like this,
             | as long as each is in their own directory.
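
For anyone juggling more than two or three identities, the stanzas above could be generated rather than typed by hand. This is only a rough sketch: the function names, directory names, and email addresses are made up for illustration, and it just builds the file text without writing anything to disk. It relies on one documented git behaviour: a `gitdir:` pattern ending in `/` matches every repository below that directory.

```python
def render_main_config(profiles):
    """Return text for ~/.gitconfig with one includeIf stanza per profile.

    `profiles` maps a top-level directory (hypothetical, e.g. "~/Personal")
    to the config file that should be included for repositories under it.
    """
    stanzas = []
    for directory, include_path in profiles.items():
        # Normalise to exactly one trailing slash so git matches recursively.
        pattern = directory.rstrip("/") + "/"
        stanzas.append(
            f'[includeIf "gitdir:{pattern}"]\n\tpath = {include_path}\n'
        )
    return "\n".join(stanzas)


def render_profile(name, email):
    """Return text for one per-identity file such as ~/personal.gitconfig."""
    return f"[user]\n\tname = {name}\n\temail = {email}\n"


if __name__ == "__main__":
    # Example identities; replace with your own directories and addresses.
    print(render_main_config({
        "~/Personal": "~/personal.gitconfig",
        "~/Work": "~/work.gitconfig",
    }))
    print(render_profile("Example Name", "personal@example.com"))
```

Running `git config user.email` inside a clone under each directory is a quick way to check that the right profile is being picked up.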
        
         | killingtime74 wrote:
         | Don't blame the archiver who is only doing what is allowed of
         | them by Github. If you don't want to be public don't be public.
         | Other archivers aren't making themselves publicly known but
         | have the data all the same.
        
         | kfrzcode wrote:
         | Lol let me introduce you to a little organization called the
         | National Security Agency, with their "creepy" periodic
         | snapshots of much more intriguing datasets.
         | 
         | "Stellar Wind" is a good place to start.
         | 
         | Including, but of course not limited to, every communication
         | made by any person within the United States (or outgoing) for
         | the better part of two decades. Internet traffic,
         | communications, all of it.
         | 
         | https://oig.justice.gov/reports/2015/PSP-09-18-15-vol-III.pd...
        
           | WalterBright wrote:
           | Looks like I'm destined for the gulag for sure.
        
       | sleepychu wrote:
       | Was this written by GPT? I was quite interested in the topic of
       | the article but I started to get the brain fog I associate with
       | parsing ChatGPT's convoluted sentences.
        
         | thefourthchime wrote:
         | It also seems like a thinly veiled piece of product marketing.
        
       | zX41ZdbW wrote:
       | The article leaves a bitter taste of unnecessary complexity. Data
       | engineering should not be hard.
       | 
       | For example, you can load the GitHub Archive to ClickHouse, and
       | it will be accessible with interactive real-time queries:
       | https://ghe.clickhouse.tech/
       | 
       | See also https://til.simonwillison.net/clickhouse/github-explorer
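
To give a feel for how simple the raw data is, here is a minimal sketch in plain Python (not ClickHouse, and not the article's pipeline). It assumes the GH Archive layout of one gzipped, newline-delimited JSON file per hour at `https://data.gharchive.org/`, with each event carrying a `type` field. The function names are invented for this example.

```python
import gzip
import io
import json
from collections import Counter
from urllib.request import urlopen


def count_event_types(gz_bytes):
    """Count events by their "type" field in one gzipped NDJSON archive."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines, if any
                counts[json.loads(line)["type"]] += 1
    return counts


def fetch_hour(date, hour):
    """Download one hourly archive, e.g. fetch_hour("2015-01-01", 15).

    The URL pattern is an assumption based on the GH Archive site; the hour
    is written without zero padding.
    """
    url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
    with urlopen(url) as response:
        return response.read()


if __name__ == "__main__":
    # Downloads a single hour of events and prints the most common types.
    for event_type, n in count_event_types(fetch_hour("2015-01-01", 15)).most_common(5):
        print(event_type, n)
```

This holds a whole hour in memory and is no substitute for a real database, but it shows that each record is just a JSON object that a columnar store can ingest directly.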
        
       | atum47 wrote:
       | [flagged]
        
         | mtmail wrote:
         | IMHO it's in the guidelines "Please don't complain about
         | tangential annoyances--e.g. article or website formats, name
         | collisions, or back-button breakage. They're too common to be
         | interesting." https://news.ycombinator.com/newsguidelines.html
        
           | atum47 wrote:
           | did not know about this part of the guidelines, thanks.
           | 
           | a while back I was thinking about creating an extension that
           | deals with this issue but I've heard some browsers are
           | already working on that.
        
         | amusingimpala75 wrote:
         | Even
         | 
         | https://archive.ph/atw1q
         | 
         | didn't work correctly on the page, it just ceases to scroll
         | after a point.
        
           | George83728 wrote:
           | The website works fine _unless_ you enable javascript. That's
           | usually the way it is with these sort of things. The
           | webdev or CMS creates a perfectly functional website using
           | HTML and CSS, then some javascript is added to shit the whole
           | thing up. Disable javascript by default for a better web
           | experience.
        
             | zaric wrote:
             | The noise is gone, enjoy your read! :)
        
         | [deleted]
        
       | sleepytimetea wrote:
       | The background with static noise really bothers me. Will have to
       | skip reading till they provide a disable button.
        
         | zaric wrote:
         | No more noise!
        
         | giancarlostoro wrote:
         | On Firefox and even Microsoft Edge there is a "reader mode"
         | option for most websites. I click on that often enough when I
         | expect an article, to remove noise from ads.
        
       ___________________________________________________________________
       (page generated 2023-06-15 23:01 UTC)