[HN Gopher] Getting 10TB of GitHub logs and extracting details o...
___________________________________________________________________
Getting 10TB of GitHub logs and extracting details of all users and
repositories
Author : zaric
Score : 83 points
Date : 2023-06-15 14:37 UTC (8 hours ago)
(HTM) web link (trickest.com)
(TXT) w3m dump (trickest.com)
| didntcheck wrote:
| I'm a big fan of neat statistical analyses, but when I look at
| the archive site https://www.gharchive.org/ , the overwhelming
| feeling I have is "creeped out". Taking periodic snapshots of
| repositories and their issues and wikis sounds good, but do we
| really need a log of every time someone watches an issue, and
| every commit message being irrevocably set on public record? That
| level of details on _individual_ activity seems to have little
| value outside of cyberstalking
|
| I did have a look at the bottom of the page where prominent uses
| are listed, but nothing stands out as actually useful tbqh
| TheBrokenRail wrote:
| If you put information online publicly, you should always be
| working under the assumption that it will be immediately
| archived by someone. Whether that is the Internet Archive for
| websites, a Discord bot archiving edits and deletions,
| Pushshift (formerly) for Reddit, or just some private group
| operating a web scraper.
|
| At least in this situation the archived data is public.
| mxmlnkn wrote:
| This kind of information has become even more easily accessible
| for everyone with the introduction of the Activity tab. I
| really wish that force-pushed commits and deleted branches
| would be permanently lost instead of being permanently stored
| like on a blockchain.
|
| https://github.blog/changelog/2023-05-31-view-repository-pus...
| ResearchCode wrote:
| Since people can't behave and come up with "business models"
| like this, sites will have to become subscription-only.
| lesquivemeau wrote:
| Is this even GDPR compliant?
| kiririn wrote:
| GitHub is already a GDPR joke. They won't delete anything
| less than an entire account
| pcdevils wrote:
| Based in Serbia but operating the servers in Frankfurt. With
| the cavalier attitude to scraping and linking people's
| identifiable data without any sort of opt-in I had assumed it
| would be a US company.
| zx8080 wrote:
| > That level of details on individual activity seems to have
| little value outside of cyberstalking
|
| A selling point for the business tier is that customers can build
| whatever OKRs they want on top of those details, to make any
| employee dance however the business can imagine. Any amount of TB
| seems OK as long as it helps sell the "Business" tier of GitHub.
| ziml77 wrote:
| I have a GitHub account under my real name, but recently I've
| started using GitHub under a couple of other names instead.
| There's so much stuff you do in public on GitHub that I want to
| avoid people doing exactly this kind of analysis on.
|
| I wish using multiple identities was at least some level of
| foolproof though. I have to be careful to configure my local
| copies of repos to use the correct username, masked email, and
| PGP signing key. It would be super handy if git had a global
| config of multiple identities and if I could have it prompt me
| to select one either when I clone a repo or the first time I
| push to it.
| zhfliz wrote:
| note that this is technically against their TOS if not using
| paid accounts:
|
| > One person or legal entity may maintain no more than one
| free Account (if you choose to control a machine account as
| well, that's fine, but it can only be used for running a
| machine)
|
| https://docs.github.com/en/site-policy/github-terms/github-t...
| bigbillheck wrote:
| They'll have to catch me first.
| kfrzcode wrote:
| internet would be better with total anonymity.
| xur17 wrote:
| That's one of my favorite parts of reddit. Accounts are
| just pseudonyms, and you can generate as many as you
| want. I personally generate a new one every few months,
| which helps keep too much identifying data from building
| up over time.
| kfrzcode wrote:
| and they're all identifiably attached to your metadata
| package(s)
| drdaeman wrote:
| Which, at the very least, isn't public to everyone. So,
| not perfect, but better than single public account.
| awesome_dude wrote:
| There's an almost constant debate on anonymity on the
| internet, the argument has existed for at least the last
| 30 or so years, and will likely continue for the lifetime
| of the internet.
|
| In the "use your real name" corner - people argue that
| using an anonymous handle means that there is no
| credibility, the lack of consequence means that people
| can be /more/ disingenuous and provide bad faith
| comments.
|
| In the "be anonymous" corner - people argue that what you
| say or do on the net now shouldn't affect you in 25 years
| time (there are multiple examples of politicians, or
| public figures who have said unwise things on the
| internet, only to have their careers ripped apart decades
| later)
|
| For me (obviously) I have chosen the anonymous handle -
| because in the past I have had employers contacted
| because I hold different political views to some nasty
| people, I've had my house marked by them, my life was
| threatened, and some veiled threats about my children.
| (libertarians are _very_ anti free speech when the speech
| is anti their house of cards)
| Firmwarrior wrote:
| Well, to be fair, some of the most vile hateful stuff
| I've ever seen in my life was posted on Facebook-powered
| comment sections under news articles, next to real names
| and pictures of smiling grandparents holding their
| grandchildren
| WalterBright wrote:
| [flagged]
| awesome_dude wrote:
| > It's sad that that occurred. A major part of
| libertarianism is free speech. But I'm sure you know that
| left-wing governments prefer to send any dissenters to
| the gulag.
|
| Whataboutism?
|
| Honestly if all you really have is "but those other bad
| people are bad too" you have nothing at all.
| jxramos wrote:
| Private repos are still off limits for this sort of analysis,
| correct?
| _a_a_a_ wrote:
| Microsoft owns github so draw your own conclusions on
| whether that's true and how long it will last if it is.
| ahauxuueei wrote:
| What a blast from the past. Trust no one, am I right?
|
| What's up Mulder! This is deep throat speaking
| _a_a_a_ wrote:
| You are wrong. Don't be cynical, but equally don't be
| naive. MS hasn't changed at core.
| ziml77 wrote:
| They should be, but those aren't a solution for anything you
| want to share, collaborate on, or even just ask questions
| about.
| [deleted]
| 0x0 wrote:
| You can have includeIf sections in your .gitconfig that apply
| only to repositories within a certain directory. So if you
| create top-level directories in your $HOME for your various
| identities, all you need to do is make sure you are cloning
| into and working within the appropriate directory hierarchy.
| mplewis9z wrote:
| I do this to separate personal and company repositories on
| the same machine, and it works flawlessly. An example
| config looks like:
|
| ~/.gitconfig:
|
| ```
| [includeIf "gitdir:~/Personal/"]
|     path = ~/personal.gitconfig
| ```
|
| ~/personal.gitconfig:
|
| ```
| [user]
|     email = {personal email address}
| ```
|
| And you can have arbitrary numbers of "profiles" like this,
| as long as each is in their own directory.
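
As a rough sketch of the "arbitrary numbers of profiles" idea above, the snippet below renders one `includeIf` stanza per identity directory. The directory names, the config file names, and the `render_include_sections` helper are all hypothetical and chosen only for illustration; the `gitdir:` condition with a trailing slash is the same form used in the comment above.

```python
def render_include_sections(profiles):
    """Return .gitconfig text with one includeIf stanza per profile.

    `profiles` maps a directory (hypothetical, e.g. "~/Personal") to the
    config file git should include for repositories under that directory.
    """
    stanzas = []
    for directory, config_path in profiles.items():
        # A trailing slash makes the gitdir: condition match every
        # repository below the directory, not only the directory itself.
        prefix = directory.rstrip("/") + "/"
        stanzas.append(
            f'[includeIf "gitdir:{prefix}"]\n\tpath = {config_path}\n'
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Hypothetical identities; paste the output into ~/.gitconfig.
    print(render_include_sections({
        "~/Personal": "~/personal.gitconfig",
        "~/Work": "~/work.gitconfig",
    }))
```

Running `git config user.email` inside a repository under each directory is a quick way to check which identity is actually in effect.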
| killingtime74 wrote:
| Don't blame the archiver who is only doing what is allowed of
| them by Github. If you don't want to be public don't be public.
| Other archivers aren't making themselves publicly known but
| have the data all the same.
| kfrzcode wrote:
| Lol let me introduce you to a little organization called the
| National Security Agency, with their "creepy" periodic
| snapshots of much more intriguing datasets.
|
| "Stellar Wind" is a good place to start.
|
| Including, but of course not limited to, every communication
| made by any person within the United States (or outgoing) for
| the better part of two decades. Internet traffic,
| communications, all of it.
|
| https://oig.justice.gov/reports/2015/PSP-09-18-15-vol-III.pd...
| WalterBright wrote:
| Looks like I'm destined for the gulag for sure.
| sleepychu wrote:
| Was this written by GPT? I was quite interested in the topic of
| the article but I started to get the brain fog I associate with
| parsing ChatGPT's convoluted sentences.
| thefourthchime wrote:
| It also seems like a thinly veiled piece of product marketing.
| zX41ZdbW wrote:
| The article leaves a bitter taste of unnecessary complexity. Data
| engineering should not be hard.
|
| For example, you can load the GitHub Archive to ClickHouse, and
| it will be accessible with interactive real-time queries:
| https://ghe.clickhouse.tech/
|
| See also https://til.simonwillison.net/clickhouse/github-explorer
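
For a sense of what the raw data looks like before it reaches something like ClickHouse, here is a minimal sketch that tallies event types and collects user and repository names from one hourly GH Archive file. It assumes the newline-delimited JSON layout and the hourly `.json.gz` URL pattern shown on gharchive.org. The helper names are hypothetical, and the `actor.login` and `repo.name` fields are assumed to follow the current event format (older archives may be laid out differently, in which case those events are still counted by type but contribute no names).

```python
import gzip
import json
import urllib.request
from collections import Counter


def summarize_events(lines):
    """Count event types and collect actor logins and repo names."""
    event_types = Counter()
    users, repos = set(), set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        event = json.loads(line)
        event_types[event.get("type", "unknown")] += 1
        actor = event.get("actor")
        repo = event.get("repo")
        # Only read nested fields when they have the expected shape.
        if isinstance(actor, dict) and actor.get("login"):
            users.add(actor["login"])
        if isinstance(repo, dict) and repo.get("name"):
            repos.add(repo["name"])
    return event_types, users, repos


def fetch_hour(url):
    """Download one hourly archive and yield its decoded lines."""
    with urllib.request.urlopen(url) as response:
        with gzip.open(response, mode="rt", encoding="utf-8") as lines:
            yield from lines


if __name__ == "__main__":
    # Example hour; any YYYY-MM-DD-H value in the archive should work.
    url = "https://data.gharchive.org/2015-01-01-15.json.gz"
    types, users, repos = summarize_events(fetch_hour(url))
    print(types.most_common(5), len(users), len(repos))
```

This streams a single hour through memory; covering the full archive means repeating it per hour, which is where a columnar store such as the ClickHouse setup linked above becomes attractive.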
| atum47 wrote:
| [flagged]
| mtmail wrote:
| IMHO it's in the guidelines "Please don't complain about
| tangential annoyances--e.g. article or website formats, name
| collisions, or back-button breakage. They're too common to be
| interesting." https://news.ycombinator.com/newsguidelines.html
| atum47 wrote:
| did not know about this part of the guidelines, thanks.
|
| a while back I was thinking about creating an extension that
| deals with this issue but I've heard some browsers are
| already working on that.
| amusingimpala75 wrote:
| Even
|
| https://archive.ph/atw1q
|
| didn't work correctly on the page, it just ceases to scroll
| after a point.
| George83728 wrote:
| The website works fine _unless_ you enable javascript. That's
| usually the way it is with these sort of things. The
| webdev or CMS creates a perfectly functional website using
| HTML and CSS, then some javascript is added to shit the whole
| thing up. Disable javascript by default for a better web
| experience.
| zaric wrote:
| The noise is gone, enjoy your read! :)
| [deleted]
| sleepytimetea wrote:
| The background with static noise really bothers me. Will have to
| skip reading till they provide a disable button.
| zaric wrote:
| No more noise!
| giancarlostoro wrote:
| On Firefox and even Microsoft Edge there is a "reader mode"
| option for most websites. I click on that often enough when I
| expect an article, to remove noise from ads.
___________________________________________________________________
(page generated 2023-06-15 23:01 UTC)