[HN Gopher] The Python Package Index is now a GitHub secret scan...
___________________________________________________________________
The Python Package Index is now a GitHub secret scanning integrator
Author : rbanffy
Score : 327 points
Date : 2021-03-24 11:52 UTC (11 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| soheil wrote:
| This makes me wonder if Github should do basic code sanity checks
| on every repo. Things like checking for division by zero,
| infinite-loops, etc. They'd have to be very conservative checks
| so as not to trigger false positives. But if there is benefit in
| secret scanning for all public repos there must be benefit in
| detecting other types of programmer mistakes.
| leblancfg wrote:
| They acquired LGTM (https://github.com/marketplace/lgtm) not
| too long ago, so expect this to happen.
| [deleted]
| RocketSyntax wrote:
| would love to see tighter integration with some GitHub Secret/
| Action publishing
| di wrote:
| Not sure if this is what you're asking for, but the PyPA does
| maintain a GitHub Action for publishing to PyPI as well:
| https://github.com/pypa/gh-action-pypi-publish
| sneak wrote:
| This is some epic-level brand building in action. Pretty soon,
| people just entering our industry will mistakenly believe that
| GitHub's ownership (Microsoft) wants open source to exist and
| thrive.
| molticrystal wrote:
| They got a decent list of partnered companies which you can find
| over here:
|
| https://docs.github.com/en/code-security/secret-security/abo...
|
| Glad they got our back.
| linkdd wrote:
| Great news!
|
| IMHO, Github should make it mandatory for integrated services to
| provide this feature.
| loloquwowndueo wrote:
| Wow today I learned this acronym. PyPI -> python package index,
| after using python for over a decade. Thanks!
| kspacewalk2 wrote:
| Pronounced pie-pee-aye, and not pee-pee, pie-pee or any of the
| other ways I heard it pronounced at work :)
| verall wrote:
| Based on my workplace, I'm pretty sure it's "pee-pee". Just
| like 'Qt' is "cue-tee".
|
| There's no winning these battles..
| porker wrote:
| > Just like 'Qt' is "cue-tee".
|
| How else would you want to pronounce it?
| conradludgate wrote:
| According to them, it's just "cute"
| loloquwowndueo wrote:
| Right, I used to pronounce it as pie-pie. Might continue to
| do so but at least I know what it stands for :D
| danudey wrote:
| I call it pie-pie because that makes the most sense and
| sounds the least weird.
| mschulkind wrote:
| Just don't confuse it with PyPy, which is entirely different...
| fredley wrote:
| That's why you pronounce it "Cheese Shop"
| daviddavis wrote:
| And don't pronounce PyPI as pie-pie. It's pie-P-I.
| dec0dedab0de wrote:
| it was much easier when it was just called the cheese shop
| lostcolony wrote:
| Ah. The fat detective.
| cpcallen wrote:
| The headline sounds insidious (How dare PyPI and GitHub secretly
| scan me! I'm glad someone has revealed this dastardly collusion!)
| but it turns out they're actually doing something great.
| zitterbewegung wrote:
| Naming things is the hardest thing to do in computer science.
| brian_herman wrote:
| Yes brother I agree!
| doubleunplussed wrote:
| I thought it was the second hardest. At least that's what I
| remember, since I last checked.
| cbm-vic-20 wrote:
| That, and cache invalidation.
| teraku wrote:
| That, and off-by-one errors
| airstrike wrote:
| There are actually only two hard problems in computer
| science:
|
| 0) Cache invalidation
|
| 1) Naming things
|
| 5) Asynchronous callbacks
|
| 2) Off-by-one errors
|
| 3) Scope creep
|
| 6) Bounds checking
| jsheard wrote:
| 4294967295) Integer underflows
| macksd wrote:
| 7) Project estimation
| DonHopkins wrote:
| -1) Keeping secrets
| weeboid wrote:
| Luckily, building better garbage collectors is easy: ref
| pointers to each cons
| wizzwizz4 wrote:
| Naming things is the hardest thing to do in computer
| science.
| mbreese wrote:
| 7) February 29th.
| moviuro wrote:
| 7) Timezones
|
| FTFY
| Sebb767 wrote:
| 7.0000001) leap seconds
| _joel wrote:
| NaN) Javascript
| gogopuppygogo wrote:
| 9000) communicating
| eganist wrote:
| @dang, in re: this comment, any hopes of editing the title to
| say "secret-scanning" with a hyphen? Might add some clarity.
| melson wrote:
| good one
| z77dj3kl wrote:
| Is there some best practice on creating a format for secret keys?
| If I create an API with secret keys, should I make them something
| like z77dj3kl-secret-pk-[secret-stuff]?
|
| Is there an argument (security by obscurity?) that that makes it
| easier to spot it and abuse it?
|
| Or would it be better to encode it in the secret bits somehow,
| add 16 control bits that have known values?
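One way to read the question above is to give keys a fixed, greppable prefix and append a few check bits, so a scanner can reject random look-alikes without contacting the issuer. The following is a minimal sketch of that idea, not any real service's key format. The prefix `examplesvc_`, both function names, and the use of a 32-bit CRC (instead of the 16 control bits floated above) are assumptions made purely for illustration.

```python
import secrets
import zlib

# Hypothetical prefix; a real service would pick its own.
PREFIX = "examplesvc_"
CHECK_LEN = 8  # 32-bit CRC rendered as 8 hex characters


def make_token(nbytes: int = 24) -> str:
    """Build prefix + random hex body + CRC32 of the body."""
    body = secrets.token_hex(nbytes)
    checksum = zlib.crc32(body.encode("utf-8"))
    return f"{PREFIX}{body}{checksum:08x}"


def looks_like_token(candidate: str) -> bool:
    """Cheap offline check: right prefix and a matching checksum."""
    if not candidate.startswith(PREFIX):
        return False
    rest = candidate[len(PREFIX):]
    if len(rest) <= CHECK_LEN:
        return False
    body, checksum = rest[:-CHECK_LEN], rest[-CHECK_LEN:]
    try:
        expected = int(checksum, 16)
    except ValueError:
        return False
    return zlib.crc32(body.encode("utf-8")) == expected
```

The checksum here is not a security measure, since anyone can compute it. It only helps a scanner cut down on false positives; the randomness of the body is what keeps the key secret.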
| theoretick wrote:
| FWIW There's a new RFC for specifying a URI scheme:
| https://tools.ietf.org/html/rfc8959
| einpoklum wrote:
| As a non-Python person:
|
| Is it an easy mistake to make, for someone to inadvertently
| commit and push a "secret PyPI token"?
| progval wrote:
| I think not. The standard tools read the token from ~/.pypirc
| (or the console if absent). Inadvertent commits of the token
| probably only happen if you have a custom script with a
| hardcoded token.
| macintux wrote:
| Secrets in general leak into source code all the time, nothing
| specific about PyPI.
| klyrs wrote:
| I can certainly imagine putting a token into a deploy script in
| the same directory as a python package's repo. From there, it's
| a typo away from getting added and committed to the repo. So,
| it's better to keep those tokens elsewhere.
| einpoklum wrote:
| Isn't it totally verboten to put secret tokens / passwords
| into scripts? Regardless of language?
|
| When I write, say, bash scripts which do work using ssh, I
| don't specify a password: The user running the script will
| provide their own manually, or use ssh-copy-id, or edit the
| authorized_keys file on the target machine if they want to
| save themselves some typing. That is - authentication is
| decoupled from my script's actual work. Why is that not how
| things work with PyPI?
| progval wrote:
| It is. But even if it is strongly discouraged, some people
| will commit it anyway. Look at any beginner's repository,
| there is a high chance it contains files compiled from the
| source of the repo (executable, .pyc, ...), the developer's
| IDE config (.vscode, ...), __MACOSX, ...
| klyrs wrote:
| > Isn't it totally verboten to put secret tokens /
| passwords into scripts?
|
| It's only a rule because people have made the mistake
| enough to learn the lesson...
| hannasanarion wrote:
| If you are trying to publish your package for other people to
| download through the `pip` package manager, then yeah.
|
| Most Python devs will probably never publish to PyPI, but this
| can save some headaches for those who do, especially for the
| first time.
| seanwilson wrote:
| Do any APIs standardise on a simple secret key pattern that can
| be easily identified as a secret? For example, all secrets have a
| "secret-" prefix? Or is this idea unworkable?
|
| I usually try and prefix e.g. fields in config files with
| "secret" to make it obvious they shouldn't be committed.
| csnover wrote:
| There was a discussion a while ago about IETF RFC 8959 which
| proposes a secret-token URI that might be of interest:
| https://news.ycombinator.com/item?id=25978185
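For readers who have not seen RFC 8959: it registers a `secret-token:` URI scheme whose body is one or more URI path characters, so any tool can flag such strings without knowing who issued them. Below is a rough sketch of a detector under that reading of the grammar. The function name and the regular expression are written for this example and are not taken from the RFC text, so check them against the specification before relying on them.

```python
import re

# One URI "pchar": unreserved / sub-delims / ":" / "@" / percent-encoded byte.
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
SECRET_TOKEN_URI = re.compile(r"secret-token:" + PCHAR + r"+")


def find_secret_token_uris(text: str) -> list[str]:
    """Return every substring that looks like a secret-token: URI."""
    return SECRET_TOKEN_URI.findall(text)
```

Because the match is greedy, trailing punctuation such as a full stop at the end of a sentence would be swallowed into the match; a real scanner would likely need to trim that.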
| amichal wrote:
| These secret scanning integrations have been very helpful. We had
| a client ask to take a project open source recently that had
| started a few years ago as closed source. We of course checked
| over the current version of the code and have had linters in
| place to look for secrets for a while but not in the very early
| days of the project. In that one codebase we had:
|
| - AWS IAM token for S3 upload access to a throwaway dev bucket.
| The bucket had already been deleted but still... Got an email
| about it informing me the IAM token had been revoked by AWS
| within 5 minutes
|
| - A Slack webhook notification URL/secret. Committed as an example
| on a working branch and then git rm'ed but still active. Got an
| email about it and token revoked by Slack automatically within 5
| minutes.
|
| - A Mapbox API token. This one was funny. The token was indeed in
| there and functional but was in the docs/sample code for a
| dependency. Still, we got an email within the hour about it and
| were able to investigate.
|
| Edit: In this case we intentionally kept the commit history. A
| safer alternative (and one we normally practice) is to start a
| fresh repo for the open source variant.
| ed25519FUUU wrote:
| An overlooked vector is old commits. It's often times better to
| squash all commits before taking a project open source, which
| is a real shame for obvious reasons.
|
| Commit histories can spill a lot of secrets that are easy to
| overlook.
| psanford wrote:
| There are tools available to help look for this sort of thing
| (for both you and any potential attackers). TruffleHog[1] is
| the first one that comes to mind for me.
|
| I also like shhgit[2] for looking for secrets in
| repositories. (I don't think shhgit will look back in the git
| history for you though).
|
| [1]: https://github.com/dxa4481/truffleHog
|
| [2]: https://github.com/eth0izzle/shhgit
| amichal wrote:
| Thanks! I knew they existed but hadn't investigated for one
| that would look over past history. Will try out truffleHog.
| lstamour wrote:
| Another idea is to use a git commit hook, such as
| https://github.com/cloud-gov/caulking
| _the_inflator wrote:
| Absolutely this!
|
| Same problem here with inner source, that goes open source.
|
| I feel sorry for all our internal committers, however I know
| of "secrets", that went into the commit history. We are still
| considering our options, but tend to opt for deleting our
| commit history entirely and build a wall of fame for the
| former committers.
| jgalt212 wrote:
| My current fear is versioning backup systems. KeePass files
| may now have secure master keys, but maybe the version saved
| 18 mos ago did not.
|
| 1. Get an old copy
| 2. run dictionary attack
| 3. prosper
| danudey wrote:
| > A safer alternative (and one we normally practice) is to
| start a fresh repo for the open source variant.
|
| Note that it's also possible to go back and rewrite history
| (e.g. if you know what the tokens are and where/when they were
| committed), to preserve Git history while cleaning out tokens.
| It can be mildly slow or complicated, but there are tools to
| automate it, such as BFG Repo Cleaner[0] which is relatively
| easy to use (once you learn it).
|
| There are other awesome rewriting tools, like git filter-
| repo[1], but that operates solely on the structure of the
| repository (i.e. it can manipulate basically anything _except_
| file contents). Great for removing unwanted files or
| directories extremely fast, but not good for removing tokens
| (unless you want to remove the entire file the token was in).
|
| [0] https://rtyley.github.io/bfg-repo-cleaner/
|
| [1] https://github.com/newren/git-filter-repo
| [deleted]
| amichal wrote:
| Learning so many options from this thread. I've used these
| tools when I knew what to look for but that's been the tricky
| bit.
|
| psanford also mentioned truffleHog and others, lstamour
| mentioned https://github.com/cloud-gov/caulking which is
| built on gitleaks which looks good. caulking's customized
| list of patterns for gitleaks is here
| https://github.com/cloud-gov/caulking/blob/master/local.toml
| Looks like it would have found the keys in my example case no
| problem.
| anderskaseorg wrote:
| When I helped to take Zulip open-source in 2015, I wrote a
| simple script that scrubbed secrets from the commit history
| using git fast-export and git fast-import. We replaced all our
| secrets with xxxxxxx placeholders, replaced internal customer
| references with dummy names, deleted and renamed certain files,
| and even did some code replacements that caused certain commit
| diffs to become empty so those commits could be removed from
| the history.
|
| https://github.com/zulip/zulip/blob/3.3/tools/zanitizer
|
| https://github.com/zulip/zulip/blob/3.3/tools/zanitizer_conf...
|
| The script was really fast (all ~10000 commits in a few
| minutes), which allowed us to iterate quickly on its
| configuration as we audited using gitk and other tools for
| remaining items to scrub.
|
| Doing this work allowed us to release with an essentially
| complete history going back to the first commit in 2012, which
| has been a really valuable resource for understanding why
| various Zulip subsystems were written the way they were.
|
| Nowadays there are other tools for scrubbing history that might
| be more polished, like BFG:
| https://rtyley.github.io/bfg-repo-cleaner/
| amichal wrote:
| Nice tooling. I've used bfg when we knew what patterns to
| look for. This project didn't generally access private data,
| had a reasonably well behaved team for most of its life (the
| pre-linter & code-review commits were my own damn fault).
| Since it was low risk, I just did a few manual `git log -S
| ...` and moved on. I was still very happy to have github
| catch my throwaway credentials and remind me in the most
| obvious way that these things go in `ENV` and not IN code
| even in examples!
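The "these things go in ENV" advice can be sketched in a few lines. This is only an illustration: the variable name `PYPI_API_TOKEN` and the helper `load_token` are made up for the example and are not something PyPI or its tooling defines. (twine, for instance, reads its own `TWINE_USERNAME` and `TWINE_PASSWORD` variables.)

```python
import os


def load_token(env_var: str = "PYPI_API_TOKEN") -> str:
    """Fetch the token from the environment instead of the source tree."""
    token = os.environ.get(env_var, "")
    if not token:
        # Fail loudly rather than falling back to a hardcoded value.
        raise RuntimeError(f"Set {env_var} in the environment; do not hardcode it.")
    return token
```

Sample code and docs can then call `load_token()` and never contain a working credential, which would have avoided the committed-example cases described above.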
| dthul wrote:
| I was seriously impressed when a few days ago I accidentally
| pushed my secret Discord bot token to Github and literally one
| second later I received a Discord message and an email letting me
| know that I leaked my token and that they deactivated it.
| Kaimunchi wrote:
| Look into this software for device management sclera VDMS -
| https://youtu.be/0_7V3lECy_s
| akhilpotla wrote:
| It would be nice instead if the git command prevented you from
| committing a file with a token in it.
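Git itself does not do this, but a pre-commit hook can approximate it (the caulking/gitleaks setup mentioned elsewhere in the thread is a fuller version of the same idea). The following is a small sketch that inspects only the lines being added in the staged diff. The token pattern is a guess based on the "pypi-" prefix discussed in this thread, not GitHub's or PyPI's actual expression, and the function names are hypothetical.

```python
import re
import subprocess
import sys

# Assumed shape of a PyPI token: "pypi-" plus a long URL-safe string.
TOKEN_PATTERN = re.compile(r"pypi-[A-Za-z0-9_-]{16,}")


def added_lines_with_tokens(diff_text: str) -> list[str]:
    """Return added lines (without the leading '+') that contain a token."""
    hits = []
    for line in diff_text.splitlines():
        # Skip the '+++ b/file' header lines; keep real additions.
        if line.startswith("+") and not line.startswith("+++"):
            if TOKEN_PATTERN.search(line):
                hits.append(line[1:])
    return hits


def check_staged_changes() -> int:
    """Exit status for a hook: 1 blocks the commit, 0 allows it."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = added_lines_with_tokens(diff)
    for hit in hits:
        print(f"possible token staged: {hit.strip()[:24]}...", file=sys.stderr)
    return 1 if hits else 0
```

To try it, a script saved as `.git/hooks/pre-commit` and marked executable would end with `sys.exit(check_staged_changes())`. Hooks are local and can be bypassed with `git commit --no-verify`, so this complements server-side scanning rather than replacing it.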
| simonw wrote:
| In case anyone is interested, it looks like this is the
| implementation on the PyPI side:
| https://github.com/pypa/warehouse/pull/8563
| danudey wrote:
| > Fixes #6051
| > See #7124 reverted in #8555 due to #8554 which is addressed in
| > #8562 (pfew...)
| > Should not be merged before #8562: EDIT:
| >
| > Re-revert of the code. The bug that caused revert was splitted
| > into #8562
|
| Software development in a nutshell, everyone.
| remram wrote:
| FYI pypi tokens look like
| pypi-9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9
|
| The integration means that GitHub knows to recognize this format,
| and calls some API of pypi.org when it finds one so PyPI can
| revoke it.
|
| As always, please allow me to lament that we don't have a
| standard for this, such as secret-
| token:pypi.org/9NX39cdNn0AH1cCl1bMT48eKzf4Rhvw1mipk1FZTPrpR9,
| which would let any system know that this string is a secret and
| that pypi.org should be notified (for example via POST
| pypi.org/.well-known/compromised-secret). See also
| https://news.ycombinator.com/item?id=25978185
| l0b0 wrote:
| One cool data format standard I only recently learned about is
| multihash[1] - a self-describing hash format: the first byte
| represents the hashing algorithm, the second byte represents
| the length of the hash, and the subsequent [length] bytes are
| the actual hash.
|
| Something similar for tokens would be really useful.
|
| [1] https://multiformats.io/multihash/
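The layout described above (algorithm byte, length byte, digest) can be illustrated with a short sketch. It assumes sha2-256, whose multihash code is 0x12, and it treats both header fields as single bytes. The actual specification encodes them as unsigned varints, which happen to fit in one byte for small values like these. The function names are invented for this example.

```python
import hashlib

SHA2_256 = 0x12  # multihash function code for sha2-256


def encode_multihash(data: bytes) -> bytes:
    """Prefix a SHA-256 digest with its function code and length."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256, len(digest)]) + digest


def decode_multihash(mh: bytes) -> tuple[int, bytes]:
    """Split a single-byte-header multihash into (function code, digest)."""
    if len(mh) < 2:
        raise ValueError("multihash too short")
    code, length = mh[0], mh[1]
    digest = mh[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match header")
    return code, digest
```

A token format built the same way would carry an issuer identifier in place of the function code, which is roughly what a fixed prefix such as "pypi-" already provides in text form.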
| nindalf wrote:
| According to the documentation
| (https://docs.github.com/en/developers/overview/secret-
| scanni...), secret issuers specify a regex that can detect
| secrets they've issued. "Be as precise as possible, because
| this will reduce the number of false positives" - that's the
| guideline from GitHub. Github runs the regex on every commit
| that is uploaded and informs the secret provider when a match
| occurs.
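As a rough picture of the flow described above, a scanner can keep one regular expression per issuer and report which issuer each match belongs to. Everything in this sketch is assumed for illustration: the `pypi` pattern is inferred from the "pypi-" prefix mentioned in this thread, `example-svc` is an invented issuer, and the real expressions partners register with GitHub are not shown here.

```python
import re

# Issuer name -> pattern. Tighter patterns mean fewer false positives.
PATTERNS = {
    "pypi": re.compile(r"\bpypi-[A-Za-z0-9_-]{16,}"),
    "example-svc": re.compile(r"\bexsvc_[0-9a-f]{32}\b"),  # hypothetical issuer
}


def scan(text: str) -> list[tuple[str, str]]:
    """Return (issuer, matched string) pairs found in the text."""
    return [
        (issuer, match.group(0))
        for issuer, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
```

In the real integration each pair would be forwarded to the issuer, who then checks whether the string is a live credential before revoking anything. That second check is what keeps a regex false positive from doing harm.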
| kevincox wrote:
| I wonder if false-positives often result in GitHub sending
| secrets to the wrong service.
| danudey wrote:
| I wonder if any of those services have a combination of bad
| regexes and bad validation and could be SQL injected by
| committing a malicious faux-token to GitHub.
| woodruffw wrote:
| Hey there! I designed and implemented PyPI's tokens (although
| not the secret scanning integration).
|
| They're actually just macaroons[1] internally, which means that
| they could easily be upgraded at some point to include a
| reporting URL like you mention.
|
| Just as a tidbit: they were originally prefixed with "pypi:"
| rather than "pypi-", but that colon caused problems for a few
| packaging utilities. Any sort of in-band signaling like that is
| unlikely to gain widespread adoption for exactly that reason
| :-)
|
| [1]: https://en.wikipedia.org/wiki/Macaroons_(computer_science)
| leot wrote:
| > to help keep their customers safe
|
| The elimination of a distinction between "safety" and "security"
| is unhealthy imo, as it leads to a failure to distinguish between
| unintentional harm caused by nature, and intentional harm caused
| by other people.
|
| E.g. "safety first" is only intelligible if it doesn't also
| prevent you from trusting anyone (which is what would be implied
| by "security first" as a general priority).
| hannasanarion wrote:
| Do you lock your doors?
| leot wrote:
| Sometimes. But I can't say that I have a "security first"
| mindset, which seems analogous to "trust no one".
| brian_herman wrote:
| This is great; hopefully we will get GitHub Packages support for
| Python soon. https://github.com/features/packages
| luhn wrote:
| It's on their public roadmap:
| https://github.com/github/roadmap/issues/94
|
| Unfortunately it's marked as "Future," so it's still a ways
| out.
| natemcintosh wrote:
| Can someone explain what exactly this means?
| stevekemp wrote:
| If you commit your AWS secrets/tokens, or similar, inside a
| python script it will now be discovered by github
| automatically.
|
| They have integrations with a bunch of services to recognize
| the tokens, and disable them. This means malicious users can't
| copy/paste them, spin up servers and leave you with a big bill.
| (Ideally, of course it could still happen, but the aim is to
| prevent that kind of thing.)
| JosephRedfern wrote:
| Though this has been true for a while, it's not what this
| announcement is about. This is specifically announcing
| automated scanning and reporting of PyPI keys, which if
| exposed, could allow a bad actor to distribute compromised
| Python packages via PyPi (e.g. pip)
| russfink wrote:
| And this is a potentially huge security issue. Think about
| all the systems software that relies on Python packages.
| geofft wrote:
| If you accidentally commit your PyPI private token to git and
| push it to GitHub, PyPI will detect this and disable the token
| within seconds (because there are absolutely bots who will try
| to find it and abuse it).
| eecc wrote:
| > From today, GitHub will scan every commit to a public
| repository for exposed PyPI API tokens. We will forward any
| tokens we find to PyPI, who will automatically disable them and
| notify their owners.
| [deleted]
| prepend wrote:
| It should reduce the possibility of pypi packages being taken
| over as the result of its owner being careless with their PyPI
| credentials.
|
| I think it's good because the risk of a package being taken
| over is low, but very damaging if it occurs in a widely used
| package.
| nautilus12 wrote:
| I presume it means that if someone accidentally pushes a token
| to a public GitHub repo, it can't be used to hijack the PyPI
| packages that token controls and make them malicious.
| bombcar wrote:
| The API keys I've used (admittedly not many) all seem to be long
| random text strings - how does GitHub detect them? By them being
| used (i.e. in API code) or do they actually have a known format?
| di wrote:
| PyPI API keys have a known format, they start with "pypi-".
| Deathmax wrote:
| GitHub documents the process over at
| https://docs.github.com/en/developers/overview/secret-
| scanni.... You specify a regex, and you check if the secret is
| valid on your end.
| monkeybutton wrote:
| There must be an astounding number of false positives for
| common patterns like N-length string of base64 chars. Could
| someone upload a malicious file with millions of matching
| strings and watch Github DDoS a company's verification
| endpoint?
| neurostimulant wrote:
| I imagine the scanning would be rate-limited on per-repo
| basis.
| lostcolony wrote:
| Probably also a max false positive rate; this isn't a
| guarantee, just a service, so if it detects X false
| positives it could just exclude the repo entirely as
| problematic.
| monkeybutton wrote:
| Yeah, that would be reasonable.
| michaelcampbell wrote:
| "Now you have 2 problems."
| MattConfluence wrote:
| This is a difficult problem indeed, but thankfully it is just
| as difficult for the malicious actors as it is for the "good
| guys". Since various bad guys have presumably been scanning
| public repos for years already, Github and PyPa adding this
| feature is leveling the playing field, even if it is not a 100%
| accurate search algorithm.
| boarnoah wrote:
| Not sure how these particular scanners do it, but during
| security assessments you sometimes use tools that will find all
| strings in an application package with high entropy.
|
| Usually it's junk, but occasionally you do get lucky and find
| tokens.
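The high-entropy approach mentioned above can be sketched with Shannon entropy over the characters of each candidate string. This is only an approximation of what such tools do. The candidate pattern, the 20-character minimum, and the 4.0 bits-per-character threshold are assumptions chosen for the example, not values taken from any particular scanner.

```python
import math
import re
from collections import Counter

# Runs of base64-ish characters long enough to be worth scoring.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits per character, based on character frequencies in s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def high_entropy_strings(text: str, threshold: float = 4.0) -> list[str]:
    """Return candidate strings whose entropy meets the threshold."""
    return [m for m in CANDIDATE.findall(text) if shannon_entropy(m) >= threshold]
```

As the comment says, most hits from a filter like this are junk (hashes, minified assets, UUIDs), which is one reason a known prefix is a much stronger signal.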
___________________________________________________________________
(page generated 2021-03-24 23:00 UTC)