[HN Gopher] Entropy, a CLI that scans files to find high entropy...
___________________________________________________________________
Entropy, a CLI that scans files to find high entropy lines (might
be secrets)
Author : lanfeust
Score : 222 points
Date : 2024-06-04 19:25 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| DLA wrote:
| This looks like a very handy CLI tool. Nice Go code also. Thanks.
| lanfeust wrote:
| thanks!
| trulyhnh wrote:
| GoDaddy open sourced something similar
| https://github.com/godaddy/tartufo
| thsksbd wrote:
| This is very cool, but I have a thought - I see this as a last
| line of defense, and I am concerned that this would give a
| false sense of security, leading people to be more reckless with
| secrets.
| 0cf8612b2e1e wrote:
| Ehhh considering how low the security bar is, I think it is
| better than nothing. If you inherit a code base, make it a
| quick initial action to see how much pain you can expect. In
| practice, I expect a tool like this has so many false positives
| you cannot keep it as an always-running action. It's more of a
| manual review you run occasionally.
|
| I hope that more secrets adopt a GitHub-like convention where
| they are prefaced with an identifier string so that you do not
| require heuristics to detect them.
| lanfeust wrote:
| Indeed. I open-sourced `entropy` after we discovered an
| actual secret leak in our client codebase
| alexchantavy wrote:
| The pie in the sky goal for any security org is to have a cred
| rotation process that is so smooth that you're able to not
| worry about leaked creds because it's so fast and easy to
| rotate them. If the rotation is automated and if it's cheap and
| frictionless to do so, heck why not just rotate them multiple
| times a day.
| bongodongobob wrote:
| No, it's a way to audit and see the modes that your security
| policy is failing. At least that's how I look at it.
| kmoser wrote:
| You could make the same argument for any tool that does not
| provide high security. In fact, security is layered, and no
| single tool should be relied upon as your only security tool.
| You said as much yourself: "I see this as a last line of
| defense," but I don't see how you conclude that this would
| inherently cause people to be more reckless with secrets.
| lm411 wrote:
| Welcome to information security :)
| coppsilgold wrote:
| Note that in an adversarial setting this will only be effective
| against careless opponents.
|
| If you properly encode your secret it will have the entropy of
| its surroundings.
|
| For example you can hide a string of entropy (presumably
| something encrypted) in text as a biased output of an LLM. To
| recover it you would use the same LLM and measure deviations from
| next-token probabilities. This will also fool humans examining it
| as the sentence will be coherent.
| textninja wrote:
| What you described sounds like a very cool idea - LLM-driven
| text steganography, basically - but intentional obfuscation is
| not the problem this tool is trying to solve. To your point
| about secrets with entropy similar to the surrounding text,
| however, I wonder if this can pick up BIP39 Seed Phrases or if
| whole word entropy fades into the background.
| thephyber wrote:
| The LLM adds no value here. Procedural generation in a loop
| until some fitness function (perhaps a frequency-analysis
| metric) is satisfied would do the job.
| eru wrote:
| The LLM is the fitness function.
| buildbot wrote:
| In general (for those unaware) this is called stenography. You
| can hide an image in the lower bits of another image for
| example too.
| dragonwriter wrote:
| _Steganography_ ; stenography is completely different.
| buildbot wrote:
| Thanks
| textninja wrote:
| The weights of the LLM become the private key (so it better be
| a pinned version of a model with open weights), and for most
| practical applications (i.e. unless you're willing to
| complicate your setup with fancy applied statistics and error
| correction) you'd have to use a temperature of 0 as baseline.
|
| Then, having done all that, such steganography may be
| detectable using this very tool by encoding the difference
| between the LLM's prediction and ground truth, but searching
| for substrings with low entropy instead!
| eru wrote:
| You seem to be making some weird assumptions?
|
| Here's how I would do this:
|
| Use some LLM; the weights need to be known to both parties
| the communication.
|
| Producing text with the LLM means repeatedly feeding the LLM
| with the text-so-far to produce a probability distribution
| for the next token. You then use a random number generator to
| pick a token from that distribution.
|
| If you want to turn this into steganography, you first take
| your cleartext and encrypt it with any old encryption system.
| The resulting bitstream should be random-looking, if your
| encryption ain't broken. Now you take the LLM-mechanism I
| described above, but instead of sampling via a random number
| generator, you use your ciphertext as the source of entropy.
| (You need to use something like arithmetic coding to convert
| between your uniformly random-looking bitstream and the
| heavily weighted choices you make to sample your LLM. See
| https://en.wikipedia.org/wiki/Arithmetic_coding)
|
| Almost any temperature will work, as long as it is known to
| both sender and receiver. (The 'temperature' parameter can be
| used to change the distribution, but it's still effectively a
| probability distribution at the end. And that's all that's
| required.)
| textninja wrote:
| I was imagining the message encoded in clear text, not
| encrypted form, because given the lengths required to
| coordinate protocol, keys, weights, and so on, I assumed
| there would be more efficient ways to disguise a message
| than a novel form of steganography. As such, I approached
| it as a toy problem, and considered detection by savvy
| parties to be a feature, not a bug; I imagined something
| more like a pirate broadcast than a secure line, and
| intentionally ignored the presumption about the message
| being encrypted first.
|
| That being said, yes, some of my assumptions were
| incorrect, mainly regarding temperature. For practical
| reasons I was envisioning this being implemented with a
| third-party LLM (e.g., OpenAI's), but I didn't realize those
| could have their RNG seeded as well. There is the
| security/convenience tradeoff to consider, however, and
| simply setting the temperature to 0 is a lot easier to
| coordinate between sender and receiver than adding two
| arbitrary numbers for temperature and seed.
|
| I misspoke, or at least left myself open to
| misinterpretation when I referred to the LLM's weights as a
| "secret key"; I didn't mean the weights themselves had to
| be kept under wraps, but rather I meant that either the
| weights had to be possessed by both parties (with the
| knowledge of which weights to use being the "secret") or
| they'd have to use a frozen version of a third party LLM,
| in which case the knowledge about which version to use
| would become the secret.
|
| As for how I might take a first stab at this if I were to
| try implementing it myself, I might encode the message
| using a low base (let's say binary or ternary) and make the
| first most likely token a 0, the second a 1, and so on, and
| to offset the risk of producing pure nonsense I would
| perhaps skip tokens with too large a gulf between the
| probabilities for the 1st and 2nd most common tokens.
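|
| A toy Go sketch of that rank-based scheme, with a tiny hard-
| coded table standing in for the model's top-two predictions
| (everything here is hypothetical, just to illustrate the
| encode/decode round trip):
|
|     package main
|
|     import (
|         "fmt"
|         "strings"
|     )
|
|     // topTwo stands in for querying a language model: given the
|     // text so far, return its two most likely continuations,
|     // most likely first.
|     func topTwo(sofar string) [2]string {
|         table := map[string][2]string{
|             "":                {"the", "a"},
|             "the":             {"weather", "cat"},
|             "the weather":     {"today", "was"},
|             "the weather was": {"fine", "cold"},
|         }
|         if t, ok := table[sofar]; ok {
|             return t
|         }
|         return [2]string{"and", "then"}
|     }
|
|     // encode hides each bit by emitting the bit-th most likely
|     // next word.
|     func encode(bits []int) string {
|         out := ""
|         for _, b := range bits {
|             next := topTwo(out)[b]
|             if out != "" {
|                 out += " "
|             }
|             out += next
|         }
|         return out
|     }
|
|     // decode re-runs the "model" and records which of its two
|     // predictions was taken at each step.
|     func decode(text string) []int {
|         bits := []int{}
|         sofar := ""
|         for _, w := range strings.Fields(text) {
|             preds := topTwo(sofar)
|             if w == preds[0] {
|                 bits = append(bits, 0)
|             } else if w == preds[1] {
|                 bits = append(bits, 1)
|             }
|             if sofar != "" {
|                 sofar += " "
|             }
|             sofar += w
|         }
|         return bits
|     }
|
|     func main() {
|         cover := encode([]int{0, 0, 1, 0})
|         fmt.Println(cover)         // the weather was fine
|         fmt.Println(decode(cover)) // [0 0 1 0]
|     }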
| eru wrote:
| > I was imagining the message encoded in clear text, not
| encrypted form, [...]
|
| I was considering that, but I came to the conclusion that
| it would be an exceedingly poor choice.
|
| Steganography is there to hide that a message has been
| sent at all. If you make it do double duty as a poor-
| man's encryption, you are going to have a bad time.
|
| > As such, I approached it as a toy problem, and
| considered detection by savvy parties to be a feature,
| not a bug; I imagined something more like a pirate
| broadcast than a secure line, and intentionally ignored
| the presumption about the message being encrypted first.
|
| That's an interesting toy problem. In that case, I would
| still suggest to compress the message, to reduce
| redundancy.
| textninja wrote:
| > If you make it do double duty as a poor-man's
| encryption, you are going to have a bad time.
|
| For the serious use cases you evidently have in mind,
| yes, it's folly to have it do double duty, but at the end
| of the day steganography is an obfuscation technique
| orthogonal to encryption, so the question of whether to
| use encryption or not is a nuanced one. Anyhow, I don't
| think it's fair to characterize this elaborate
| steganography tech as a poor-man's encryption -- LLM
| tokens are expensive!
| eru wrote:
| > Anyhow, I don't think it's fair to characterize this
| elaborate steganography tech as a poor-man's encryption
| -- LLM tokens are expensive!
|
| I guess it's a "rich fool's encryption".
| textninja wrote:
| Haha, sure, you can call it that if you want, but foolish
| is cousin to fun, so one application of this tech would
| be as a comically overwrought way of communicating
| subtext to an adversary who may not be able to read
| between the lines otherwise. Imagine using all this
| highly sophisticated and expensive technology just to
| write "you're an asshole" to some armchair intelligence
| analyst who spent their afternoon and monthly token quota
| decoding your secret message.
|
| Seed for the message above is 42 by the way.
|
| (Just kidding!)
| __MatrixMan__ wrote:
| I imagine a social media site full of bots chatting about
| nonsense. Hidden in the nonsense are humans chatting about
| different nonsense. This way, server costs get paid for by
| advertisers, but it's really only bots that see the ads anyway.
| spullara wrote:
| if the ads aren't effective people won't buy them
| eviks wrote:
| People have been buying ineffective ads since the invention
| of ads
| spullara wrote:
| zero clicks is a little different
| eviks wrote:
| Bots do click in real ad fraud, so your moved goalpost
| isn't all that solid
| spullara wrote:
| sorry, conversions is really what I meant. if the bots
| are also buying the stuff then it would work.
| otabdeveloper4 wrote:
| Not really, advertising is really the only field of human
| endeavour that is both data-driven and results-oriented.
|
| (That still doesn't stop smart people from committing fraud,
| but that is a different story.)
| benterix wrote:
| Unfortunately I beg to differ. I worked for several
| companies where we, the management, clearly saw that the
| results were very poor (for Facebook ads, for example)
| but continued to invest because there is a defined budget
| for it and so on. It was like this last year and 20 years
| ago.
| jazzyjackson wrote:
| these companies should be outcompeted by firms that don't
| blow a million dollars a month paying out to click
| fraudsters but alas the market is not perfectly
| competitive
|
| is it a cargo cult? it works for coca cola so maybe if we
| just spend a little more we'll see returns...
| benterix wrote:
| Yes, I feel it might be cargo cult, at least in part. The
| argument I usually heard was that "But other companies
| are doing that, too".
| otabdeveloper4 wrote:
| Yes, most fraud is inside the corporate structure. Not
| shady "hacker" types in Romania.
| j16sdiz wrote:
| It's called Twitter.
|
| It's not nonsense, just cat videos and porn.
| __MatrixMan__ wrote:
| Hmm yes, sensical things those.
|
| Are you proposing that they're really only posted as a
| medium for encoding something else that we're not privy to?
| If so, somebody took my idea.
| dools wrote:
| I think the opponent in the proposed use case for this tool is
| the gun you're pointing at your foot, and this tool prevents
| you from pulling the trigger.
| BeefWellington wrote:
| See also:
|
| - trufflehog: https://github.com/trufflesecurity/trufflehog
|
| - detect-secrets: https://github.com/Yelp/detect-secrets
|
| - semgrep secrets: https://semgrep.dev/products/semgrep-secrets
|   (paid, but may be included in existing licenses in some cases)
| bbno4 wrote:
| Also see PyWhat for both interesting strings and secrets
| https://github.com/bee-san/pyWhat
| jonstewart wrote:
| noseyparker is another good one: https://github.com/praetorian-
| inc/noseyparker
|
| I think these solutions are all much better for finding secrets
| than something naive based on entropy. Yes, entropy is more
| general but these are well established tools that have been
| through the fire of many, many data sets.
| upg1979 wrote:
| See also:
|
| https://github.com/gitleaks/gitleaks
| xedeon wrote:
| ggshield from GitGuardian has been great for us.
|
| Their free service can also auto detect and notify you of leaked
| secrets, passwords or high entropy lines from your online repos.
|
| https://github.com/GitGuardian/ggshield
| icapybara wrote:
| Gonna have to explain how a "high entropy line" is calculated and
| why it might be secrets.
| ngonch wrote:
| For example: https://complexity-calculator.com/
| daemonologist wrote:
| Entropy of information is basically how well it can be
| compressed. Random noise usually doesn't compress much at all
| and thus has high entropy, whereas written natural language can
| usually be compressed quite a bit. Since many passwords and
| tokens will be randomly generated or at least nonsense, looking
| for high entropy might pick up on them.
|
| This package seems to be measuring entropy by counting the
| occurrences of each character in each line, and ranking lines
| with a high proportion of repeated characters as having low
| entropy. I don't know how closely this corresponds with the
| precise definition. Source:
| https://github.com/EwenQuim/entropy/blob/f7543efe130cfbb5f0a...
|
| More:
| https://en.wikipedia.org/wiki/Entropy_(information_theory)
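|
| For the curious, a minimal Go sketch of that per-character
| calculation (illustrative only, not the package's actual code):
|
|     package main
|
|     import (
|         "fmt"
|         "math"
|     )
|
|     // charEntropy returns the Shannon entropy of s in bits per
|     // character, based only on how often each character occurs.
|     func charEntropy(s string) float64 {
|         counts := map[rune]float64{}
|         var total float64
|         for _, r := range s {
|             counts[r]++
|             total++
|         }
|         if total == 0 {
|             return 0
|         }
|         h := 0.0
|         for _, c := range counts {
|             p := c / total
|             h -= p * math.Log2(p)
|         }
|         return h
|     }
|
|     func main() {
|         fmt.Printf("%.2f\n", charEntropy("add the user to the group")) // lower
|         fmt.Printf("%.2f\n", charEntropy("hJ6&aP9xQz3mW8vL0s2Ke+rT5u")) // higher
|     }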
| eru wrote:
| Of course, this heuristic fails for weak passwords.
|
| And it fails for passphrases like 'correct battery horse
| staple', which have a large enough total entropy to be good
| passwords, but have a low entropy per character.
| dumbo-octopus wrote:
| 4 diceware words is hardly a good password. It's ~51 bits
| of entropy, about the same as 8 random ascii symbols. It
| could be trivially cracked in less than an hour. Your
| average variable name assigned to the result of an object
| name with a method name called with a couple parameter
| names has much more entropy.
| conradludgate wrote:
| If you can crack a single 52-bit password in an hour, that
| suggests you can crack a 40-bit password every second. That's
| about 1 trillion hashes per second.
| otabdeveloper4 wrote:
| Salts and timeouts made that password cracking technique
| obsolete anyways.
| dumbo-octopus wrote:
| Only for online access. Offline access is still a thing,
| and in no way "obsolete".
| dumbo-octopus wrote:
| 350B H/s was achieved in 2012 on consumer hardware.
| That's over 12 years ago, and several lifetimes of GPU
| improvements ago. 4 diceware words is simply not
| appropriate for anything remotely confidential, and it is
| bad for the community to pretend otherwise.
|
| https://theworld.com/~reinhold/dicewarefaq.html
| eru wrote:
| Just imagine my example used 8 words.
| dumbo-octopus wrote:
| But it didn't. It perpetuated the exceedingly common myth
| that 52 bits is somehow enough. This has been considered
| bad practice for well over a decade now.
| https://theworld.com/~reinhold/dicewarefaq.html
| baq wrote:
| So you do random capital words, random punctuation and
| add a number somewhere and you're at 60. Add more for
| whatever threat model you're trying to be secure against.
|
| https://beta.xkpasswd.net/
| eru wrote:
| The random punctuation sort-of defeats the point, doesn't
| it?
|
| Otherwise, I agree.
| baq wrote:
| Not sure; you can use the same character instead of a
| space and still get a few bits. Of course different ones
| would be better, but again, depends on how many bits you
| actually need.
| eru wrote:
| I thought the point was to construct a password that's
| secure enough _and_ easy to remember for humans.
|
| Adding random punctuation helps with the former, but
| might interfere with the latter. (In the extreme case,
| you just generate completely random strings character for
| character. That's the most secure, but the least
| memorable.)
| baq wrote:
| > enough
|
| key word here, I think we agree ;)
| hamasho wrote:
| I didn't know what entropy means in software, so here's the
| definition[0]:
|
| ---- Software entropy is a measure of the disorder or
| complexity of a software system. It is a natural tendency for
| software entropy to increase over time, as new features are
| added and the codebase becomes more complex.
|
| High entropy in software development means that the code is
| difficult to understand, maintain, and extend. It is often
| characterized by:
|
| - Duplicated code: The same code or functionality is repeated
|   in multiple places, which can make it difficult to find and
|   fix bugs.
| - Complex logic: The code is difficult to follow and
|   understand, which can make it difficult to add new features
|   or fix bugs without introducing new ones.
| - Poor documentation: The code is not well-documented, which
|   can make it difficult for new developers to understand and
|   contribute to the codebase.
| - Technical debt: The code has been patched and modified over
|   time without proper refactoring, which can lead to a tangled
|   and cluttered codebase.
|
| Low entropy in software development means that the code is
| well-organized, easy to understand, and maintain. It is often
| characterized by:
|
| - Well-designed architecture: The code is structured in a
|   logical way, with clear separation of concerns.
| - Consistent coding style: The code follows a consistent
|   coding style, which makes it easy to read and understand.
| - Comprehensive documentation: The code is well-documented,
|   with clear explanations of the code's purpose and
|   functionality.
| - Minimal technical debt: The code has been refactored
|   regularly to remove technical debt, which makes it easy to
|   add new features and fix bugs without introducing new ones.
|
| [0] https://www.kisphp.com/python/high-and-low-entropy-in-
| softwa...
| itemize wrote:
| Thanks for the search. This is about textual entropy, however,
| so I am not sure that definition is applicable.
| eru wrote:
| Yes, it's not applicable. See
| https://en.wikipedia.org/wiki/Entropy_(information_theory)
| for something more applicable.
| hamasho wrote:
| Oh, thanks for pointing out.
| tonyabracadabra wrote:
| Interesting! Could a similar measurement be applied to finding
| redundant code (i.e., low entropy) with some extra work?
| cowsaymoo wrote:
| I transcend this problem by making all my database passwords
| 'abcd'
| kgeist wrote:
| The tool found "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu
| vwxyz1234567890" in our codebase as a high entropy line :)
| g15jv2dp wrote:
| Well, it is...
| saurik wrote:
| I mean, it certainly has a low Kolmogorov complexity (which
| is what I would really want to be measuring somehow for
| this tool... note that I am not claiming that is possible:
| just an ideal); I am unsure how that affects the
| related bounds on Shannon entropy, though.
| jraph wrote:
| ...a very verbose way to match alphanumeric characters :-)
| ngneer wrote:
| Then use it as your password ;)
| josephg wrote:
| You can use LLMs as compressors, and I wonder how it would go
| with that.
|
| The approach is simple: Turn the file into a stream of
| tokens. For each token, ask a language model to generate the
| full set of predictions based on context, and sort based on
| likelihood. Look where the actual token appears in the sorted
| list. Low entropy symbols will be near the start of the list,
| and high entropy tokens near the end.
|
| I suspect most language models would deal with your alphabet
| example just fine, while still correctly spotting passwords
| and API keys. It would be a fun experiment to try!
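|
| A rough sketch of that ranking idea in Go, with character
| frequencies from the surrounding text standing in for a real
| language model's predictions (purely illustrative; an actual
| attempt would rank whole tokens under a real model):
|
|     package main
|
|     import (
|         "fmt"
|         "sort"
|     )
|
|     // avgRank scores how surprising line is given context: each
|     // character is ranked by how common it is in context (0 =
|     // most common) and the ranks are averaged. A language model
|     // would rank tokens by predicted probability instead.
|     func avgRank(context, line string) float64 {
|         counts := map[rune]int{}
|         for _, r := range context {
|             counts[r]++
|         }
|         runes := make([]rune, 0, len(counts))
|         for r := range counts {
|             runes = append(runes, r)
|         }
|         sort.Slice(runes, func(i, j int) bool {
|             return counts[runes[i]] > counts[runes[j]]
|         })
|         rank := map[rune]int{}
|         for i, r := range runes {
|             rank[r] = i
|         }
|         sum, n := 0, 0
|         for _, r := range line {
|             ri, ok := rank[r]
|             if !ok {
|                 ri = len(runes) // never seen in context: maximally surprising
|             }
|             sum += ri
|             n++
|         }
|         if n == 0 {
|             return 0
|         }
|         return float64(sum) / float64(n)
|     }
|
|     func main() {
|         ctx := `func main() { fmt.Println("hello, world") }`
|         fmt.Println(avgRank(ctx, `fmt.Println("hello")`)) // low
|         fmt.Println(avgRank(ctx, `hJ6&aP9xQz3mW8vL0`))    // high
|     }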
| randomtoast wrote:
| Reminds me of https://xkcd.com/936/ I think "correct horse
| battery staple" has a low entropy, since it is just ordinary
| looking words (strings).
| josephg wrote:
| A quick Google search suggests English has about 10 bits of
| entropy per word. Having a long password like that can still
| have high total entropy I suppose, but it has a low entropy
| _density_.
| kqr wrote:
| Maybe 10 bits is the average over the dictionary - which is
| what matters here, but over normal text it is significantly
| less. Our best current estimation for relatively high-level
| text (texts published by the EU) is 6 bits per word[1].
|
| However, as our methods of predicting text improve, this
| number is revised down. LLMs ought to have made a serious
| dent in it, but I haven't looked up any newer results.
|
| Anyway, all of this is to say that which words are chosen
| matters, but how they are put together perhaps matters more.
|
| [1]: http://arxiv.org/pdf/1606.06996
| nvy wrote:
| Username: postgres
|
| Password: postgres
| krick wrote:
| Are there any good posts about the use of entropy for tasks
| like this? I have been wondering for quite some time how people
| actually use it and whether it is effective, but never actually
| got around to investigating the problem myself.
|
| First of all, how to define "entropy" for text is unclear in
| the first place. Here it's as simple as `-Sum(x log(x))` where
| x = countOccurrences(char) / len(text). And that raises a lot
| of questions about how well this actually works. How long does
| a string need to be for this to work? Is there a roughly
| constant entropy for natural languages? Is there a better
| approach? I mean, it seems there _must_ be: "obviously" "vorpal"
| must have lower "entropy" than "hJ6&:a". You and I both "know"
| that because 1) the latter "seems" to use a much larger
| character set than natural language; 2) even if it didn't, the
| ordering of characters matters; the former just "sounds" like a
| real word, despite being made up by Carroll. Yet this "entropy"
| everybody seems to use has no idea about any of it. Both will
| have exactly the same "entropy". So, ok, maybe this does work
| well enough for yet-another-github-password-searcher. But is
| there anything better? Is there a more meaningful metric of
| randomness for text?
|
| Dozens of projects like this, everybody using "entropy" as if
| it's something obvious, but I've never seen proper research on
| the subject.
| hackinthebochs wrote:
| Entropy is a measure of complexity or disorder of a signal. The
| interesting part is that the disorder is with respect to the
| proper basis or dictionary. Something can look complex in one
| encoding but be low entropy in the right encoding. You need to
| know the right basis, or figure it out from the context, to
| accurately determine the entropy of a signal. A much stronger
| way of building a tool like the OP is to have a few pre-
| computed dictionaries for a range of typical source texts
| (source code, natural language), then encode the string against
| each dictionary, comparing the compressibility of the string. A
| high entropy string like a secret will compress poorly against
| all available dictionaries.
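|
| Go's compress/flate can be primed with a preset dictionary, so
| a rough sketch of that comparison might look like this (the
| dictionaries and sample strings are made up for illustration):
|
|     package main
|
|     import (
|         "bytes"
|         "compress/flate"
|         "fmt"
|         "log"
|     )
|
|     // sizeWithDict returns the DEFLATE-compressed size of data
|     // when the compressor is primed with dict as a preset
|     // dictionary.
|     func sizeWithDict(data, dict []byte) int {
|         var buf bytes.Buffer
|         w, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
|         if err != nil {
|             log.Fatal(err)
|         }
|         w.Write(data)
|         w.Close()
|         return buf.Len()
|     }
|
|     func main() {
|         // Tiny stand-ins for "typical code" and "typical English"
|         // dictionaries; real ones would be built from large corpora.
|         codeDict := []byte(`if err != nil { return err } func main() { fmt.Println(x) }`)
|         englishDict := []byte(`the quick brown fox jumps over the lazy dog and runs away`)
|
|         prose := []byte(`the lazy dog jumps over the fox`)
|         secret := []byte(`hJ6aP9xQz3mW8vL0s2KepRT5u7Yc`)
|
|         for _, dict := range [][]byte{codeDict, englishDict} {
|             fmt.Println(sizeWithDict(prose, dict), sizeWithDict(secret, dict))
|         }
|         // Expect the prose to compress better against the English
|         // dictionary, and the random-looking string to compress
|         // poorly against both.
|     }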
| jazzyjackson wrote:
| bookmarking to think about later... does this hold for
| representing numbers as one base compared to another?
|
| Regarding a prime as having higher entropy / less structure
| than say a perfect square or highly divisible number
|
| a prime is a prime in any base, but the number of divisors
| will differ in non-primes, if the number is divisible by the
| base then it may appear to have more structure (smaller
| function necessary to derive, kolmogorov style),
|
| does prime factorization have anything to do with this? i can
| almost imagine choosing a large non-prime whose divisibility is
| only obvious with a particular base such that the base
| becomes the secret key - base of a number is basically
| specifying your dictionary, no?
| eigenket wrote:
| I don't think there's any interesting difference with
| different bases. Usually the base you represent stuff in is
| a relatively small number (because using very large bases
| is already wildly inefficient). I think it only usually
| makes sense to consider constant or logarithmic bases. If
| your base is scaling linearly with your number then things
| are going to be weird.
|
| The problem of finding factors is only complex when you're
| asking about relatively big factors. If you're looking for
| constant or log sized factors you can just do trial
| division and find them.
| maCDzP wrote:
| Also bookmarking to think about it.
|
| My mind drifted towards Fourier transform. Using the
| transform as a way of describing a system with less
| entropy?
|
| Or am I butchering all of mathematics by making this
| comparison?
| hackinthebochs wrote:
| There's some precedence for that. I'm pretty sure
| wavelets are SOTA for compression.
| hackinthebochs wrote:
| Changing the base of number representation with a random
| basis feels like XORing a string with a random string,
| which is to say you're adding entropy equal to the random
| string. My thinking is that for any number representation
| M, you can get any other number representation N given a
| well-chosen base. So when presented with the encoded N, the
| original number could be any other number with the same
| number of digits. But once you put reasonable bounds on the
| base, you lose that flexibility and end up adding
| negligible entropy.
| GTP wrote:
| > So when presented with the encoded N, the original
| number could be any other number with the same number of
| digits
|
| Not necessarily the same number of digits, when changing
| the base the number of digits may change as well. E.g.,
| decimal 8 becomes 1000 in binary.
| GTP wrote:
| > the number of divisors will differ in non-primes
|
| Could you please present an example of this?
| krick wrote:
| I only briefly browsed the code, but this seems to be roughly
| what yelp/detect-secrets does.
|
| Anyway, that doesn't really answer my question. To summarize
| answers in this thread, I think PhilipRoman has captured the
| essence of it: strictly speaking, the idea of entropy of a
| _known_ string is nonsense. So, as I suspected, the
| information-theory definition isn't meaningfully applicable to the
| problem. And as other commenters like you mentioned, what we
| are _really_ trying to measure is basically Kolmogorov
| complexity, which, strictly speaking, is incomputable, but
| measuring the compression rate for some well-known popular
| compression algorithm (allegedly) seems to be a good enough
| estimate, empirically.
|
| But I think it's still an interesting linguistic question.
| Meaningful or not, it's well defined: so does it appear
| to work? Are there known constants for different kinds of
| text for any of these (or other) metrics? I would suspect
| this should have been explored already, but apparently neither
| I nor anybody in this thread has ever stumbled upon such an
| article.
| wwalexander wrote:
| The Kolmogorov complexity of an arbitrary string is
| uncomputable.
| PhilipRoman wrote:
| Entropy of a particular string isn't a rigorous mathematical
| idea, since by definition a known string can take only one
| value, so its "entropy" is zero bits. The
| reason why we can distinguish non-random data from random is
| that only a small subset of all possible states are considered
| useful for humans, and since we have an idea what that subset
| looks like, we can try to estimate what process was used to
| generate a particular string.
|
| There are of course statistical tests like
| https://en.wikipedia.org/wiki/Diehard_tests, which are good
| enough for distinguishing low entropy and high entropy data,
| but current pseudo-random number generators have no problem
| passing all of those, even though their actual "entropy" is
| just the seed plus (approximately) the complexity of the algorithm.
| josephg wrote:
| If you're looking for a rigorous mathematical idea, what
| people are trying to measure is the Kolmogorov complexity of
| the code. Measuring the compressed length is a rough estimate
| of that value.
|
| https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
| PhilipRoman wrote:
| Yes, although (and here my understanding of Kolmogorov
| complexity ends) it still depends heavily on the choice of
| language and it seems to me like "aaaaaaaaa" is only less
| complex than "pSE+4z*K58" due to assuming a sane, human-
| centric language which is very different from the "average"
| of all possible languages. Which then leads me to wonder
| how to construct an adversarial turing-complete language
| which has unintuitive Kolmogorov complexities.
| kqr wrote:
| Kolmogorov complexity conventionally refers to the Turing
| machine as the base for implementation. This indeed makes
| repeated letters significantly less complex than that
| other string. (If you want intuition for how much code is
| needed to do something on a Turing machine, learn and
| play around a bit with Brainfuck. It's actually quite
| nice for that.)
| josephg wrote:
| > due to assuming a sane, human-centric language
|
| There's no requirement that the K-complexity is measured
| in a human centric language. Arguably all compression
| formats are languages too, which can be executed to
| produce the decompressed result. They are not designed to
| be human centric at all, and yet they do a surprisingly
| decent job at providing an estimate (well, upper bound)
| on Kolmogorov complexity. - As we can see in this
| program.
| g15jv2dp wrote:
| Why would I need to install go to run this tool? I thought one
| advantage of go was that devs could just distribute a single
| binary file that works...
| benterix wrote:
| Because it's a security tool so trusting a binary upfront
| defeats the purpose. With source you at least have the option
| to inspect what it really does.
| menacingly wrote:
| does the stated purpose of the tool influence whether or not
| you can trust it?
| spoonjim wrote:
| If you're trying to improve the security of your product by
| running random binaries from the Internet you're going to
| have a bad time
| saagarjha wrote:
| That's how most people run compilers
| benterix wrote:
| This is argumentum ad absurdum - there is a reason why
| trusting your kernel and compiler is a reasonable
| compromise, even though there might be security issues in
| them, while trusting random pieces of software downloaded from
| the Internet is not.
| Ensorceled wrote:
| Wait ... you download random compilers from the internet?
| Or are you asserting equivalence between getting go from
| Google or Xcode from Apple and a random home-brew
| install?
| alias_neo wrote:
| I think that question is a little backwards.
|
| Certain tools are more likely to be used by people working
| in spaces where they should/must be less trusting.
|
| If there was a tool (there is) to scan my platform
| deployment against some NCSC/NSA guidance for platform
| security, and I wanted to use it, I'm likely operating in a
| space that should consider being cautious about running
| random tools I find on the internet.
| g15jv2dp wrote:
| Uh? OP just released a docker image and wants to release a
| homebrew thingy. Even assuming that what you say is somehow
| sensible, it's not the reason, no. You're just grasping at
| straws.
| lanfeust wrote:
| I'd love to have it on _homebrew_ but my PR is denied, so I'll
| have to create my own brew tap or convince them to accept it.
|
| I'll also create a docker image.
|
| I just didn't expect this much popularity so the repo isn't
| 100% ready, to be honest.
| drexlspivey wrote:
| Making a tap is super easy, you just upload a file with 5 LoC
| to github. I wouldn't even bother with brew core.
| lanfeust wrote:
| Oh ok I'll try then
| lanfeust wrote:
| The docker container is now ready to use and documented on the
| home page
| diggan wrote:
| Just awaiting the Kubernetes setup/Helm charts now and soon
| almost anyone can use it!
| kqr wrote:
| Interesting. If I had to do this, I would have done something
| like
|
|     perl -lne 'next unless $_; $z = qx(echo "$_" | gzip | wc -c);
|                printf "%5.2f %s\n", $z/length($_), $_'
|
| on the principle that high entropy means it compresses badly.
| However, that uses each line as the dictionary, rather than the
| entire file, so it has a little trouble with very short lines
| which compress badly.
|
| It did react to this line
|
|     return map { $_ > 1 ? 1 : ($_ < 0 ? 0 : $_) } @vs;
|
| which is valid code but indeed seems kind of high in entropy. I
| was also able to fool it into not detecting a high-entropy line by
| adding a comment of natural English to it.
|
| I'm on the go but it would be interesting to see comparisons
| between the Perl command and this tool. The benefit of the Perl
| command is that it would run out of the box on any non-Windows
| machine so it might not need to be as powerful to gain adoption.
| blixt wrote:
| I guess you could take all lines in the file except the one
| you're testing and measure the filesize, then add the line and
| measure again. The delta should then be more fair. You could
| even do this by concatenating all code files and then testing
| line by line across the entire repo, but that would probably be
| too slow.
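|
| As a rough sketch of that delta idea in Go, with gzip as the
| entropy estimator (the file contents are invented for the
| example):
|
|     package main
|
|     import (
|         "bytes"
|         "compress/gzip"
|         "fmt"
|     )
|
|     // gzSize returns the gzip-compressed size of data, in bytes.
|     func gzSize(data []byte) int {
|         var buf bytes.Buffer
|         w := gzip.NewWriter(&buf)
|         w.Write(data)
|         w.Close()
|         return buf.Len()
|     }
|
|     func main() {
|         rest := []byte("package config\n\nfunc load() error {\n\treturn nil\n}\n")
|         line := []byte("var apiKey = \"hJ6aP9xQz3mW8vL0s2Ke\"\n")
|
|         // Extra compressed bytes contributed by this one line when
|         // appended to the rest of the file; the fixed gzip header
|         // overhead cancels out in the subtraction.
|         withLine := append(append([]byte{}, rest...), line...)
|         delta := gzSize(withLine) - gzSize(rest)
|         fmt.Printf("%d extra compressed bytes for %d raw bytes\n", delta, len(line))
|     }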
| GuB-42 wrote:
| I would use a better compressor than gzip but I have done this
| trick several times.
|
| xz or zstd may be better choices, or you can look at Hutter
| Prize [1] winners for best compression and therefore best
| entropy estimate.
|
| [1] http://prize.hutter1.net/
| nequo wrote:
| > best compression and therefore best entropy estimate
|
| That's a good point. But the Hutter Prize is for compressing
| a 1 GB file. On inputs as short as a line of code, gzip
| doesn't do so badly. For a longer line:
|
|     $ INPUT=' bool isRegPair() const { return kind() == RegisterPair || kind() == LateRegisterPair || kind() == SomeLateRegisterPair; }'
|     $ echo "$INPUT" | gzip | wc -c
|     95
|     $ echo "$INPUT" | bzip2 | wc -c
|     118
|     $ echo "$INPUT" | xz -F xz | wc -c
|     140
|     $ echo "$INPUT" | xz -F lzma | wc -c
|     97
|     $ echo "$INPUT" | zstd | wc -c
|     92
|
| For a shorter line:
|
|     $ INPUT=' ASSERT(regHi().isGPR());'
|     $ echo "$INPUT" | gzip | wc -c
|     48
|     $ echo "$INPUT" | bzip2 | wc -c
|     73
|     $ echo "$INPUT" | xz -F xz | wc -c
|     92
|     $ echo "$INPUT" | xz -F lzma | wc -c
|     51
|     $ echo "$INPUT" | zstd | wc -c
|     46
| josephg wrote:
| I learned Go many years ago doing some advent of code problems.
| As I solved each problem, my housemate pestered me for a look
| and then rewrote my solutions (each needing 10-50 lines of go)
| into Ruby one-liners. All the while making fun of Go and my
| silly programs. I wasn't intending to, but I ended up learning
| a lot of Ruby that night too.
|
| Thank you for continuing the tradition.
| crazygringo wrote:
| Are there any command-line tools for zip or similar that allow
| you to predefine a dictionary over one or more files, and then
| use that dictionary to compress small files?
|
| Which would require the dictionary as a separate input when
| decompressing, of course?
| kqr wrote:
| gzip (or really DEFLATE) does actually come with a small
| predefined dictionary (the "fixed Huffman codes" in the RFC)
| which is somewhat optimised for latin letters in UTF-8, but I
| have not verified that this is indeed what ends up being used
| when compressing individual lines of source code.
| crazypython wrote:
| It would be interesting to see a variant of this that used a
| small language model to measure entropy.
| saagarjha wrote:
| Why would you do that when measuring entropy is easy to do with
| a normal program
| saagarjha wrote:
| I assume this will have a bad time on compressed files?
| lanfeust wrote:
| .zip extension is ignored by default along with other binary
| formats :)
| saagarjha wrote:
| Right but like .tar.gz, etc. are also a thing
| lanfeust wrote:
| You can just add your extensions to ignore with --ignore-
| ext. But I'll add .tar.gz and .tar.bz2 since they are
| widely used.
| frumiousirc wrote:
| Or, have the tool recursively read the .tar files'
| contents.
| seethishat wrote:
| This reminds me of the program 'ent' (which I have used for a
| very long time)
|
| https://fourmilab.ch/random/
| MarkMarine wrote:
| Another way to do this would be to compress the file and compare
| the compressed size to the uncompressed size.
|
| Encrypted files do not compress well compared to code. I saw a
| PhD thesis that postulated an inverse ratio of compression
| efficiency to data-mining performance; this would be the opposite.
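|
| A minimal sketch of that whole-file ratio check in Go (gzip as
| the compressor; a ratio near 1.0 suggests already-high-entropy
| content):
|
|     package main
|
|     import (
|         "bytes"
|         "compress/gzip"
|         "fmt"
|         "log"
|         "os"
|     )
|
|     func main() {
|         // Usage: ratio FILE -- prints compressed size / raw size.
|         raw, err := os.ReadFile(os.Args[1])
|         if err != nil {
|             log.Fatal(err)
|         }
|         var buf bytes.Buffer
|         w := gzip.NewWriter(&buf)
|         w.Write(raw)
|         w.Close()
|         // Encrypted or random content stays near (or above) 1.0;
|         // source code and prose usually land well below that.
|         fmt.Printf("%.2f\n", float64(buf.Len())/float64(len(raw)))
|     }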
| p0w3n3d wrote:
| xkcd.com/936/
| blixt wrote:
| I guess a language model like Llama 3 could model surprise on a
| token-by-token basis and detect the areas that are most
| surprising, i.e. highest entropy. Because as one example
| mentioned, the entire alphabet may have high entropy in some
| regards, but it should be very unsurprising to a code-aware
| language model that in a codebase you have the Base62 alphabet as
| a constant.
| weipe-af wrote:
| It would be useful if it also trawled through the full git
| history of the project - a secret could have been checked in and
| later removed, but still exist in the history.
| thomascountz wrote:
| Thank you DrJones for asking what a high entropy string is
| several years ago[0] and linking to a good article on it.[1]
|
| [0] https://news.ycombinator.com/item?id=13304641
|
| [1] https://www.splunk.com/en_us/blog/security/random-words-
| on-e...
| baryphonic wrote:
| Neat tool.
|
| Would be cool if this CLI could have a flag to read .gitignore
| and exclude all of the contents automatically.
|
| Also it might be cool to have different strategies for detecting
| secrets, e.g. Kolmogorov complexity as other comments have noted.
___________________________________________________________________
(page generated 2024-06-05 23:02 UTC)