[HN Gopher] Entropy, a CLI that scans files to find high entropy...
       ___________________________________________________________________
        
       Entropy, a CLI that scans files to find high entropy lines (might
       be secrets)
        
       Author : lanfeust
       Score  : 222 points
       Date   : 2024-06-04 19:25 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | DLA wrote:
       | This looks like a very handy CLI tool. Nice Go code also. Thanks.
        
         | lanfeust wrote:
         | thanks!
        
       | trulyhnh wrote:
       | GoDaddy open-sourced something similar:
       | https://github.com/godaddy/tartufo
        
       | thsksbd wrote:
       | This is very cool, but I have a thought - I see this as a last
       | line of defense, and I am concerned that it would give a false
       | sense of security, leading people to be more reckless with
       | secrets.
        
         | 0cf8612b2e1e wrote:
         | Ehhh, considering how low the security bar is, I think it is
         | better than nothing. If you inherit a code base, make it a
         | quick initial action to see how much pain you can expect. In
         | practice, I expect a tool like this has so many false
         | positives that you cannot keep it as an always-running
         | action; it is more of a manual review you run occasionally.
         | 
         | I hope that more secrets adopt a GitHub-like convention where
         | they are prefixed with an identifier string, so that you do
         | not require heuristics to detect them.
        
           | lanfeust wrote:
           | Indeed. I open-sourced `entropy` after we discovered an
           | actual secret leak in our client codebase.
        
         | alexchantavy wrote:
         | The pie-in-the-sky goal for any security org is a cred
         | rotation process so smooth that you don't have to worry
         | about leaked creds, because it's so fast and easy to rotate
         | them. If the rotation is automated and if it's cheap and
         | frictionless to do so, heck why not just rotate them multiple
         | times a day.
        
         | bongodongobob wrote:
         | No, it's a way to audit and see the modes in which your
         | security policy is failing. At least that's how I look at it.
        
         | kmoser wrote:
         | You could make the same argument for any tool that does not
         | provide high security. In fact, security is layered, and no
         | single tool should be relied upon as your sole defense. You
         | said as much yourself: "I see this as a last line of
         | defense," but I don't see how you conclude that this would
         | inherently cause people to be more reckless with secrets.
        
         | lm411 wrote:
         | Welcome to information security :)
        
       | coppsilgold wrote:
       | Note that in an adversarial setting this will only be effective
       | against careless opponents.
       | 
       | If you properly encode your secret it will have the entropy of
       | its surroundings.
       | 
       | For example you can hide a string of entropy (presumably
       | something encrypted) in text as a biased output of an LLM. To
       | recover it you would use the same LLM and measure deviations from
       | next-token probabilities. This will also fool humans examining
       | it, as the sentence will be coherent.
        
         | textninja wrote:
         | What you described sounds like a very cool idea - LLM-driven
         | text steganography, basically - but intentional obfuscation is
         | not the problem this tool is trying to solve. To your point
         | about secrets with entropy similar to the surrounding text,
         | however, I wonder if this can pick up BIP39 seed phrases or
         | if whole-word entropy fades into the background.
        
           | thephyber wrote:
           | The LLM adds no value here. Procedural generation in a
           | loop until some fitness function (perhaps a frequency-
           | analysis metric) is satisfied would do the same job.
        
             | eru wrote:
             | The LLM is the fitness function.
        
         | buildbot wrote:
         | In general (for those unaware) this is called stenography. You
         | can hide an image in the lower bits of another image for
         | example too.
        
           | dragonwriter wrote:
           | _Steganography_; stenography is completely different.
        
             | buildbot wrote:
             | Thanks
        
         | textninja wrote:
         | The weights of the LLM become the private key (so it better be
         | a pinned version of a model with open weights), and for most
         | practical applications (i.e. unless you're willing to
         | complicate your setup with fancy applied statistics and error
         | correction) you'd have to use a temperature of 0 as baseline.
         | 
         | Then, having done all that, such steganography may be
         | detectable using this very tool by encoding the difference
         | between the LLM's prediction and ground truth, but searching
         | for substrings with low entropy instead!
        
           | eru wrote:
           | You seem to be making some weird assumptions?
           | 
           | Here's how I would do this:
           | 
           | Use some LLM; the weights need to be known to both parties
           | in the communication.
           | 
           | Producing text with the LLM means repeatedly feeding the LLM
           | with the text-so-far to produce a probability distribution
           | for the next token. You then use a random number generator to
           | pick a token from that distribution.
           | 
           | If you want to turn this into steganography, you first take
           | your cleartext and encrypt it with any old encryption system.
           | The resulting bitstream should be random-looking, if your
           | encryption ain't broken. Now you take the LLM-mechanism I
           | described above, but instead of sampling via a random number
           | generator, you use your ciphertext as the source of entropy.
           | (You need to use something like arithmetic coding to convert
           | between your uniformly random-looking bitstream and the
           | heavily weighted choices you make to sample your LLM. See
           | https://en.wikipedia.org/wiki/Arithmetic_coding)
           | 
           | Almost any temperature will work, as long as it is known to
           | both sender and receiver. (The 'temperature' parameter can be
           | used to change the distribution, but it's still effectively a
           | probability distribution at the end. And that's all that's
           | required.)
        
             | textninja wrote:
             | I was imagining the message encoded in clear text, not
             | encrypted form, because given the lengths required to
             | coordinate protocol, keys, weights, and so on, I assumed
             | there would be more efficient ways to disguise a message
             | than a novel form of steganography. As such, I approached
             | it as a toy problem, and considered detection by savvy
             | parties to be a feature, not a bug; I imagined something
             | more like a pirate broadcast than a secure line, and
             | intentionally ignored the presumption about the message
             | being encrypted first.
             | 
             | That being said, yes, some of my assumptions were
             | incorrect, mainly regarding temperature. For practical
             | reasons I was envisioning this being implemented with a
             | third-party LLM (i.e. OpenAI's), but I didn't realize
             | those could have their RNG seeded as well. There is the
             | security/convenience tradeoff to consider, however, and
             | simply setting the temperature to 0 is a lot easier to
             | coordinate between sender and receiver than adding two
             | arbitrary numbers for temperature and seed.
             | 
             | I misspoke, or at least left myself open to
             | misinterpretation when I referred to the LLM's weights as a
             | "secret key"; I didn't mean the weights themselves had to
             | be kept under wraps, but rather I meant that either the
             | weights had to be possessed by both parties (with the
             | knowledge of which weights to use being the "secret") or
             | they'd have to use a frozen version of a third party LLM,
             | in which case the knowledge about which version to use
             | would become the secret.
             | 
             | As for how I might take a first stab at this if I were to
             | try implementing it myself, I might encode the message
             | using a low base (let's say binary or ternary) and make the
             | first most likely token a 0, the second a 1, and so on, and
             | to offset the risk of producing pure nonsense I would
             | perhaps skip tokens with too large a gulf between the
             | probabilities for the 1st and 2nd most common tokens.
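             | 
             | A toy sketch of that scheme, with a word-bigram table
             | trained on a tiny made-up corpus standing in for the
             | LLM: steps with at least two candidate words encode
             | one bit (most likely word = 0, runner-up = 1), forced
             | steps carry no data, and the gap-skipping refinement
             | is left out.
             | 
             |     package main
             | 
             |     import (
             |         "fmt"
             |         "sort"
             |         "strings"
             |     )
             | 
             |     const corpus = "the cat sat on the mat ." +
             |         " the dog sat on the rug . the cat" +
             |         " ran to the dog ."
             | 
             |     // train counts word bigrams in the corpus.
             |     func train() map[string]map[string]int {
             |         bg := map[string]map[string]int{}
             |         t := strings.Fields(corpus)
             |         for i := 0; i+1 < len(t); i++ {
             |             if bg[t[i]] == nil {
             |                 bg[t[i]] = map[string]int{}
             |             }
             |             bg[t[i]][t[i+1]]++
             |         }
             |         return bg
             |     }
             | 
             |     // ranked lists successors of prev, most
             |     // frequent first (ties alphabetical, so
             |     // encoder and decoder agree).
             |     func ranked(bg map[string]map[string]int,
             |         prev string) []string {
             |         cs := bg[prev]
             |         var ws []string
             |         for w := range cs {
             |             ws = append(ws, w)
             |         }
             |         sort.Slice(ws, func(i, j int) bool {
             |             if cs[ws[i]] != cs[ws[j]] {
             |                 return cs[ws[i]] > cs[ws[j]]
             |             }
             |             return ws[i] < ws[j]
             |         })
             |         return ws
             |     }
             | 
             |     func encode(bits []int) []string {
             |         bg := train()
             |         out := []string{"the"}
             |         for i := 0; i < len(bits); {
             |             c := ranked(bg, out[len(out)-1])
             |             next := c[0] // forced: no data
             |             if len(c) >= 2 {
             |                 next = c[bits[i]]
             |                 i++
             |             }
             |             out = append(out, next)
             |         }
             |         return out
             |     }
             | 
             |     func decode(words []string) []int {
             |         bg := train()
             |         var bits []int
             |         for i := 0; i+1 < len(words); i++ {
             |             c := ranked(bg, words[i])
             |             if len(c) < 2 {
             |                 continue // forced step
             |             }
             |             for b, w := range c[:2] {
             |                 if w == words[i+1] {
             |                     bits = append(bits, b)
             |                 }
             |             }
             |         }
             |         return bits
             |     }
             | 
             |     func main() {
             |         msg := []int{1, 0, 1, 1, 0, 0, 1, 0}
             |         text := encode(msg)
             |         fmt.Println(strings.Join(text, " "))
             |         fmt.Println(decode(text)) // msg again
             |     }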
        
               | eru wrote:
               | > I was imagining the message encoded in clear text, not
               | encrypted form, [...]
               | 
               | I was considering that, but I came to the conclusion that
               | it would be an exceedingly poor choice.
               | 
               | Steganography is there to hide that a message has been
               | sent at all. If you make it do double duty as a poor-
               | man's encryption, you are going to have a bad time.
               | 
               | > As such, I approached it as a toy problem, and
               | considered detection by savvy parties to be a feature,
               | not a bug; I imagined something more like a pirate
               | broadcast than a secure line, and intentionally ignored
               | the presumption about the message being encrypted first.
               | 
               | That's an interesting toy problem. In that case, I would
               | still suggest to compress the message, to reduce
               | redundancy.
        
               | textninja wrote:
               | > If you make it do double duty as a poor-man's
               | encryption, you are going to have a bad time.
               | 
               | For the serious use cases you evidently have in mind,
               | yes, it's folly to have it do double duty, but at the end
               | of the day steganography is an obfuscation technique
               | orthogonal to encryption, so the question of whether to
               | use encryption or not is a nuanced one. Anyhow, I don't
               | think it's fair to characterize this elaborate
               | steganography tech as a poor-man's encryption -- LLM
               | tokens are expensive!
        
               | eru wrote:
               | > Anyhow, I don't think it's fair to characterize this
               | elaborate steganography tech as a poor-man's encryption
               | -- LLM tokens are expensive!
               | 
               | I guess it's a "rich fool's encryption".
        
               | textninja wrote:
               | Haha, sure, you can call it that if you want, but foolish
               | is cousin to fun, so one application of this tech would
               | be as a comically overwrought way of communicating
               | subtext to an adversary who may not be able to read
               | between the lines otherwise. Imagine using all this
               | highly sophisticated and expensive technology just to
               | write "you're an asshole" to some armchair intelligence
               | analyst who spent their afternoon and monthly token quota
               | decoding your secret message.
               | 
               | Seed for the message above is 42 by the way.
               | 
               | (Just kidding!)
        
         | __MatrixMan__ wrote:
         | I imagine a social media site full of bots chatting about
         | nonsense. Hidden in the nonsense are humans chatting about
         | different nonsense. This way, server costs get paid for by
          | advertisers, but it's really only bots that see the ads anyway.
        
           | spullara wrote:
           | if the ads aren't effective people won't buy them
        
             | eviks wrote:
             | People have been buying ineffective ads since the invention
             | of ads
        
               | spullara wrote:
               | zero clicks is a little different
        
               | eviks wrote:
               | Bots do click in real ad fraud, so your moved goalpost
               | isn't all that solid
        
               | spullara wrote:
               | sorry, conversions is really what I meant. if the bots
               | are also buying the stuff then it would work.
        
               | otabdeveloper4 wrote:
               | Not really, advertising is really the only field of human
               | endeavour that is both data-driven and results-oriented.
               | 
                | (That still doesn't stop smart people from committing
                | fraud, but that is a different story.)
        
               | benterix wrote:
                | Unfortunately I beg to differ. I worked for several
                | companies where the management clearly saw that the
                | results were very poor (for Facebook ads, for example)
                | but continued to invest because there was a defined
                | budget for it, and so on. It was like this last year
                | and 20 years ago.
        
               | jazzyjackson wrote:
               | these companies should be outcompeted by firms that don't
               | blow a million dollars a month paying out to click
               | fraudsters but alas the market is not perfectly
               | competitive
               | 
               | is it a cargo cult? it works for coca cola so maybe if we
               | just spend a little more we'll see returns...
        
               | benterix wrote:
                | Yes, I feel it might be a cargo cult, at least in
                | part. The argument I usually heard was "But other
                | companies are doing that, too".
        
               | otabdeveloper4 wrote:
               | Yes, most fraud is inside the corporate structure. Not
               | shady "hacker" types in Romania.
        
           | j16sdiz wrote:
            | It's called Twitter.
            | 
            | It's not nonsense, just cat videos and porn.
        
             | __MatrixMan__ wrote:
             | Hmm yes, sensical things those.
             | 
             | Are you proposing that they're really only posted as a
             | medium for encoding something else that we're not privy to?
             | If so, somebody took my idea.
        
         | dools wrote:
         | I think the opponent in the proposed use case for this tool is
         | the gun you're pointing at your foot, and this tool prevents
         | you from pulling the trigger.
        
       | BeefWellington wrote:
       | See also:
       | 
       | - trufflehog: https://github.com/trufflesecurity/trufflehog
       | 
       | - detect-secrets: https://github.com/Yelp/detect-secrets
       | 
       | - semgrep secrets: https://semgrep.dev/products/semgrep-secrets
        | -- (Paid, but may be included in existing licenses in some cases)
        
         | bbno4 wrote:
         | Also see PyWhat for both interesting strings and secrets
         | https://github.com/bee-san/pyWhat
        
         | jonstewart wrote:
          | noseyparker is another good one:
          | https://github.com/praetorian-inc/noseyparker
         | 
         | I think these solutions are all much better for finding secrets
         | than something naive based on entropy. Yes, entropy is more
         | general but these are well established tools that have been
         | through the fire of many, many data sets.
        
       | upg1979 wrote:
       | See also:
       | 
       | https://github.com/gitleaks/gitleaks
        
       | xedeon wrote:
       | ggshield from GitGuardian has been great for us.
       | 
       | Their free service can also auto detect and notify you of leaked
       | secrets, passwords or high entropy lines from your online repos.
       | 
       | https://github.com/GitGuardian/ggshield
        
       | icapybara wrote:
        | Gonna have to explain how a "high entropy line" is calculated
        | and why it might be a secret.
        
         | ngonch wrote:
         | For example: https://complexity-calculator.com/
        
         | daemonologist wrote:
         | Entropy of information is basically how well it can be
         | compressed. Random noise usually doesn't compress much at all
         | and thus has high entropy, whereas written natural language can
         | usually be compressed quite a bit. Since many passwords and
         | tokens will be randomly generated or at least nonsense, looking
         | for high entropy might pick up on them.
         | 
         | This package seems to be measuring entropy by counting the
         | occurrences of each character in each line, and ranking lines
         | with a high proportion of repeated characters as having low
         | entropy. I don't know how closely this corresponds with the
         | precise definition. Source:
         | https://github.com/EwenQuim/entropy/blob/f7543efe130cfbb5f0a...
         | 
         | More:
         | https://en.wikipedia.org/wiki/Entropy_(information_theory)
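          | 
          | To make this concrete, a minimal sketch in Go of that
          | per-line, character-frequency entropy (my reading of the
          | idea, not the tool's exact code):
          | 
          |     package main
          | 
          |     import (
          |         "bufio"
          |         "fmt"
          |         "math"
          |         "os"
          |     )
          | 
          |     // entropy estimates Shannon entropy in bits per
          |     // character from single-character frequencies.
          |     func entropy(line string) float64 {
          |         counts := map[rune]int{}
          |         n := 0
          |         for _, r := range line {
          |             counts[r]++
          |             n++
          |         }
          |         h := 0.0
          |         for _, c := range counts {
          |             p := float64(c) / float64(n)
          |             h -= p * math.Log2(p)
          |         }
          |         return h
          |     }
          | 
          |     func main() {
          |         sc := bufio.NewScanner(os.Stdin)
          |         for sc.Scan() {
          |             if line := sc.Text(); line != "" {
          |                 fmt.Printf("%5.2f  %s\n",
          |                     entropy(line), line)
          |             }
          |         }
          |     }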
        
           | eru wrote:
           | Of course, this heuristic fails for weak passwords.
           | 
           | And it fails for passphrases like 'correct battery horse
           | staple', which have a large enough total entropy to be good
           | passwords, but have a low entropy per character.
        
             | dumbo-octopus wrote:
             | 4 diceware words is hardly a good password. It's ~51 bits
              | of entropy, about the same as 8 random ASCII symbols. It
             | could be trivially cracked in less than an hour. Your
             | average variable name assigned to the result of an object
             | name with a method name called with a couple parameter
             | names has much more entropy.
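              | 
              | For reference, the arithmetic behind those figures
              | (7776 words on the diceware list, 95 printable ASCII
              | characters):
              | 
              |     4 \log_2 7776 \approx 51.7 \text{ bits}
              |     8 \log_2 95 \approx 52.6 \text{ bits}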
        
               | conradludgate wrote:
                | If you can crack a single 52-bit password in an hour,
                | that suggests you can crack a 40-bit password every
                | second. That's about 1 trillion hashes per second.
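                | 
                | Spelling that arithmetic out:
                | 
                |     2^{52} / 3600\,\mathrm{s}
                |       \approx 1.25 \times 10^{12}\ \mathrm{H/s}
                |     2^{40} \approx 1.10 \times 10^{12}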
        
               | otabdeveloper4 wrote:
               | Salts and timeouts made that password cracking technique
                | obsolete anyway.
        
               | dumbo-octopus wrote:
               | Only for online access. Offline access is still a thing,
               | and in no way "obsolete".
        
               | dumbo-octopus wrote:
               | 350B H/s was achieved in 2012 on consumer hardware.
               | That's over 12 years ago, and several lifetimes of GPU
               | improvements ago. 4 diceware words is simply not
               | appropriate for anything remotely confidential, and it is
               | bad for the community to pretend otherwise.
               | 
               | https://theworld.com/~reinhold/dicewarefaq.html
        
               | eru wrote:
               | Just imagine my example used 8 words.
        
               | dumbo-octopus wrote:
               | But it didn't. It perpetuated the exceedingly common myth
               | that 52 bits is somehow enough. This has been considered
               | bad practice for well over a decade now.
               | https://theworld.com/~reinhold/dicewarefaq.html
        
               | baq wrote:
                | So you capitalize random words, use random punctuation
                | and add a number somewhere and you're at 60 bits. Add
                | more for whatever threat model you're trying to be
                | secure against.
               | 
               | https://beta.xkpasswd.net/
        
               | eru wrote:
               | The random punctuation sort-of defeats the point, doesn't
               | it?
               | 
               | Otherwise, I agree.
        
               | baq wrote:
               | Not sure; you can use the same character instead of a
               | space and still get a few bits. Of course different ones
               | would be better, but again, depends on how many bits you
               | actually need.
        
               | eru wrote:
               | I thought the point was to construct a password that's
               | secure enough _and_ easy to remember for humans.
               | 
               | Adding random punctuation helps with the former, but
               | might interfere with the latter. (In the extreme case,
               | you just generate completely random strings character for
               | character. That's the most secure, but the least
               | memorable.)
        
               | baq wrote:
               | > enough
               | 
               | key word here, I think we agree ;)
        
       | hamasho wrote:
       | I didn't know what entropy means in software, so here's the
       | definition[0]:
       | 
        | ----
        | 
        |     Software entropy is a measure of the disorder or
        |     complexity of a software system. It is a natural tendency
        |     for software entropy to increase over time, as new
        |     features are added and the codebase becomes more complex.
        | 
        |     High entropy in software development means that the code
        |     is difficult to understand, maintain, and extend. It is
        |     often characterized by:
        | 
        |     - Duplicated code: The same code or functionality is
        |       repeated in multiple places, which can make it difficult
        |       to find and fix bugs.
        |     - Complex logic: The code is difficult to follow and
        |       understand, which can make it difficult to add new
        |       features or fix bugs without introducing new ones.
        |     - Poor documentation: The code is not well-documented,
        |       which can make it difficult for new developers to
        |       understand and contribute to the codebase.
        |     - Technical debt: The code has been patched and modified
        |       over time without proper refactoring, which can lead to
        |       a tangled and cluttered codebase.
        | 
        |     Low entropy in software development means that the code is
        |     well-organized, easy to understand, and maintain. It is
        |     often characterized by:
        | 
        |     - Well-designed architecture: The code is structured in a
        |       logical way, with clear separation of concerns.
        |     - Consistent coding style: The code follows a consistent
        |       coding style, which makes it easy to read and
        |       understand.
        |     - Comprehensive documentation: The code is well-
        |       documented, with clear explanations of the code's
        |       purpose and functionality.
        |     - Minimal technical debt: The code has been refactored
        |       regularly to remove technical debt, which makes it easy
        |       to add new features and fix bugs without introducing new
        |       ones.
       | 
       | [0] https://www.kisphp.com/python/high-and-low-entropy-in-
       | softwa...
        
         | itemize wrote:
          | thanks for the search. the tool is about textual entropy,
          | however, so I am not sure this definition is applicable
        
           | eru wrote:
           | Yes, it's not applicable. See
           | https://en.wikipedia.org/wiki/Entropy_(information_theory)
           | for something more applicable.
        
           | hamasho wrote:
            | Oh, thanks for pointing that out.
        
       | tonyabracadabra wrote:
        | interesting! could a similar measurement be applied to finding
        | redundant code (i.e. low entropy) with extra work?
        
       | cowsaymoo wrote:
       | I transcend this problem by making all my database passwords
       | 'abcd'
        
         | kgeist wrote:
          | The tool found "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"
          | in our codebase as a high entropy line :)
        
           | g15jv2dp wrote:
           | Well, it is...
        
             | saurik wrote:
              | I mean, it certainly has a low Kolmogorov complexity
              | (which is what I would really want to be measuring
              | somehow for this tool... note that I am not claiming
              | that is possible: just an ideal); I am unsure how that
              | affects the related bounds on Shannon entropy, though.
        
             | jraph wrote:
             | ...a very verbose way to match alphanumeric characters :-)
        
             | ngneer wrote:
             | Then use it as your password ;)
        
           | josephg wrote:
           | You can use LLMs as compressors, and I wonder how it would go
           | with that.
           | 
           | The approach is simple: Turn the file into a stream of
           | tokens. For each token, ask a language model to generate the
           | full set of predictions based on context, and sort based on
           | likelihood. Look where the actual token appears in the sorted
           | list. Low entropy symbols will be near the start of the list,
           | and high entropy tokens near the end.
           | 
           | I suspect most language models would deal with your alphabet
           | example just fine, while still correctly spotting passwords
           | and API keys. It would be a fun experiment to try!
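            | 
            | As a toy sketch of that ranking idea, a character-
            | bigram table can stand in for the language model (a
            | real LLM would predict far better); text is scored
            | by the average rank of each transition, with a fixed
            | penalty for transitions never seen in the corpus:
            | 
            |     package main
            | 
            |     import (
            |         "fmt"
            |         "sort"
            |     )
            | 
            |     // rankTable maps each byte to its successors
            |     // in the corpus, most frequent first (ties
            |     // broken by byte value, deterministically).
            |     func rankTable(corpus string) map[byte][]byte {
            |         c := map[byte]map[byte]int{}
            |         for i := 0; i+1 < len(corpus); i++ {
            |             a, b := corpus[i], corpus[i+1]
            |             if c[a] == nil {
            |                 c[a] = map[byte]int{}
            |             }
            |             c[a][b]++
            |         }
            |         t := map[byte][]byte{}
            |         for a, succ := range c {
            |             var bs []byte
            |             for b := range succ {
            |                 bs = append(bs, b)
            |             }
            |             sort.Slice(bs, func(i, j int) bool {
            |                 si, sj := succ[bs[i]], succ[bs[j]]
            |                 if si != sj {
            |                     return si > sj
            |                 }
            |                 return bs[i] < bs[j]
            |             })
            |             t[a] = bs
            |         }
            |         return t
            |     }
            | 
            |     // surprise averages the rank of each byte
            |     // transition; unseen transitions get a penalty.
            |     func surprise(t map[byte][]byte, s string) float64 {
            |         total, n := 0, 0
            |         for i := 0; i+1 < len(s); i++ {
            |             rank := 64
            |             for r, b := range t[s[i]] {
            |                 if b == s[i+1] {
            |                     rank = r
            |                     break
            |                 }
            |             }
            |             total += rank
            |             n++
            |         }
            |         return float64(total) / float64(n)
            |     }
            | 
            |     func main() {
            |         corpus := "the quick brown fox jumps over" +
            |             " the lazy dog and the quick brown fox" +
            |             " jumps over the lazy dog again"
            |         t := rankTable(corpus)
            |         for _, s := range []string{
            |             "the lazy dog jumps again",
            |             "hJ6&:aQ2xZ9mPw4K",
            |         } {
            |             fmt.Printf("%5.1f  %s\n",
            |                 surprise(t, s), s)
            |         }
            |     }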
        
         | randomtoast wrote:
          | Reminds me of https://xkcd.com/936/ I think "correct horse
          | battery staple" has low entropy, since it is just ordinary-
          | looking words (strings).
        
           | josephg wrote:
           | A quick Google search suggests English has about 10 bits of
           | entropy per word. Having a long password like that can still
           | have high total entropy I suppose, but it has a low entropy
           | _density_.
        
             | kqr wrote:
              | Maybe 10 bits is the average over the dictionary - which
              | is what matters here - but over normal text it is
              | significantly less. Our best current estimate for
              | relatively high-level text (texts published by the EU)
              | is 6 bits per word[1].
             | 
             | However, as our methods of predicting text improve, this
             | number is revised down. LLMs ought to have made a serious
             | dent in it, but I haven't looked up any newer results.
             | 
              | Anyway, all of this is to say that which words are
              | chosen matters, but how they are put together matters
              | perhaps more.
             | 
             | [1]: http://arxiv.org/pdf/1606.06996
        
         | nvy wrote:
         | Username: postgres
         | 
         | Password: postgres
        
       | krick wrote:
        | Are there any good posts about the use of entropy for tasks
        | like this? I have been wondering for quite some time how
        | people actually use it and whether it is at all effective, but
        | never actually got around to investigating the problem myself.
        | 
        | First of all, how to define "entropy" for text is a bit
        | unclear in the first place. Here it's as simple as
        | `-Sum(x log(x))` where x = countOccurrences(char) /
        | len(text). And that raises a lot of questions about how well
        | this actually works. How long does a string need to be for
        | this to work? Is there a constant entropy for natural
        | languages? Is there a better approach? I mean, it seems there
        | _must_ be: "obviously" "vorpal" must have lower "entropy" than
        | "hJ6&:a". You and I both "know" that because 1) the latter
        | "seems" to use a much larger character set than natural
        | language; 2) even if it didn't, the ordering of characters
        | matters: the former just "sounds" like a real word, despite
        | being made up by Carroll. Yet this "entropy" everybody seems
        | to use has no idea about any of it. Both will have exactly the
        | same "entropy". So, ok, maybe this works well enough for yet-
        | another-github-password-searcher. But is there anything
        | better? Is there a more meaningful metric of randomness for
        | text?
        | 
        | Dozens of projects like this, everybody using "entropy" as if
        | it's something obvious, but I've never seen proper research on
        | the subject.
        
         | hackinthebochs wrote:
         | Entropy is a measure of complexity or disorder of a signal. The
         | interesting part is that the disorder is with respect to the
         | proper basis or dictionary. Something can look complex in one
         | encoding but be low entropy in the right encoding. You need to
         | know the right basis, or figure it out from the context, to
         | accurately determine the entropy of a signal. A much stronger
         | way of building a tool like the OP is to have a few pre-
         | computed dictionaries for a range of typical source texts
         | (source code, natural language), then encode the string against
         | each dictionary, comparing the compressibility of the string. A
         | high entropy string like a secret will compress poorly against
         | all available dictionaries.
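          | 
          | A minimal sketch of this in Go, with DEFLATE's preset-
          | dictionary support (flate.NewWriterDict) standing in for
          | real precomputed dictionaries; the toy dictionaries and
          | the fake token below are made up:
          | 
          |     package main
          | 
          |     import (
          |         "bytes"
          |         "compress/flate"
          |         "fmt"
          |     )
          | 
          |     // size returns the DEFLATE-compressed size of s
          |     // against a preset dictionary.
          |     func size(s string, dict []byte) int {
          |         var buf bytes.Buffer
          |         // error only possible for a bad level
          |         w, _ := flate.NewWriterDict(&buf,
          |             flate.BestCompression, dict)
          |         w.Write([]byte(s))
          |         w.Close()
          |         return buf.Len()
          |     }
          | 
          |     func main() {
          |         dicts := map[string][]byte{
          |             "code": []byte("func return if err" +
          |                 " != nil { } var := for range"),
          |             "english": []byte("the and of to in" +
          |                 " that it is was for on are with"),
          |         }
          |         for _, s := range []string{
          |             "for i := range xs { sum += xs[i] }",
          |             // made-up token, not a real secret
          |             "ghp_x7Kq9mPz2LtVw4Rb8nYc3JdF6hSa1Qe",
          |         } {
          |             fmt.Println(s)
          |             for name, d := range dicts {
          |                 fmt.Printf("  vs %-8s %d bytes\n",
          |                     name, size(s, d))
          |             }
          |         }
          |     }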
        
           | jazzyjackson wrote:
           | bookmarking to think about later... does this hold for
           | representing numbers as one base compared to another?
           | 
           | Regarding a prime as having higher entropy / less structure
           | than say a perfect square or highly divisible number
           | 
            | a prime is a prime in any base, but the number of divisors
            | will differ in non-primes; if the number is divisible by
            | the base then it may appear to have more structure (a
            | smaller function necessary to derive it, Kolmogorov
            | style).
            | 
            | does prime factorization have anything to do with this? i
            | can almost imagine choosing a large non-prime whose
            | divisibility is only obvious with a particular base, such
            | that the base becomes the secret key - the base of a
            | number is basically specifying your dictionary, no?
        
             | eigenket wrote:
             | I don't think there's any interesting difference with
             | different bases. Usually the base you represent stuff in is
             | a relatively small number (because using very large bases
             | is already wildly inefficient). I think it only usually
             | makes sense to consider constant or logarithmic bases. If
             | your base is scaling linearly with your number then things
             | are going to be weird.
             | 
             | The problem of finding factors is only complex when you're
             | asking about relatively big factors. If you're looking for
             | constant or log sized factors you can just do trial
             | division and find them.
        
             | maCDzP wrote:
             | Also bookmarking to think about it.
             | 
              | My mind drifted towards the Fourier transform. Using the
              | transform as a way of describing a system with less
              | entropy?
             | 
             | Or am I butchering all of mathematics by making this
             | comparison?
        
               | hackinthebochs wrote:
               | There's some precedence for that. I'm pretty sure
               | wavelets are SOTA for compression.
        
             | hackinthebochs wrote:
             | Changing the base of number representation with a random
             | basis feels like XORing a string with a random string,
             | which is to say you're adding entropy equal to the random
             | string. My thinking is that for any number representation
             | M, you can get any other number representation N given a
             | well-chosen base. So when presented with the encoded N, the
             | original number could be any other number with the same
             | number of digits. But once you put reasonable bounds on the
             | base, you lose that flexibility and end up adding
             | negligible entropy.
        
               | GTP wrote:
               | > So when presented with the encoded N, the original
               | number could be any other number with the same number of
               | digits
               | 
               | Not necessarily the same number of digits, when changing
               | the base the number of digits may change as well. E.g.,
               | decimal 8 becomes 1000 in binary.
        
             | GTP wrote:
             | > the number of divisors will differ in non-primes
             | 
             | Could you please present an example of this?
        
           | krick wrote:
           | I only briefly browsed the code, but this seems to be roughly
           | what yelp/detect-secrets does.
           | 
            | Anyway, that doesn't really answer my question. To
            | summarize the answers in this thread, I think PhilipRoman
            | has captured the essence of it: strictly speaking, the
            | idea of the entropy of a _known_ string is nonsense. So,
            | as I suspected, the information-theoretic definition isn't
            | meaningfully applicable to the problem. And as other
            | commenters like you mentioned, what we are _really_ trying
            | to measure is basically Kolmogorov complexity, which,
            | strictly speaking, is incomputable, but measuring the
            | compression rate for some well-known popular compression
            | algorithm (allegedly) seems to be a good enough estimate,
            | empirically.
           | 
            | But I think it's still an interesting linguistic question.
            | Meaningful or not, it's well defined: so does it appear to
            | work? Are there known constants for different kinds of
            | text for any of these (or other) metrics? I would suspect
            | this has been explored already, but neither I nor anybody
            | in this thread has apparently ever stumbled upon such an
            | article.
        
         | wwalexander wrote:
         | The Kolmogorov complexity of an arbitrary string is
         | uncomputable.
        
         | PhilipRoman wrote:
          | The entropy of a particular string isn't a rigorous
          | mathematical idea, since by definition a string which is
          | known can only take one value; the "entropy" is therefore
          | zero bits. The reason why we can distinguish non-random data
          | from random is that only a small subset of all possible
          | states are considered useful for humans, and since we have
          | an idea what that subset looks like, we can try to estimate
          | what process was used to generate a particular string.
          | 
          | There are of course statistical tests like
          | https://en.wikipedia.org/wiki/Diehard_tests, which are good
          | enough for distinguishing low entropy and high entropy data,
          | but current pseudo-random number generators have no problem
          | passing all of those, even though their actual "entropy" is
          | just the seed plus approximately the complexity of the
          | algorithm.
        
           | josephg wrote:
           | If you're looking for a rigorous mathematical idea, what
           | people are trying to measure is the Kolmogorov complexity of
           | the code. Measuring the compressed length is a rough estimate
           | of that value.
           | 
           | https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
        
             | PhilipRoman wrote:
              | Yes, although (and here my understanding of Kolmogorov
              | complexity ends) it still depends heavily on the choice
              | of language, and it seems to me like "aaaaaaaaa" is only
              | less complex than "pSE+4z*K58" due to assuming a sane,
              | human-centric language which is very different from the
              | "average" of all possible languages. Which then leads me
              | to wonder how to construct an adversarial Turing-
              | complete language which has unintuitive Kolmogorov
              | complexities.
        
               | kqr wrote:
               | Kolmogorov complexity conventionally refers to the Turing
               | machine as the base for implementation. This indeed makes
               | repeated letters significantly less complex than that
               | other string. (If you want intuition for how much code is
               | needed to do something on a Turing machine, learn and
               | play around a bit with Brainfuck. It's actually quite
               | nice for that.)
        
               | josephg wrote:
               | > due to assuming a sane, human-centric language
               | 
                | There's no requirement that the K-complexity is
                | measured in a human-centric language. Arguably all
                | compression formats are languages too, which can be
                | executed to produce the decompressed result. They are
                | not designed to be human-centric at all, and yet they
                | do a surprisingly decent job of providing an estimate
                | (well, an upper bound) on Kolmogorov complexity - as
                | we can see in this program.
        
       | g15jv2dp wrote:
        | Why would I need to install Go to run this tool? I thought one
        | advantage of Go was that devs could just distribute a single
        | binary file that works...
        
         | benterix wrote:
          | Because it's a security tool, so trusting a binary upfront
          | defeats the purpose. With source you at least have the
          | option to inspect what it really does.
        
           | menacingly wrote:
           | does the stated purpose of the tool influence whether or not
           | you can trust it?
        
             | spoonjim wrote:
             | If you're trying to improve the security of your product by
             | running random binaries from the Internet you're going to
             | have a bad time
        
               | saagarjha wrote:
               | That's how most people run compilers
        
               | benterix wrote:
                | This is argumentum ad absurdum - there is a reason why
                | trusting your kernel and compiler is a reasonable
                | compromise, even though there might be security issues
                | in them, while trusting random pieces of software
                | downloaded from the Internet is not.
        
               | Ensorceled wrote:
                | Wait ... you download random compilers from the
                | internet? Or are you asserting equivalence between
                | getting Go from Google or Xcode from Apple and a
                | random Homebrew install?
        
             | alias_neo wrote:
             | I think that question is a little backwards.
             | 
             | Certain tools are more likely to be used by people working
             | in spaces where they should/must be less trusting.
             | 
             | If there was a tool (there is) to scan my platform
             | deployment against some NCSC/NSA guidance for platform
             | security, and I wanted to use it, I'm likely operating in a
             | space that should consider being cautious about running
             | random tools I find on the internet.
        
           | g15jv2dp wrote:
            | Uh? OP just released a Docker image and wants to release a
            | Homebrew thingy. Even assuming what you say is somehow
            | sensible, it's not the reason, no. You're just grasping at
            | straws.
        
         | lanfeust wrote:
          | I'd love to have it on _homebrew_ but my PR was denied, so
          | I'll have to create my own brew tap or convince them to
          | accept it.
          | 
          | I'll also create a Docker image.
          | 
          | I just didn't expect this much popularity, so the repo isn't
          | 100% ready, to be honest
        
           | drexlspivey wrote:
           | Making a tap is super easy, you just upload a file with 5 LoC
           | to github. I wouldn't even bother with brew core.
        
             | lanfeust wrote:
             | Oh ok I'll try then
        
         | lanfeust wrote:
            | The Docker container is now ready to use and documented on
            | the home page.
        
           | diggan wrote:
           | Just awaiting the Kubernetes setup/Helm charts now and soon
           | almost anyone can use it!
        
       | kqr wrote:
        | Interesting. If I had to do this, I would have done something
        | like
        | 
        |     perl -lne 'next unless $_;
        |         $z = qx(echo "$_" | gzip | wc -c);
        |         printf "%5.2f    %s\n", $z/length($_), $_'
        | 
        | on the principle that high entropy means it compresses badly.
        | However, that uses each line as the dictionary, rather than
        | the entire file, so it has a little trouble with very short
        | lines, which compress badly.
        | 
        | It did react to this line
        | 
        |     return map { $_ > 1 ? 1 : ($_ < 0 ? 0 : $_) } @vs;
        | 
        | which is valid code but indeed seems kind of high in entropy.
        | I was also able to fool it into not detecting a high-entropy
        | line by adding a comment of natural English to it.
       | 
       | I'm on the go but it would be interesting to see comparisons
       | between the Perl command and this tool. The benefit of the Perl
       | command is that it would run out of the box on any non-Windows
       | machine so it might not need to be as powerful to gain adoption.
        
         | blixt wrote:
          | I guess you could take all lines in the file except the one
          | you're testing and measure the compressed size, then add the
          | line back and measure again. The delta should then be more
          | fair. You could even do this by concatenating all code files
          | and then testing line by line across the entire repo, but
          | that would probably be too slow.
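          | 
          | A rough sketch of this in Go, assuming gzip as the
          | compressor and a made-up four-line file; the line with
          | the fake key should show the largest delta:
          | 
          |     package main
          | 
          |     import (
          |         "bytes"
          |         "compress/gzip"
          |         "fmt"
          |         "strings"
          |     )
          | 
          |     // gzSize returns the gzip-compressed size of s.
          |     func gzSize(s string) int {
          |         var buf bytes.Buffer
          |         w := gzip.NewWriter(&buf)
          |         w.Write([]byte(s))
          |         w.Close()
          |         return buf.Len()
          |     }
          | 
          |     func main() {
          |         // sample file with a made-up key
          |         file := "package main\n" +
          |             "import \"fmt\"\n" +
          |             "const k = \"q9X7kP2mZv4LtRb8nYc3\"\n" +
          |             "func main() { fmt.Println(k) }\n"
          |         whole := gzSize(file)
          |         lines := strings.Split(file, "\n")
          |         for i, line := range lines {
          |             if line == "" {
          |                 continue
          |             }
          |             // compressed size without this line
          |             rest := strings.Join(append(
          |                 append([]string{}, lines[:i]...),
          |                 lines[i+1:]...), "\n")
          |             fmt.Printf("%3d  %s\n",
          |                 whole-gzSize(rest), line)
          |         }
          |     }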
        
         | GuB-42 wrote:
         | I would use a better compressor than gzip but I have done this
         | trick several times.
         | 
         | xz or zstd may be better choices, or you can look at Hutter
         | Prize [1] winners for best compression and therefore best
         | entropy estimate.
         | 
         | [1] http://prize.hutter1.net/
        
           | nequo wrote:
           | > best compression and therefore best entropy estimate
           | 
            | That's a good point. But the Hutter Prize is for
            | compressing a 1 GB file. On inputs as short as a line of
            | code, gzip doesn't do so badly. For a longer line:
            | 
            |     $ INPUT='    bool isRegPair() const { return kind() == RegisterPair || kind() == LateRegisterPair || kind() == SomeLateRegisterPair; }'
            |     $ echo "$INPUT" | gzip | wc -c
            |     95
            |     $ echo "$INPUT" | bzip2 | wc -c
            |     118
            |     $ echo "$INPUT" | xz -F xz | wc -c
            |     140
            |     $ echo "$INPUT" | xz -F lzma | wc -c
            |     97
            |     $ echo "$INPUT" | zstd | wc -c
            |     92
            | 
            | For a shorter line:
            | 
            |     $ INPUT='    ASSERT(regHi().isGPR());'
            |     $ echo "$INPUT" | gzip | wc -c
            |     48
            |     $ echo "$INPUT" | bzip2 | wc -c
            |     73
            |     $ echo "$INPUT" | xz -F xz | wc -c
            |     92
            |     $ echo "$INPUT" | xz -F lzma | wc -c
            |     51
            |     $ echo "$INPUT" | zstd | wc -c
            |     46
        
         | josephg wrote:
         | I learned Go many years ago doing some advent of code problems.
         | As I solved each problem, my housemate pestered me for a look
         | and then rewrote my solutions (each needing 10-50 lines of go)
         | into Ruby one-liners. All the while making fun of Go and my
          | silly programs. I wasn't intending to, but I ended up
          | learning a lot of Ruby that night too.
         | 
          | Thank you for continuing the tradition.
        
         | crazygringo wrote:
          | Are there any command-line tools for zip or similar that
          | allow you to predefine a dictionary over one or more files,
          | and then use that dictionary to compress small files?
          | 
          | That would require the dictionary as a separate input when
          | decompressing, of course.
        
           | kqr wrote:
           | gzip (or really DEFLATE) does actually come with a small
           | predefined dictionary (the "fixed Huffman codes" in the RFC)
           | which is somewhat optimised for latin letters in UTF-8, but I
           | have not verified that this is indeed what ends up being used
           | when compressing individual lines of source code.
        
       | crazypython wrote:
       | It would be interesting to see a variant of this that used a
       | small language model to measure entropy.
        
         | saagarjha wrote:
          | Why would you do that when measuring entropy is easy to do
          | with a normal program?
        
       | saagarjha wrote:
       | I assume this will have a bad time on compressed files?
        
         | lanfeust wrote:
          | The .zip extension is ignored by default, along with other
          | binary formats :)
        
           | saagarjha wrote:
           | Right but like .tar.gz, etc. are also a thing
        
             | lanfeust wrote:
              | You can just add your extensions to ignore with
              | --ignore-ext. But I'll add .tar.gz and .tar.bz2 since
              | they are widely used.
        
               | frumiousirc wrote:
               | Or, have the tool recursively read the .tar files'
               | contents.
        
       | seethishat wrote:
       | This reminds me of the program 'ent' (which I have used for a
       | very long time)
       | 
       | https://fourmilab.ch/random/
        
       | MarkMarine wrote:
       | Another way to do this would be to compress the file and compare
       | the compressed size to the uncompressed size.
       | 
        | Encrypted files do not compress well compared to code. I saw a
        | PhD thesis that postulated an inverse ratio of compression
        | efficiency to performance in data mining; this would be the
        | opposite.
        
       | p0w3n3d wrote:
       | xkcd.com/936/
        
       | blixt wrote:
        | I guess a language model like Llama 3 could model surprise on
        | a token-by-token basis and detect the areas that are most
        | surprising, i.e. highest entropy. As one example mentioned,
        | the entire alphabet may have high entropy in some regards, but
        | it should be very unsurprising to a code-aware language model
        | that a codebase contains the Base62 alphabet as a constant.
        
       | weipe-af wrote:
       | It would be useful if it also trawled through the full git
       | history of the project - a secret could have been checked in and
       | later removed, but still exist in the history.
        
       | thomascountz wrote:
       | Thank you DrJones for asking what a high entropy string is
       | several years ago[0] and linking to a good article on it.[1]
       | 
       | [0] https://news.ycombinator.com/item?id=13304641
       | 
       | [1] https://www.splunk.com/en_us/blog/security/random-words-
       | on-e...
        
       | baryphonic wrote:
       | Neat tool.
       | 
        | Would be cool if this CLI had a flag to read .gitignore and
        | exclude all of the ignored files automatically.
       | 
       | Also it might be cool to have different strategies for detecting
       | secrets, e.g. Kolmogorov complexity as other comments have noted.
        
       ___________________________________________________________________
       (page generated 2024-06-05 23:02 UTC)