[HN Gopher] Reproducing Hacker News writing style fingerprinting
       ___________________________________________________________________
        
       Reproducing Hacker News writing style fingerprinting
        
       Author : grep_it
       Score  : 77 points
       Date   : 2025-04-16 13:57 UTC (3 hours ago)
        
 (HTM) web link (antirez.com)
 (TXT) w3m dump (antirez.com)
        
       | tptacek wrote:
       | This is an interesting and well-written post but the data in the
       | app seems pretty much random.
        
         | antirez wrote:
          | Thank you, tptacek. Thanks to the Internet Archive's cached
          | results for "pg" from the post of 3 years ago, I was able to
          | verify that the entries for "pg" are quite similar. Consider
          | that it captures just the statistical patterns in very common
          | words, so you are not likely to see users that you believe are
          | "similar" to yourself. Notably: montrose may really be a
          | secondary account of PG, and was also found as a cross
          | reference in the original work of three years ago.
         | 
          | Also note that vector similarity is not reciprocal: one item's
          | top-scoring match may itself have many other items that are
          | nearer, as in a 2D space where you have a cluster of points
          | and another point nearby but a bit apart from it.
         | 
          | Unfortunately I don't think this technique works very well for
          | discovering _actual_ duplicate accounts, because people often
          | post just a few comments from fake accounts. So there is not
          | enough data, except when someone consistently uses another
          | account to cover their identity.
         | 
         | EDIT: at the end of the post I added the visual representations
         | of pg and montrose.
        
           | PaulHoule wrote:
            | If you want to do document similarity ranking in general,
            | finding nearby points in word-frequency space works, but not
            | as well as (1) applying an autoencoder or another
            | dimensionality-reduction technique to the vectors, or (2)
            | running a BERT-like model and pooling over the documents [1].
           | 
            | I worked on a search engine for patents that used the first
            | approach; our evaluations showed it was much better than
            | other patent search engines, and we had no trouble selling
            | it because customers could feel the difference in demos.
           | 
            | I tried dimensionality reduction on the BERT vectors, and in
            | every case I tried it made relevance worse. (BERT has
            | already learned a lot that gets thrown away; there isn't
            | more to learn from my particular documents.)
           | 
            | I don't think either of these helps with "finding articles
            | authored by the same person", because that assumes the same
            | person always uses the same words, whereas documents about
            | the same topic use synonyms that will be turned up by (1)
            | and (2). There is a big literature on determining authorship
            | based on style:
           | 
           | https://en.wikipedia.org/wiki/Stylometry
           | 
           | [1] With https://sbert.net/ this is so easy.
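            | 
            | For (2), a rough sketch with sbert; the model name and the
            | cosine scoring below are just illustrative choices:
            | 
            |   from sentence_transformers import SentenceTransformer, util
            | 
            |   docs = [
            |       "Patent on vector quantization for audio codecs.",
            |       "Method for compressing speech signals with VQ.",
            |       "A hinge mechanism for folding smartphones.",
            |   ]
            | 
            |   # any pretrained sentence-embedding model works here
            |   model = SentenceTransformer("all-MiniLM-L6-v2")
            |   emb = model.encode(docs, normalize_embeddings=True)
            | 
            |   # cosine similarity between every pair of documents
            |   print(util.cos_sim(emb, emb))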
        
             | antirez wrote:
              | Indeed, but my problem is: all those vector databases
              | (including Redis!) are always thought of as useful only in
              | the context of learned embeddings: BERT, CLIP, ... But I
              | really wanted to show that vectors are very useful and
              | interesting outside that space. Now, I also like encoders
              | very much, but I have the feeling that Vector Sets, as a
              | data structure, need to be presented as a general tool. So
              | I cherry-picked a use case that I liked and where neural
              | networks were not involved. Btw, Redis Vector Sets
              | natively support dimensionality reduction by random
              | projection in case the vector is too redundant. Yet, in my
              | experiments, I found that binary quantization (also
              | supported) is a better way to save CPU/space than RP.
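              | 
              | Just to make the common-words idea concrete, a rough
              | sketch (the word list, tokenization and similarity below
              | are invented for illustration; the real data uses the top
              | ~350 words):
              | 
              |   from collections import Counter
              |   import math
              | 
              |   # tiny illustrative list of very common words
              |   TOP_WORDS = ["the", "of", "and", "to", "that", "but",
              |                "because", "however", "actually", "just"]
              | 
              |   def fingerprint(text):
              |       counts = Counter(text.lower().split())
              |       total = sum(counts.values()) or 1
              |       # relative frequency of each common word
              |       return [counts[w] / total for w in TOP_WORDS]
              | 
              |   def cosine(a, b):
              |       dot = sum(x * y for x, y in zip(a, b))
              |       na = math.sqrt(sum(x * x for x in a))
              |       nb = math.sqrt(sum(x * x for x in b))
              |       return dot / (na * nb) if na and nb else 0.0
              | 
              |   a = fingerprint("I think that the idea is just good")
              |   b = fingerprint("That is actually the point, really")
              |   print(cosine(a, b))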
        
       | formerly_proven wrote:
       | I'm surprised no one has made this yet with a clustered
       | visualization.
        
         | antirez wrote:
          | Redis supports random projection to a lower dimensionality,
          | but the reality is that projecting a 350d vector into 2d,
          | while nice, does not remotely capture what is going on. Still,
          | it is a nice idea to try some time. However, I would do it
          | with more than the 350 top words: when I used 10k words it
          | strongly captured interests rather than style, so a 2D
          | projection of that would be much more interesting, I believe.
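          | 
          | If someone wants to try the 2D projection anyway, a minimal
          | sketch with scikit-learn's random projection (random data
          | stands in for the real style vectors):
          | 
          |   import numpy as np
          |   from sklearn.random_projection import GaussianRandomProjection
          | 
          |   X = np.random.rand(100, 350)  # stand-in 350d style vectors
          | 
          |   # project down to 2 dimensions with a random matrix
          |   proj = GaussianRandomProjection(n_components=2)
          |   xy = proj.fit_transform(X)
          |   print(xy[:3])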
        
         | layer8 wrote:
         | Given that some matches are "mutual" and others are not, I
         | don't see how that could translate to a symmetric distance
         | measure.
        
           | antirez wrote:
            | Imagine the 2D case: it has the same property!
            | 
            | You have three points close together, and a fourth a bit
            | more distant. Point 4's best match is point 1, but point 1's
            | best matches are points 2 and 3.
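            | 
            | A toy version of that picture, with made-up coordinates:
            | 
            |   import math
            | 
            |   points = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (-4, 0)}
            | 
            |   def dist(p, q):
            |       return math.dist(points[p], points[q])
            | 
            |   def nearest(p):
            |       # closest other point by Euclidean distance
            |       return min((q for q in points if q != p),
            |                  key=lambda q: dist(p, q))
            | 
            |   print(nearest(4))  # -> 1: point 4's best match is 1
            |   print(nearest(1))  # -> 2: point 1's best match is not 4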
        
             | layer8 wrote:
             | Good point, but the similarity score between mutual matches
             | is still different, so it doesn't seem to be a symmetric
             | measure?
        
         | PaulHoule wrote:
         | Personally I like this approach a lot
         | 
         | https://scikit-learn.org/stable/modules/generated/sklearn.ma...
         | 
         | I think other methods are more fashionable today
         | 
         | https://scikit-learn.org/stable/modules/manifold.html
         | 
          | particularly multi-dimensional scaling, but personally I think
          | tSNE plots are less pathological (they don't have as many of
          | those crazy cusps that make me think the data is being
          | projected down from a higher-dimensional surface which is
          | near-parallel to the page).
         | 
         | After processing documents with BERT I really like the clusters
         | generated by the simple and old k-Means algorithm
         | 
         | https://scikit-learn.org/stable/modules/generated/sklearn.cl...
         | 
          | It has the problem that it always finds 20 clusters if you set
          | k=20, and a cluster which really oughta be one big cluster
          | might get treated as three little clusters, but the clusters I
          | get from it reflect the way I see things.
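          | 
          | A minimal sketch of that combination (random data stands in
          | for the BERT vectors; the parameters are just illustrative):
          | 
          |   import numpy as np
          |   from sklearn.cluster import KMeans
          |   from sklearn.manifold import TSNE
          | 
          |   # stand-in for document vectors: 200 points, 350 dims
          |   X = np.random.rand(200, 350)
          | 
          |   # k-means always returns exactly n_clusters clusters
          |   labels = KMeans(n_clusters=20, n_init=10).fit_predict(X)
          | 
          |   # t-SNE gives 2D coordinates for plotting the clusters
          |   xy = TSNE(n_components=2, perplexity=30).fit_transform(X)
          | 
          |   print(xy[:3], labels[:3])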
        
       | giancarlostoro wrote:
        | I tried my name, and I don't think a single "match" is any of my
        | (very rarely used) throwaway alts ;) I guess I have a few people
        | I talk like?
        
         | antirez wrote:
          | When they are rarely used (a small number of total words
          | produced), they don't have meaningful statistical info for a
          | match, unfortunately. A few users here reported finding actual
          | duplicated accounts they used in the past.
        
         | delichon wrote:
         | I got 3 correct matches out of 20, and I've had about 6
         | accounts total (using one at a time), with at least a fair
         | number of comments in each. I guess that means that my word
         | choices are more outliers than yours or there is just more to
         | match. So it's not really good enough to reliably identify alt
         | accounts, but it is quite suggestive.
        
           | giancarlostoro wrote:
           | I think if you rule out insanely common words, it might get
           | scary accurate.
        
             | lolinder wrote:
             | Actually, the way that these things work is usually by
             | focusing exclusively on the usage patterns of very common
             | (top 500) words. You get better results by ignoring content
             | words in favor of the linking words.
        
       | 38 wrote:
       | this got two accounts that I used to use
        
         | antirez wrote:
         | Great! Thanks for the ACK.
        
       | Boogie_Man wrote:
       | No matches higher than .7something and no mutual matches let's go
       | boys I'm a special unique snowflake
        
       | weinzierl wrote:
       | How does it find the high similarity between "dang" and "dangg"
       | when the "dangg" account has no activity (like comments) at all?
       | 
       | https://antirez.com/hnstyle?username=dang&threshold=20&actio...
        
         | antirez wrote:
          | Probably it used to have activity when the database was
          | created. Then the comments got removed.
        
       | hammock wrote:
       | The "analyze" feature works pretty well.
       | 
       | My comments underindex on "this" - because I have drilled into my
       | communication style never to use pronouns without clear one-word
       | antecedents, meaning I use "this" less frequently that I would
       | otherwise.
       | 
       | They also underindex on "should" - a word I have drilled OUT of
       | my communication style, since it is judgy and triggers a
       | defensive reaction in others when used. (If required, I prefer
       | "ought to")
       | 
       | My comments also underindex on personal pronouns (I, my). Again,
       | my thought on good, interesting writing is that these are to be
       | avoided.
       | 
       | In case anyone cares.
        
         | antirez wrote:
          | That's very interesting, as I noticed that certain outliers
          | did indeed seem to be conscious attempts.
        
         | croemer wrote:
         | Since you seem to care about your writing, I'm wondering why
         | you used "that" here?
         | 
         | > I use "this" less frequently that I would otherwise
         | 
         | Isn't it "less than" as opposed to "less that"?
        
           | hammock wrote:
           | Typo. Good catch
        
         | Joker_vD wrote:
         | > I prefer "ought to"
         | 
         | I too like when others use it, since a very easy and pretty
         | universal retort against "you ought to..." is "No, I don't owe
         | you anything".
        
         | jcims wrote:
         | I (also?) felt the 'words used less often' were much easier to
         | connect to as a conscious effort. I pointed chatgpt to the
         | article and pasted in my results and asked it what it could
         | surmise about my writing style based on that. It probably
         | connected about as well as the average horoscope but was still
         | pretty interesting!
        
         | tobr wrote:
         | > Again, my thought on good, interesting writing is that these
         | are to be avoided.
         | 
         | You mean, "I think this should be avoided"? ;)
        
           | hammock wrote:
           | Nice one _high five_
        
       | alganet wrote:
        | Cool tool. It's a shame I don't have other accounts to test it
        | with.
        | 
        | It's also a tool for wannabe impersonators to hone their writing
        | style mimicry skills!
        
         | shakna wrote:
         | I don't have other accounts, but still matched at 85+% accuracy
         | for a half dozen accounts. Seems I don't have very original
         | thoughts or writing style.
        
       | andrewmcwatters wrote:
       | Well, well, well, cocktailpeanuts. :spiderman_pointing:
       | 
       | I suspect, antirez, that you may have greater success removing
       | some of the most common English words in order to find truly
       | suspicious correlations in the data.
       | 
        | cocktailpeanuts and I, for example, mutually share some words
        | like:
       | 
       | because, people, you're, don't, they're, software, that, but,
       | you, want
       | 
       | Unfortunately, this is a forum where people will use words like
       | "because, people, and software."
       | 
       | Because, well, people here talk about software.
       | 
       | <=^)
       | 
       | Edit: Neat work, nonetheless.
        
         | alganet wrote:
         | That seems to be a misconception.
         | 
         | The usage frequency of simple words is a powerful tell.
        
           | cratermoon wrote:
           | Indeed, some writing styles make frequent use of words like
           | "that" and "just".
        
           | andrewmcwatters wrote:
            | I can understand the nuance of your assertion, but the data
            | returned by these results suggests it's not really all that
            | powerful.
            | 
            | Apparently there are so many people who write like me that
            | simple language seems more like a way to mask yourself in a
            | crowd.
        
         | cratermoon wrote:
         | I noted the "analyze" feature didn't seem as useful as it could
         | be because the majority of the words are common articles and
         | conjunctions. I'd like to see a version of analyze that filters
         | out at least the following stop words: a, an, and, are, as, at,
         | be, but, by, for, if, in, into, is, it, no, not, of, on, or,
         | such, that, the, their, then, there, these, they, this, to,
         | was, will, with
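          | 
          | Roughly, assuming the analysis output were a word-to-score
          | mapping (that shape is just a guess), the filter would be
          | trivial:
          | 
          |   STOP_WORDS = {
          |       "a", "an", "and", "are", "as", "at", "be", "but", "by",
          |       "for", "if", "in", "into", "is", "it", "no", "not",
          |       "of", "on", "or", "such", "that", "the", "their",
          |       "then", "there", "these", "they", "this", "to", "was",
          |       "will", "with",
          |   }
          | 
          |   def without_stop_words(word_scores):
          |       # word_scores: {word: over/under-index score}
          |       return {w: s for w, s in word_scores.items()
          |               if w not in STOP_WORDS}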
        
       | chrismorgan wrote:
       | I wonder how much curly quote usage influences things. I type
       | things like curly quotes with my Compose key, and so do most of
       | my top similars; and four or five words with _straight_ quotes
       | show up among the bottom ten in our analyses. (Also etc, because
       | I like to write _& c._)
       | 
        | I'm not going to try comparing it with normalising apostrophes,
        | but I'd be interested in how much difference it would make. It
        | could easily be just that the sorts of people who choose to
        | write in curly quotes are more likely to choose words carefully
        | and thus end up more similar.
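        | 
        | Normalising them before counting would be simple enough, e.g.
        | (a sketch):
        | 
        |   # map curly quotes/apostrophes to straight equivalents
        |   CURLY_TO_STRAIGHT = str.maketrans({
        |       "\u2018": "'", "\u2019": "'",   # single quotes
        |       "\u201c": '"', "\u201d": '"',   # double quotes
        |   })
        | 
        |   def normalise_quotes(text):
        |       return text.translate(CURLY_TO_STRAIGHT)
        | 
        |   print(normalise_quotes("don\u2019t \u201cquote\u201d me"))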
        
       | qsort wrote:
       | Have you tried to analyze whether there is a correlation between
       | "closeness" according to this metric and how often users chat in
       | the same thread? I recognize some usernames that are reported as
       | being similar to me, I wonder if there's some kind of self-
       | selection at play.
        
       | xnorswap wrote:
        | I wonder how much accuracy would improve by expanding from
        | single words to the most common pairs or n-tuples.
       | 
        | You would need more computation to hash, but I bet adding the
        | frequency of the top 50 word pairs and the top 20 most common
        | 3-tuples would be a strong signal.
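        | 
        | For example, a quick sketch of extracting the top pairs/triples
        | with scikit-learn (the parameters are illustrative):
        | 
        |   from sklearn.feature_extraction.text import CountVectorizer
        | 
        |   comments = [
        |       "i think this is a strong signal",
        |       "i think that this would be a strong signal",
        |   ]
        | 
        |   # keep only the 50 most frequent word pairs and triples
        |   vec = CountVectorizer(ngram_range=(2, 3), max_features=50)
        |   counts = vec.fit_transform(comments)
        |   print(vec.get_feature_names_out())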
       | 
        | (Not that the accuracy isn't already good, of course. I am
        | indeed user eterm. I think I've said on this account or that one
        | before that I don't sync passwords, so they are simply different
        | machines that I use. I try not to cross-contribute or double-
        | vote.)
        
       | Frieren wrote:
        | It works for me. The accounts I used a long time ago are there
        | in high positions. I guess that my style is very distinctive.
        | 
        | But I have also seen some accounts that seem to be from other
        | non-native English speakers. They may even have a Latin language
        | as their native one (I just read some of their comments, and, at
        | minimum, some of them seem to also be from the EU). So, I guess,
        | it is also grouping people by their native language other than
        | English.
        | 
        | So, maybe, it is grouping many accounts by the shared bias of
        | different native languages. Probably, we make the same type of
        | mistakes while using English.
        | 
        | My guess would be that accounts of native Indian or Chinese
        | speakers will also be grouped together, for the same reason.
        | Even more so, as those languages are more different from English
        | and the bias probably stronger.
        | 
        | It would be cool if Australians, Britons, and Canadians tried
        | the tool. My guess is that their probability of finding alt
        | accounts is higher, as the populations are smaller and the
        | writing more distinctive than Americans'.
        | 
        | Thanks for sharing the project. It is really interesting.
        | 
        | Also, do not trust the comments too much. There is an incentive
        | to lie so as not to acknowledge alt accounts that were created
        | to remain hidden.
        
       ___________________________________________________________________
       (page generated 2025-04-16 17:00 UTC)