[HN Gopher] Reproducing Hacker News writing style fingerprinting
___________________________________________________________________
Reproducing Hacker News writing style fingerprinting
Author : grep_it
Score : 77 points
Date : 2025-04-16 13:57 UTC (3 hours ago)
(HTM) web link (antirez.com)
(TXT) w3m dump (antirez.com)
| tptacek wrote:
| This is an interesting and well-written post but the data in the
| app seems pretty much random.
| antirez wrote:
| Thank you, tptacek. Thanks to the Internet Archive's cached
| results for "pg" from the post of three years ago, I was able
| to verify that the entries for "pg" are quite similar.
| Consider that it captures just the statistical patterns in
| very common words, so you are not likely to see users that you
| believe are "similar" to yourself. Notably: montrose may well
| really be a secondary account of PG, and was also found as a
| cross reference in the original work of three years ago.
|
| Also note that vector similarity is not reciprocal: one item's
| top-scoring match may itself have many other items that are
| nearer, as in a 2D space where you have a cluster of points
| and another point nearby but a bit farther away.
|
| Unfortunately I don't think this technique works very well for
| discovering _actual_ duplicate accounts, because people often
| post just a few comments from fake accounts. So there is not
| enough data, except where someone consistently uses another
| account to hide their identity.
|
| EDIT: at the end of the post I added the visual representations
| of pg and montrose.
| PaulHoule wrote:
| If you want to do document similarity ranking in general, it
| works to find nearby points in word-frequency space, but not
| as well as (1) applying an autoencoder or another
| dimensionality-reduction technique to the vectors, or (2)
| running a BERT-like model and pooling over the documents [1].
|
| I worked on a search engine for patents that used the first
| approach; our evaluations showed it was much better than other
| patent search engines, and we had no trouble selling it
| because customers could feel the difference in demos.
|
| I tried dimensionality reduction on the BERT vectors, and in
| every case I tried it made relevance worse. (BERT has already
| learned a lot that would be thrown away; there isn't more to
| learn from my particular documents.)
|
| I don't think either of these helps with the "finding articles
| authored by the same person" problem, because there one
| assumes the same person always uses the same words, whereas
| documents about the same topic use synonyms, which is exactly
| what (1) and (2) pick up. There is a big literature on
| determining authorship based on style:
|
| https://en.wikipedia.org/wiki/Stylometry
|
| [1] With https://sbert.net/ this is so easy.
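|
| For instance, a rough sketch of (2) with sentence-transformers
| (the model name is just a common default, not a
| recommendation):
|
|     from sentence_transformers import SentenceTransformer
|     import numpy as np
|
|     docs = ["first document text ...", "second document text ..."]
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     # Each document becomes one pooled, normalized vector.
|     emb = model.encode(docs, normalize_embeddings=True)
|
|     # Cosine similarity between the two documents.
|     print(float(np.dot(emb[0], emb[1])))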
| antirez wrote:
| Indeed, but my problem is: all those vector databases
| (including Redis!) are always thought of as useful in the
| context of learned embeddings, BERT, CLIP, ... But I really
| wanted to show that vectors are very useful and interesting
| outside that space. Now, I also like encoders very much, but I
| have the feeling that Vector Sets, as a data structure, need
| to be presented as a general tool. So I really cherry-picked a
| use case that I liked and where neural networks were not
| present. Btw, Redis Vector Sets natively support
| dimensionality reduction by random projection, in case the
| vector is too redundant. Yet, in my experiments, I found that
| using binary quantization (also supported) is a better way to
| save CPU/space compared to RP.
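|
| To make the trade-off concrete outside of Redis, a toy numpy
| sketch of the two ideas (not the actual Vector Sets
| implementation):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     v = rng.normal(size=350)             # a 350-d style vector
|
|     # Random projection: multiply by a fixed random matrix to
|     # shrink the dimensionality (here 350 -> 64 floats).
|     proj = rng.normal(size=(64, 350)) / np.sqrt(64)
|     v_rp = proj @ v
|
|     # Binary quantization: keep only the sign of each component,
|     # i.e. 1 bit per dimension instead of a 32-bit float.
|     v_bin = np.packbits(v > 0)           # 350 bits ~= 44 bytes
|
|     print(v_rp.shape, v_bin.nbytes)      # (64,) 44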
| formerly_proven wrote:
| I'm surprised no one has made this yet with a clustered
| visualization.
| antirez wrote:
| Redis supports random projection to a lower dimensionality,
| but the reality is that projecting a 350d vector into 2d is
| nice but does not remotely capture the "reality" of what is
| going on. Still, it is a nice idea to try at some point.
| However, I would do that with more than 350 top words: when I
| used 10k, it captured interests much more strongly than style,
| so a 2D projection of that would be much more interesting, I
| believe.
| layer8 wrote:
| Given that some matches are "mutual" and others are not, I
| don't see how that could translate to a symmetric distance
| measure.
| antirez wrote:
| Imagine 2D space: it has the same property!
|
| You have three points near each other, and a fourth a bit more
| distant. Point 4's best match is point 1, but point 1's best
| matches are points 2 and 3.
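|
| A minimal numeric sketch of that situation (made-up 2D
| coordinates, just to show the asymmetry):
|
|     import numpy as np
|
|     # Point 1 is the edge of a small cluster (points 1-3);
|     # point 4 sits farther away on its own.
|     pts = np.array([[0.2, 0.0],   # point 1
|                     [0.1, 0.0],   # point 2
|                     [0.0, 0.0],   # point 3
|                     [1.0, 0.0]])  # point 4
|
|     def nearest(i):
|         d = np.linalg.norm(pts - pts[i], axis=1)
|         d[i] = np.inf              # ignore the point itself
|         return int(np.argmin(d))
|
|     print(nearest(3) + 1)  # 1: point 4's best match is point 1
|     print(nearest(0) + 1)  # 2: but point 1's best match is point 2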
| layer8 wrote:
| Good point, but the similarity score between mutual matches
| is still different, so it doesn't seem to be a symmetric
| measure?
| PaulHoule wrote:
| Personally I like this approach a lot
|
| https://scikit-learn.org/stable/modules/generated/sklearn.ma...
|
| I think other methods are more fashionable today
|
| https://scikit-learn.org/stable/modules/manifold.html
|
| particularly multidimensional scaling, but personally I think
| t-SNE plots are less pathological (they don't have as many of
| those crazy cusps that make me think it's projecting down from
| a higher-dimensional surface which is near-parallel to the
| page).
|
| After processing documents with BERT I really like the clusters
| generated by the simple and old k-Means algorithm
|
| https://scikit-learn.org/stable/modules/generated/sklearn.cl...
|
| It has the problem that it always finds 20 clusters if you set
| k=20, and one that really ought to be a single big cluster
| might get treated as three little clusters, but the clusters I
| get from it reflect the way I see things.
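|
| A small sketch of that pipeline with scikit-learn (X is a
| stand-in for the BERT document vectors):
|
|     import numpy as np
|     from sklearn.cluster import KMeans
|     from sklearn.manifold import TSNE
|
|     X = np.random.rand(200, 384)   # placeholder embeddings
|
|     # Cluster in the original embedding space ...
|     labels = KMeans(n_clusters=20, n_init=10,
|                     random_state=0).fit_predict(X)
|
|     # ... and project to 2D only for plotting.
|     xy = TSNE(n_components=2, random_state=0).fit_transform(X)
|
|     # Plotting xy[:, 0] vs xy[:, 1] colored by `labels` gives
|     # the cluster map.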
| giancarlostoro wrote:
| I tried my name, and I don't think a single "match" is any of
| my (very rarely used) throwaway alts ;) I guess I have a few
| people I talk like?
| antirez wrote:
| When they are rarely used (a small number of total words
| produced), they don't carry meaningful statistical info for a
| match, unfortunately. A few users here reported finding actual
| duplicate accounts they used in the past.
| delichon wrote:
| I got 3 correct matches out of 20, and I've had about 6
| accounts total (using one at a time), with at least a fair
| number of comments in each. I guess that means my word choices
| are more of an outlier than yours, or there is just more to
| match. So it's not really good enough to reliably identify alt
| accounts, but it is quite suggestive.
| giancarlostoro wrote:
| I think if you rule out insanely common words, it might get
| scary accurate.
| lolinder wrote:
| Actually, the way that these things work is usually by
| focusing exclusively on the usage patterns of very common
| (top 500) words. You get better results by ignoring content
| words in favor of the linking words.
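|
| A rough sketch of the idea (tiny hypothetical word list; the
| real one would be the top few hundred words over the whole
| corpus):
|
|     from collections import Counter
|
|     TOP_WORDS = ["the", "to", "and", "of", "a", "that", "i",
|                  "is", "you", "it"]
|
|     def style_vector(text):
|         words = text.lower().split()
|         counts = Counter(words)
|         total = max(len(words), 1)
|         # Frequency of each common "linking" word.
|         return [counts[w] / total for w in TOP_WORDS]
|
|     print(style_vector("I think that it is the way to do it"))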
| 38 wrote:
| this got two accounts that I used to use
| antirez wrote:
| Great! Thanks for the ACK.
| Boogie_Man wrote:
| No matches higher than .7something and no mutual matches let's go
| boys I'm a special unique snowflake
| weinzierl wrote:
| How does it find the high similarity between "dang" and "dangg"
| when the "dangg" account has no activity (like comments) at all?
|
| https://antirez.com/hnstyle?username=dang&threshold=20&actio...
| antirez wrote:
| Probably it used to have some when the database was created,
| and the comments were removed later.
| hammock wrote:
| The "analyze" feature works pretty well.
|
| My comments underindex on "this" - because I have drilled into my
| communication style never to use pronouns without clear one-word
| antecedents, meaning I use "this" less frequently that I would
| otherwise.
|
| They also underindex on "should" - a word I have drilled OUT of
| my communication style, since it is judgy and triggers a
| defensive reaction in others when used. (If required, I prefer
| "ought to")
|
| My comments also underindex on personal pronouns (I, my). Again,
| my thought on good, interesting writing is that these are to be
| avoided.
|
| In case anyone cares.
| antirez wrote:
| That's very interesting, as I noticed that certain outliers
| indeed seemed to be conscious attempts.
| croemer wrote:
| Since you seem to care about your writing, I'm wondering why
| you used "that" here?
|
| > I use "this" less frequently that I would otherwise
|
| Isn't it "less than" as opposed to "less that"?
| hammock wrote:
| Typo. Good catch
| Joker_vD wrote:
| > I prefer "ought to"
|
| I too like it when others use it, since a very easy and pretty
| universal retort against "you ought to..." is "No, I don't owe
| you anything".
| jcims wrote:
| I (also?) felt the 'words used less often' were much easier to
| connect to as a conscious effort. I pointed chatgpt to the
| article and pasted in my results and asked it what it could
| surmise about my writing style based on that. It probably
| connected about as well as the average horoscope but was still
| pretty interesting!
| tobr wrote:
| > Again, my thought on good, interesting writing is that these
| are to be avoided.
|
| You mean, "I think this should be avoided"? ;)
| hammock wrote:
| Nice one _high five_
| alganet wrote:
| Cool tool. It's a shame I don't have other accounts to test it.
|
| It's also a tool for wannabe impersonators to hone their
| style-mimicking skills!
| shakna wrote:
| I don't have other accounts, but still matched at 85+% accuracy
| for a half dozen accounts. Seems I don't have very original
| thoughts or writing style.
| andrewmcwatters wrote:
| Well, well, well, cocktailpeanuts. :spiderman_pointing:
|
| I suspect, antirez, that you may have greater success removing
| some of the most common English words in order to find truly
| suspicious correlations in the data.
|
| cocktailpeanuts and I, for example, mutually share some words
| like:
|
| because, people, you're, don't, they're, software, that, but,
| you, want
|
| Unfortunately, this is a forum where people will use words like
| "because, people, and software."
|
| Because, well, people here talk about software.
|
| <=^)
|
| Edit: Neat work, nonetheless.
| alganet wrote:
| That seems to be a misconception.
|
| The usage frequency of simple words is a powerful tell.
| cratermoon wrote:
| Indeed, some writing styles make frequent use of words like
| "that" and "just".
| andrewmcwatters wrote:
| I can understand the nuance of your assertion, but the data
| returned by these results suggests it's not really all that
| powerful.
|
| Apparently there are so many people who write like me that
| simple language seems more like a way to mask yourself in a
| crowd.
| cratermoon wrote:
| I noted the "analyze" feature didn't seem as useful as it could
| be because the majority of the words are common articles and
| conjunctions. I'd like to see a version of analyze that filters
| out at least the following stop words: a, an, and, are, as, at,
| be, but, by, for, if, in, into, is, it, no, not, of, on, or,
| such, that, the, their, then, there, these, they, this, to,
| was, will, with
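|
| A trivial sketch of that filter, reusing the list above:
|
|     STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be",
|                   "but", "by", "for", "if", "in", "into", "is",
|                   "it", "no", "not", "of", "on", "or", "such",
|                   "that", "the", "their", "then", "there",
|                   "these", "they", "this", "to", "was", "will",
|                   "with"}
|
|     def content_words(words):
|         # Drop stop words before computing over/under-indexing.
|         return [w for w in words if w.lower() not in STOP_WORDS]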
| chrismorgan wrote:
| I wonder how much curly quote usage influences things. I type
| things like curly quotes with my Compose key, and so do most of
| my top similars; and four or five words with _straight_ quotes
| show up among the bottom ten in our analyses. (Also etc, because
| I like to write _& c._)
|
| I'm not going to try comparing it with normalised apostrophes,
| but I'd be interested to see how much of a difference it would
| make. It could easily be just that the sorts of people who
| choose to write in curly quotes are more likely to choose
| words carefully and thus end up more similar.
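|
| If someone did want to test it, the normalisation itself is
| simple enough (a guess at the relevant characters):
|
|     CURLY_TO_STRAIGHT = str.maketrans({
|         "\u2018": "'", "\u2019": "'",   # curly single quotes
|         "\u201c": '"', "\u201d": '"',   # curly double quotes
|     })
|
|     def normalise(text):
|         return text.translate(CURLY_TO_STRAIGHT)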
| qsort wrote:
| Have you tried to analyze whether there is a correlation between
| "closeness" according to this metric and how often users chat in
| the same thread? I recognize some usernames that are reported as
| being similar to me; I wonder if there's some kind of self-
| selection at play.
| xnorswap wrote:
| I wonder how much accuracy would improve by expanding from
| single words to the most common pairs or n-tuples.
|
| You would need more computation to hash, but I bet adding
| frequency of the top 50 word-pairs and top 20 most common
| 3-tuples would be a strong signal.
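|
| A rough sketch of counting those pairs and 3-tuples
| (scikit-learn is just one convenient way to do it):
|
|     from sklearn.feature_extraction.text import CountVectorizer
|
|     comments = ["example comment text here",
|                 "another example comment"]
|
|     # Count single words, pairs, and triples in one pass.
|     vec = CountVectorizer(ngram_range=(1, 3))
|     counts = vec.fit_transform(comments)
|
|     # Most frequent n-grams across the corpus.
|     totals = counts.sum(axis=0).A1
|     names = vec.get_feature_names_out()
|     print(sorted(zip(totals, names), reverse=True)[:20])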
|
| (That said, the accuracy is already good, of course. I am
| indeed user eterm. I think I've said on this account or that
| one before that I don't sync passwords, so they are simply
| different machines that I use. I try not to cross-contribute
| or double-vote.)
| Frieren wrote:
| It works for me. The accounts I used a long time ago are there
| in high positions. I guess that my style is very distinctive.
|
| But I have also seen some accounts that seem to be from other
| non-native English speakers. They may even have a Latin
| language as their native one (I just read some of their
| comments and, at minimum, some of them seem to also be from
| the EU). So, I guess, it is also grouping people by their
| native language other than English.
|
| So, maybe, it is grouping many accounts by the shared bias of
| their different native languages. Probably we make the same
| types of mistakes while using English.
|
| My guess is that accounts of native Indian or Chinese speakers
| will also be grouped together, for the same reason. Even more
| so, as those languages are more different from English and the
| bias is probably stronger.
|
| It would be cool if Australians, Britons, and Canadians tried
| the tool. My guess is that their probability of finding alt
| accounts is higher, as the populations are smaller and the
| writing is more distinctive than Americans'.
|
| Thanks for sharing the project. It is really interesting.
|
| Also, do not trust the comments too much. There is an
| incentive to lie and not acknowledge alt accounts if they were
| created to remain hidden.
___________________________________________________________________
(page generated 2025-04-16 17:00 UTC)