[HN Gopher] Show HN: Using stylometry to find HN users with alte...
___________________________________________________________________
Show HN: Using stylometry to find HN users with alternate accounts
Author here. This site lets you put in a username and get the users
with the most similar writing style to that user. It confirmed
several users who I suspected were alts and after informally asking
around has identified abandoned accounts of people I know from many
years ago. I made this site mostly to show how easy this is and how
it can erode online privacy. If some guy with a little bit of
Python, and $8 to rent a decent dedicated server for a day can make
this, imagine what a company with millions of dollars and a couple
dozen PhD linguists could do. Here's Paul Graham:
https://stylometry.net/user?username=pg Here are some frequent HN
commenters: (EDIT: Removed due to privacy concerns)
Author : costco
Score : 394 points
Date : 2022-11-26 18:03 UTC (4 hours ago)
(HTM) web link (stylometry.net)
(TXT) w3m dump (stylometry.net)
| oblib wrote:
| I've only had one account here. The highest match has a 0.624
| score and the lowest a 0.572. I'm not sure if that means I'm
| unique or common but I'd like to know.
| macintux wrote:
| My nearest match is only at 0.406. It'd be interesting to see who
| the most unique commenters are, but it's also quite possible it
| wouldn't be flattering.
| joisig wrote:
| 0.2506 is my nearest match
| pubby wrote:
| 0.35 is my nearest. In hopes of lowering it even further, here
| are some nonsensical opinions never expressed on HN before: 1)
| Programming peaked with COBOL 2) Paul Graham is responsible for
| 90% of SIDS cases 3) There's no reason to use car when cdr
| exists.
| seydor wrote:
| Well the only solution is to have too many alts so that nobody
| can believe you can possibly have that many
| WalterBright wrote:
| Over in the D language forums, we welcome people who post under a
| pseudonym, and our policy is we won't allow attempts to unmask
| them.
|
| This is to protect high profile users who are secretly enjoying
| programming in D rather than the language they are supposed to
| use.
|
| And, of course, to protect users who feel they might be
| discriminated against if their background was known.
| bo1024 wrote:
| It's very important for those people to be aware of these style
| analysis attacks! Glad this post is raising awareness.
| xwolfi wrote:
| Wow... how !
| notduncansmith wrote:
| This has been a great way to find people whose commentary I
| enjoy!
| dsr_ wrote:
| This is interesting.
|
| I'm 0.566 correlated with logfromblammo -- and while we are
| definitely not the same person, I could easily imagine writing a
| sentence such as:
|
| "For some bizarre reason, management has not yet assigned a task
| to their programmer underlings to automated themselves out of
| existence. I can't imagine why."
|
| which is theirs, not mine, from about a year ago. I like that.
|
| On the other hand, I'm nearly as correlated with peterwwillis:
| 0.5485 -- who has no comments and no submissions.
| costco wrote:
| > On the other hand, I'm nearly as correlated with
| peterwwillis: 0.5485 -- who has no comments and no submissions.
|
| This is due to the Firebase API not updating when users ask the
| admins to move their comments to another account.
| lifeisstillgood wrote:
| I had a similar experience finding my most likely alt (.50
| suggesting I am a unique snowflake as I have always thought
| :-), my most likely alt is writing certainly in a style I
| appreciate and on subjects I often mention.
| culi wrote:
| Similar to how they make adversarial fashion[0][1] in order to
| not be tracked by face id AI, I wonder if we can make adversarial
| stylometry tools to run your comments through in order to
| anonymize it
|
| .. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-
| to...
|
| .. [1] https://adversarialfashion.com/
| carewell wrote:
| OP links to a paraphrasing tool on their website.
| pugworthy wrote:
| Strip leading/trailing white space from the name if it says no
| match.
| robertlagrant wrote:
| Clicking on my top match (0.61) - I can see the similarity. I
| also note they quote the same way, with a > symbol. I wonder if
| that helps!
| lostmyacctoops wrote:
| I'd be _very_ curious to know if these algorithms can link very
| different _types_ of text. I'm not surprised that my style is
| "derivable" on HN, but what if you included my slash-fic pieces,
| my research papers, etc, would it still "catch" me?
|
| Also, talk about a chilling effect. I was already vaguely aware
| of this, and now I'm overthinking every word I'm thinking/typing.
| uberduper wrote:
| I would have expected to be a closer match to myself.
|
| > uberduper: 0.9999999999999991
| sdwr wrote:
| Ooooeeoo oo oooo. Ooooeeooo oooo<<barbra striesand>>
| wizzwizz4 wrote:
| Yes, sadly. In this case, it'd be an arsehole move, but good
| point.
| phnofive wrote:
| If you want to ask HN to remove your data, send a message to
| hn@ycombinator.com.
| CharlesW wrote:
| Not to diminish one bit how you're feeling, but the bright side
| is: Today you know this is easily done (information you didn't
| have yesterday), that the creator had no intention of "outing"
| you specifically, and that you can take steps to obfuscate this
| specific aspect of your posts that connects your public alts.
| dibt wrote:
| Since it looks for similar word usage, false positives seem to
| appear more often when specific topics are talked about, like
| stocks or crypto.
|
| Does this ignore stop words? Or do all words have the same
| weighting? I wonder if only focusing on stop words would give a
| more accurate measure. Maybe we are more comfortable with certain
| stop words more than others?
|
| https://en.wikipedia.org/wiki/Stop_words
|
| "Stop words are the words in a stop list (or stoplist or negative
| dictionary) which are filtered out (i.e. stopped) before or after
| processing of natural language data (text) because they are
| insignificant."
| costco wrote:
| All words have the same weighting. I don't ignore stop words,
| in fact most of the ngrams I use are composed almost
| entirely of stop words. Maybe it'd be more effective if I
| ignored them.
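
The stop-word question above can be explored with a small experiment. The following is a minimal Python sketch, not the site's code: the stop-word list, the sample sentences, and the choice of cosine similarity are all assumptions made for illustration. It builds a profile from stop-word frequencies only, so topic words such as "stocks" or "crypto" cannot influence the score.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is", "but"]

def stop_word_profile(text):
    """Relative frequency of each stop word among all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up sample texts standing in for two users' comments.
a = stop_word_profile("It is the style of the writer that matters, but it is hard to hide.")
b = stop_word_profile("The topic changes, but the habit of leaning on that and it stays.")
print(round(cosine(a, b), 3))
```

Whether this is more accurate than using all words is an open question in the thread; the sketch only makes the comparison easy to try.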
| afarviral wrote:
| I'm tempted to use it to find like-minded friends :)
| SnowHill9902 wrote:
| Anything like this for Reddit?
|
| Would translating to another language and back defend against this
| algorithm?
| costco wrote:
| > Anything like this for Reddit?
|
| No but it would be easily adaptable especially given that
| Pushshift is archiving every Reddit comment. Based on some of
| the feedback I'm getting here I don't know if I should open
| source this even though it really wasn't that hard to make.
|
| > Would translating to another language and back defend against
| this algorithm?
|
| Yes. But then you have to send your original comment to a
| translation company so there are privacy concerns there too.
| operator-name wrote:
| I wouldn't worry about that too much as someone's already
| done something similar for reddit
| (https://towardsdatascience.com/using-nlp-to-identify-
| reddito...), and has released their code publicly
| (https://github.com/jabraunlin/reddit-user-id)
|
| Given the technique used, I don't see why something simple
| and local wouldn't defeat it? The "easiest" technique would
| be to use this weighting as a negative metric in rewriting.
| hcs wrote:
| > But then you have to send your original comment to a
| translation company so there are privacy concerns there too.
|
| There are modern offline translation systems available such
| as Bergamot https://browser.mt/
| EMIRELADERO wrote:
| > Based on some of the feedback I'm getting here I don't know
| if I should open source this even though it really wasn't
| that hard to make.
|
| I'd say you should. I'd rather see this as being publicly and
| freely available to everyone rather than some shady "Big
| Tech" analytics company.
|
| If the "weapons" exist, I would feel more comfortable knowing
| everyone can access them, not just an elite that can use it
| for their own (selfish) purposes.
| A4ET8a8uTh0 wrote:
| I am genuinely torn, because my initial reaction was almost
| the exact opposite, but the comparison to a weapon does
| ring true. And there is indeed an argument to be made for
| level playing field. At the very least, maybe counter-
| measures can be developed.
| Terretta wrote:
| People don't usually understand privacy risks till their
| own curtains fall down.
| AtlasBarfed wrote:
| What's a high correlation number?
| ThrowawayTestr wrote:
| Haha, you got me and my main account. That's spooky.
| jonnycomputer wrote:
| Obviously the next thing to do is make this a popup on someone's
| account name when you hover over it.
| psychphysic wrote:
| Hmmm, doesn't seem to work. But you have convinced me (and many
| others?) to search for our alts consecutively, so now you do know who
| has alts?
| elorant wrote:
| Sounds like a nice tool to find friends. You locate people who
| might think like you.
| RepAgent wrote:
| What's up with cluster of users like:
|
| j_s,password4321,carolinew,colinwright,kuharich etc.
|
| https://stylometry.net/user?username=j_s
| https://stylometry.net/user?username=carolinew
| https://stylometry.net/user?username=colinwright
| https://stylometry.net/user?username=password4321
|
| Lowest match for j_s is 0.80 and all but one is black.
| saurik wrote:
| Why are some users bold?
| srean wrote:
| The non-bold are dead accounts I think
| saurik wrote:
| It isn't due to a mere property of the user, as, for example,
| cushman is not bold as the #2 result for tptacek but is bold
| as the #2 result for icambron.
| srean wrote:
| Good point.
| stavros wrote:
| FYI, the GP said above that bold usernames are those for
| which symmetry holds (ie they're both in each other's top
| ten).
| costco wrote:
| Say you see user2 listed in bold on user1's page. That means
| that user1 is also in user2's top 20 users. In my experience it
| is often an indicator of a good match (but not always). I
| should probably explain that on the site.
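
The bolding rule described above (user2 is bold on user1's page when user1 also appears in user2's top 20) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the score table are hypothetical, and `k` is shrunk to 2 so the example stays readable.

```python
def top_k(scores, user, k):
    """Return the k users most similar to `user`, best first, excluding the user."""
    others = [(s, u) for u, s in scores[user].items() if u != user]
    others.sort(reverse=True)
    return [u for s, u in others[:k]]

def mutual_matches(scores, user, k=20):
    """Users in `user`'s top k whose own top k also contains `user`."""
    return [u for u in top_k(scores, user, k) if user in top_k(scores, u, k)]

# Hypothetical similarity scores; similarity is symmetric but rankings are not.
scores = {
    "alice": {"bob": 0.60, "carol": 0.50, "dave": 0.40},
    "bob":   {"alice": 0.60, "carol": 0.70, "dave": 0.65},
    "carol": {"alice": 0.50, "bob": 0.70, "dave": 0.30},
    "dave":  {"alice": 0.40, "bob": 0.65, "carol": 0.30},
}
print(mutual_matches(scores, "alice", k=2))  # prints ['carol']
```

Here bob is alice's closest match but would not be bold, because bob has two closer matches of his own. That is the same effect saurik noticed: whether a name is bold depends on the pair, not on the user alone.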
| layer8 wrote:
| Instead of making it binary, you could use a gradient
| indicating the strength of the mutual correlation (like how
| HN colors downvoted comments).
| franze wrote:
| totally on spot
|
| my current and my old account
| jonnycomputer wrote:
| Well, one of the closest on my list is my twin, so there's that.
| Retr0id wrote:
| It didn't find my alt, but the second match is one of my twitter
| mutuals - I wonder if we've inadvertently borrowed style quirks
| from each other.
| samwillis wrote:
| Sticking myself in (I haven't ever had another account) my
| closest match (at 0.43) is the maintainer of an Open Source
| project which I have occasionally commented about. They are also
| British, as am I.
|
| My guess is that as they commonly mention the project and I have
| on a number of occasions, that has formed the link. Plus maybe
| usage of common British terms, but that seems far less
| significant.
|
| It's super interesting!
|
| It would be good if there were more controls to filter the type
| of words and language that are used for the matching algorithm.
| So you could say exclude words not in the dictionary. I wonder
| how that would affect my link with this other person.
| WaitWaitWha wrote:
| I checked a few random user names and I am confused.
|
| - Why is the author costco[0] not in this lookup?
|
| [0]: https://stylometry.net/user?username=costco
| Aachen wrote:
| - Their first comment and submission were 4 hours ago.
|
| - The text on that page is accurate it seems.
| julienreszka wrote:
| why is my username not exactly equal to 1?
| https://stylometry.net/user?username=julienreszka
| costco wrote:
| Python/floating point rounding error. It doesn't mean anything.
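
The self-match scores just below 1.0 reported in this thread are what floating point arithmetic tends to produce when a vector is compared with itself. A minimal Python sketch follows, assuming the score is a cosine similarity (the thread does not confirm the exact metric) and using a made-up frequency vector. It shows a tolerant comparison and a rounded display, either of which would hide the noise.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

freqs = [0.1, 0.2, 0.3, 0.7]  # hypothetical frequency vector
self_score = cosine(freqs, freqs)

# The raw value may or may not be exactly 1.0, depending on how roundings fall.
print(repr(self_score))

# Tolerant comparison instead of == 1.0.
print(math.isclose(self_score, 1.0))

# Rounded display, as several commenters suggest.
print(f"{self_score:.4f}")
```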
| bhaney wrote:
| Well now I'm self conscious about my closest match being an 0.34
| when so many other people are reporting much closer matches with
| accounts that aren't alts. Do I write weirdly?
| spapas82 wrote:
| Same for me, the closest match is 0.36. But I expected that
| because I don't speak english very well so the pool of
| candidates is small.
| CobaltFire wrote:
| My closest is 0.40, so I'm right there with you.
|
| Native English speaker as well.
| klohto wrote:
| 0.36 here! Out of curiosity, are you a native speaker?
| bhaney wrote:
| I am, yes.
| quink wrote:
| 0.39 for myself, I'm a non-native speaker.
| stephc_int13 wrote:
| What is the threshold to be reasonably confident that two
| accounts are from the same individual?
|
| I have only ever had one account here and the closest match is at
| 0.47.
| jefftk wrote:
| Tried my account thinking "I don't have any alts" but it turns
| out I do! In 2018 I changed my username from "cbr" to "jefftk"
| and it pulled that right up:
| https://stylometry.net/user?username=jefftk
| CobaltFire wrote:
| Interesting; I must have a fairly unique style as there are no
| matches over 0.40 for me.
|
| I'm a native English speaker as well, so I'm unsure how to feel
| about that.
| SkyMarshal wrote:
| Oddly, I am not an exact match to myself.
|
| > Most likely candidates:
|
| > skymarshal: 0.9999999999999997
|
| The other few usernames I tested (pg, dang, some random ones from
| this thread) all matched themselves at 1.0.
| ChrisMarshallNY wrote:
| Interesting, but it gave me 20 accounts, and I _know_ that I only
| have this one.
| notacoward wrote:
| Seems pretty spot-on to me. I tried it with two accounts I was
| already certain were alts - based on other factors like favorite
| topics and common enemies as well as style/tone - and the top
| hits for both were the ones I would have expected.
| nwiswell wrote:
| I don't have an alt but it would be cool to meet my stylometry-
| neighbors. I'm curious whether the writing similarity translates
| to oral communication too
| kiernanmcgowan wrote:
| Love a little NLP project on a public dataset - thanks for
| sharing!
| jimhi wrote:
| Amazing and I thought my doxxing tool was terrifying -
| https://news.ycombinator.com/item?id=32278871
|
| I am afraid to combine all these methods
| lijogdfljk wrote:
| Yea.. i guess it's time to stop bothering with alt
| accounts/etc. I'll just make one account, maybe differently
| named on different services (makes scraping just a _pinch_
| easier) but aside from that all i can do is modify/remove old
| posts.
|
| Bit of a shame for useful posts/discussions.. but the internet
| is getting really.. finger print laden.
| timeon wrote:
| I had a hard time understanding some comments made by my closest
| match. I guess this is a good reality check. I need to learn how to
| write more legible posts now.
| FartyMcFarter wrote:
| Sorry, what did you mean? :P
| schappim wrote:
| Interesting that the Op doesn't come up in the search:
| https://stylometry.net/user?username=costco
| Beltalowda wrote:
| Not surprising considering the account had no activity before
| today.
| Aachen wrote:
| Their first comment and submission were 4 hours ago. Text on
| the page is accurate it seems.
| 4qz wrote:
| This is an evil website. We won't have any anonymity soon. The
| highest match is my years old banned account that I forgot about.
| Where did you get the data from?
| JadeNB wrote:
| > This is an evil website. We won't have any anonymity soon.
| The highest match is my years old banned account that I forgot
| about. Where did you get the data from?
|
| I'd way rather have someone tell me "look at all the things I
| can find out about you" so that I can act accordingly (whatever
| that means!) rather than what we've mostly actually got, which
| is companies silently exploiting my data and doing everything
| they can to mumble reassuring but legally ineffective formulas
| assuring me that they deeply respect my privacy.
| costco wrote:
| HN Firebase API. I just wrote a program in C++ with libcurl to
| get https://hacker-news.firebaseio.com/v0/item/1.json,
| https://hacker-news.firebaseio.com/v0/item/2.json,
| https://hacker-news.firebaseio.com/v0/item/3.json, ...
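
The sequential crawl described above can be sketched as follows. The author used C++ with libcurl; this Python version with `urllib` is a stand-in, not the author's program. Only the item URL pattern comes from the comment. The function names, the timeout, and the assumption that ids without an item decode to `None` are illustrative.

```python
import json
import urllib.request

# URL pattern quoted in the comment above.
BASE = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def item_url(item_id):
    """Build the URL for a single item id."""
    return BASE.format(item_id)

def fetch_item(item_id, timeout=10):
    """Download one item and decode its JSON body.

    Assumption: ids with no item come back as JSON null, i.e. None.
    """
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)

def crawl(first_id, last_id, fetch=fetch_item):
    """Yield (id, item) for every id in the inclusive range that has an item.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    for item_id in range(first_id, last_id + 1):
        item = fetch(item_id)
        if item is not None:
            yield item_id, item
```

Calling `list(crawl(1, 3))` would request the three URLs listed in the comment. A full crawl would also need retries and some concurrency, which are left out here.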
| ufmace wrote:
| I don't know that I'd call this evil. We have no idea who else
| is using this kind of technology but not making the results
| public. Better to know what's possible and take measures to
| make it less effective.
| weinzierl wrote:
| Please don't shoot at the messenger. costco shared this
| voluntarily and I can see no bad intention.
|
| We should see it as an opportunity to learn how easy it is to
| associate different pseudonymous accounts. Nothing drives this
| point home better than a practical demo.
|
| We can be pretty sure stylometry is used widely by bad actors
| already and we should not punish people who help to spread the
| word about these technical possibilities.
| ghaff wrote:
| And this is actually quite a simple approach--which is
| interesting in and of itself. While there would be
| diminishing returns, there are a ton of other techniques you
| could use to make stronger inferences about similarity.
| vfinn wrote:
| Imagine using this across different platforms :/, and let alone
| using different techniques in addition...
|
| edit: maybe you'd catch some criminals if you tried to match
| reddit against dark web for example
| woodruffw wrote:
| HN has an Algolia-based API. It's also _very_ easy to crawl.
|
| I wouldn't call this evil, however: it's merely demonstrating a
| technique that you _should_ be aware of, if you're a privacy-
| conscious person. It looks like they also provide some
| resources for avoiding stylometric detection.
| nanidin wrote:
| I would bet my bottom dollar that the likes of Reddit and
| Google already have models to turn a corpus of text into
| probable demographic data and models to measure the
| similarity of users.
| faeriechangling wrote:
| It's just statistics. I recall that during his whistleblowing,
| Snowden intentionally took anti-stylometry measures.
| wizofaus wrote:
| What match level would you expect to see between two randomly
| chosen individuals?
| seydor wrote:
| does it use the _most_ used words or least used?
| yyt554 wrote:
| Fun exercise would be to find all accounts that suddenly stopped
| posting around today and correlate them with new accounts
| created around today.
|
| All those scared folks who naively think that it's not too late
| yet. Busted.
| super256 wrote:
| Ahhh, does anyone remember the hacking crew that leaked ETERNALBLUE
| and other NSA tools and exploits? The Shadow Brokers.
|
| They were always communicating in some kind of meme-russian, and
| their texts were funny to read. [1]
|
| I believe their writing mostly defeated this kind of analysis, at
| the cost of looking like idiots (which was probably the reason no
| one sent them crypto-dollars to buy that stuff exclusively).
|
| Here's an excerpt:
|
| "Attention government sponsors of cyber warfare and those who
| profit from it !!!!
|
| How much you pay for enemies cyber weapons? Not malware you find
| in networks. Both sides, RAT + LP, full state sponsor tool set?
| We find cyber weapons made by creators of stuxnet, duqu, flame.
| Kaspersky calls Equation Group. We follow Equation Group traffic.
| We find Equation Group source range. We hack Equation Group. We
| find many many Equation Group cyber weapons. You see pictures. We
| give you some Equation Group files free, you see. This is good
| proof no? You enjoy!!! You break many things. You find many
| intrusions. You write many words. But not all, we are auction the
| best files."
|
| [1]
| https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...
| lettergram wrote:
| Did something similar in 2018 (still running locally) which could
| unmask anyone
|
| https://twitter.com/austingwalters/status/104189476543920128...
|
| Made both Metacortex.me and insideropinion.com
|
| The idea being you don't actually need an active directory. It
| would drop in, figure out all the users (provided one account was
| on the AD) and would monitor everyone's skill sets, morale,
| schedule, etc. Worked super well for what it was / is.
| thr0v_awway wrote:
| writing from throwaway:
|
| Holy shit, it works really, really good. It found all of my older
| accounts.
| oliwary wrote:
| Cool! I wonder if it could be run backwards, to identify the
| users on hackernews with the most unique voices.
| Wistar wrote:
| I have only ever had a single account but it returned 19
| possibles with no confidence above .54 but 11 bolded. My own
| account was listed at the top with a confidence of .9999.
| Macha wrote:
| Yeah, I have a bunch of bolded mutuals but none above 0.45. I
| think I have had one or two alts in the past, but probably they
| didn't make the 10000 word threshold for inclusion (nor can I
| remember their names to check if they work in inverse).
| karol wrote:
| Are you going to try it on Twitter?
| srean wrote:
| I tried it on a few user-ids that I strongly suspected were owned
| by the same person. My hunches stand corroborated. Not sure who
| is corroborating whom though, me or the script.
|
| Good job.
| msla wrote:
| It puts almost all of my old accounts decently near the top, but
| my original account is almost comically low.
| ALittleLight wrote:
| Of the top ten accounts listed for my name two of them are me.
| zem wrote:
| heh, I looked up the top bold hit for my name and they really do
| sound a bit like me (:
| dibt wrote:
| This doesn't seem to include text from submissions.
|
| I ran it on Brian Armstrong's temp account from here, and it said
| it didn't write 10,000 characters:
|
| https://news.ycombinator.com/item?id=3754664
|
| EDIT: Or maybe it's something else because Brian only wrote less
| than 6k characters. But then why can my account be looked up?
|
| Also, I would guess quoted replies are included, which muddies
| the analysis. Seems to be a very naive implementation. Much more
| can be done, but this was probably just a quick project.
| costco wrote:
| Quoted replies shouldn't be included unless there's a bug on my
| end. Submission text is not included, though it probably should
| be.
| anpat wrote:
| This needs to exclude "Who is hiring?" posts because it confuses me
| with a few of my wonderful former colleagues!
| antirez wrote:
| Writing "antirez" shows accounts with Spanish names (none is
| mine). I guess Italian and Spanish speakers write English very
| similarly, but on HN there are a lot more Spanish speakers than
| Italian ones so that's what I get.
| operator-name wrote:
| What does the bold signify? For example when I search for dang
| (https://stylometry.net/user?username=dang) the 4th most likely
| user is not bold whereas the 16th is?
| costco wrote:
| Say you see user2 listed in bold on user1's page. That means
| that user1 is also in user2's top 20 users. In my experience it
| is often an indicator of a good match (but not always).
| operator-name wrote:
| Huh, that's a somewhat non intuitive property.
| silasdavis wrote:
| It is a bit, but if stylometric equality was a thing you'd
| expect it to be symmetric, so if stylometric similarity is
| a thing....
| Trouble_007 wrote:
| Nice work! Thank you, of course I plugged in the obvious HN
| usernames
|
| Edit to add;
|
| Would be nice to have the
| https://news.ycombinator.com/user?id=username links included.
| Trouble_007 wrote:
| And perhaps rounding to 3 or 4 decimal places?
| iHateStylo wrote:
| mysterydip wrote:
| I was curious to use this on myself to see if anyone writes like
| me. Closest was a .51 confidence, so I guess not?
| harryvederci wrote:
| My runner-up has a rating of 0.42378790667730715
|
| C'mon guys, work harder. That's not even close! :-D
|
| Btw, I myself am only at 0.9999999999999999 so I guess I need to
| work harder at being myself.
| iambateman wrote:
| This is cool!
|
| If an account returns a high score for many accounts, does that
| also mean they're relatively less original in style?
| medellin wrote:
| How much writing do you need to analyze results? Would changing
| account every X sentences eliminate this?
| costco wrote:
| Current minimum is 10000 characters. In my own tests accuracy
| was still pretty good at 3-5000 but I instituted the 10000
| minimum to reduce false positives. Yes it would, if you read
| the advice page on avoiding detection that is one of the things
| I recommend. Unfortunately HN moderators do not really like
| that.
| rmelhem wrote:
| nice one. are you using gpt3 under the hood?
| costco wrote:
| I'm not that smart - my site is basically just doing some
| calculations on word frequencies. You can read
| https://academic.oup.com/dsh/article-abstract/17/3/267/92927...
| and
| https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...
| and https://news.ycombinator.com/item?id=33755898 for more
| information.
| dunham wrote:
| I am curious whether it could pick GPT3 out of the crowd.
| ghaff wrote:
| As you mention on the site, you don't do punctuation. But I'm
| guessing there are some pretty good fingerprints like:
|
| two spaces after a period
|
| Whether someone uses an em-dash/single hyphen/double hyphens
| (which may correspond to house style they're used to)
|
| Whether they use semi-colons
|
| (Presumably harder) but consistent substitutions like loose
| for lose, break for brake, etc.
|
| Use of accents
| sillysaurusx wrote:
| Don't sell yourself short. Simplicity is smart. It's
| astonishing how often the simplest thing turns out to be
| exponentially more effective than the so-called smart thing.
|
| I can't get over how phenomenal this is. Please put every one
| of your side project ideas into production!
| isoprophlex wrote:
| Simplicity is the greatest form of sophistication! Great
| work!
|
| One small nit from a user experience point of view..: it'd be
| easier on the eyes if you just truncated those cosine
| similarity scores (or whatever score you're using) after the,
| say, 5th digit. Showing the entire float is kinda messy to my
| eyes.
| Dma54rhs wrote:
| It's easy to write complicated systems, it takes a genius to
| make it simple.
| rmelhem wrote:
| cool and thanks for the clarification. i ask that mainly
| because of the request limit of openai, which is something
| that makes many scalable ideas unfeasible
| godisdad wrote:
| Can we find Satoshi with this?
| thisisnotapipe wrote:
| Cardano founder Charles Hoskinson believes that only one person
| fits the profile of the mysterious Bitcoin creator, Satoshi
| Nakamoto.
|
| In a surprise Ask Me Anything (AMA) session on YouTube,
| Hoskinson reveals that he has narrowed down his search to one
| person who he believes is the only individual that fits the
| part.
|
| "I've been very vocal lately on this. I think that first, it
| doesn't matter but second, it's probably Adam Back. If you look
| at the preponderance of the evidence, Occam's razor applies and
| the most likely answer usually is, and there's no mystique or
| magic there but he just fits the profile. You're looking for
| somebody who's in their 40s to 50s who created Bitcoin in 2008.
| That would fit Adam. English education, grammar, all that
| stuff, the right computer science background, exactly the right
| credentials you'd look for. You probably can get pretty far
| with code stylometry towards validating that."
|
| (from https://tokenpost.com/Charles-Hoskinson-Believes-One-
| Person-...)
| drpancake wrote:
| A few people have tried that e.g.
| https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184...
| crecker wrote:
| See also: https://serhack.me/articles/unveiling-anonymous-author-
| stylo...
| costco wrote:
| That post was actually what motivated me to make this. I'm on
| your email list :)
| crecker wrote:
| WOW! It's such a pleasure for me
| neodypsis wrote:
| How do you protect yourself from impersonators?
| nvr219 wrote:
| I only got 0.9999999999999992 for myself :(
| noncoml wrote:
| Naturally Born Imposter
| bumble_bee900 wrote:
| It's accurate enough that I had to create a new account now :)
|
| I guess it's difficult to evade it as the word frequency
| certainly catches all about the countries I frequently refer to,
| programming languages, interests etc.
| p4bl0 wrote:
| It's funny that I only match at 0.9999999999999982 with myself
| while all other usernames I tried matched with themselves at 1.0
| ^^.
| srean wrote:
| https://theuijunkie.com/myth-or-fact-did-charlie-chaplin-los...
| sillysaurusx wrote:
| Wow. This gives a lot of false positives, but it found all ~10 of
| my old accounts over the years.
|
| The most interesting thing is that my writing style changed
| pretty drastically since a decade ago. Searching for my oldest
| account matches my earliest usernames, whereas searching this
| account matched the rest.
|
| The details of the algorithm are fascinating:
| https://stylometry.net/about Mostly because of how simple it is.
| I assumed it would measure word embeddings against a trained ML
| model, but nothing so fancy.
| FormerBandmate wrote:
| > sillysaurus3
|
| > sillysaurus2
|
| Tbf a human could have found a bunch of them relatively easily
| lettergram wrote:
| Frankly similar to how I was doing it back in 2018 (when you
| and I chatted about it on HN lol)
|
| https://news.ycombinator.com/item?id=17944293
|
| The approach I took was a bit different, but also no ML
| required.
|
| The real trick is pruning and going cross platform. There are
| around 100k active HN accounts (meaning posts a few times a
| year), maybe 200k if you count at least one post a year. But
| <10k that post weekly.
|
| It's a very small space to try to compare so simple methods
| will work fine.
| costco wrote:
| Exactly. HN emphasizes long-form posts much more than other
| forums which makes the commenters here very susceptible to
| this kind of analysis. Plus you can fit every single HN
| comment in RAM on a mid tier gaming laptop so it's even
| easier. I was trying to think of applications of this kind of
| data and the only thing I could think of was moderation
| tools/detecting ban evaders but what you've done seems much
| more profitable lol.
| echelon wrote:
| It works like a charm for me too.
|
| I put in my username and found my pre-echelon alt,
| possibilistic.
|
| (Echelon was taken when I registered possibilistic, but it must
| have been unused and dropped.)
| costco wrote:
| Yeah top 20 is a little excessive because in my own tests I
| found that top 20 is only marginally more accurate than top 10.
| You can get a more academic explanation here:
| https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...
| I was amazed too because it seemed too easy!
| sillysaurusx wrote:
| FWIW, top 20 was necessary for mine. The bolding was a
| brilliant move. Several of my accounts were ranked 10-20, but
| popped out due to the bolding.
| justusthane wrote:
| What does the bolding indicate?
| sillysaurusx wrote:
| The explanation is here:
| https://news.ycombinator.com/item?id=33755466
|
| As far as I'm concerned, it's the killer feature of the
| app. The top 20 results may be noisy, but the bolded
| results have a signal to noise ratio close to infinity.
| costco wrote:
| The funny thing is that I thought of it while eating
| dinner last night :)
| jsnell wrote:
| The precision of the bolded results looks like maybe 30%
| to me. Significantly better than the non-bolded, but
| nowhere near perfect precision.
| costco wrote:
| False positives become an increasingly difficult problem
| the more and more potential authors you introduce. If I
| had written a fancier model it probably wouldn't be as much
| of a problem but what can you do.
| jsnell wrote:
| Yes, this wasn't a criticism of the tool. It is crazy
| good.
|
| But I don't think people should be making the assumption
| that bolded results are definite alts, which sillysaurus'
| comment reads like.
| sillysaurusx wrote:
| Hmm, that wasn't my intent. I see this tool as a
| recommendation engine more than a doxxer. By "signal to
| noise ratio close to infinity," I meant that if you visit
| one of the bolded accounts, they'll probably sound a lot
| like you.
|
| It's one of those ideas that makes the tool substantially
| more effective, yet never would've occurred to me. It's
| like the simplicity of pg's "a plan for spam" algorithm:
| deceptively simple, but (like scrubbing dishes with
| fingers) works really well.
| dragonwriter wrote:
| Of my top 20, 19 are bold, all are above 0.6, and I have
| no alts.
| loeg wrote:
| I have 7 bolded names (0.53-0.62) in the top 20 list, and
| none are alts of mine.
| morsch wrote:
| I'm one of them and I can confirm. But then again that's
| what I'd say if I was.
| loeg wrote:
| Hi style-adjacent friend :-). Just briefly looking at
| your recent comment history, we seem to find different
| kinds of articles interesting, but maybe have a similar
| writing style.
| ghaff wrote:
| Pretty much the exact same. (I do have a throwaway
| account but I rarely use it and it probably hasn't been
| used enough to qualify.)
| User23 wrote:
| I'd figured it would be some kind of n-gram frequency analysis.
| Would be interesting to code that up and compare.
| costco wrote:
| It is. The description on the about page is a little
| simplified but basically I look at the most common word and
| character ngrams of size 1,2,3 (200 each), put all the
| frequencies in an array and then compare to all the other
| users with
| https://scikit-learn.org/stable/modules/generated/sklearn.me....
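|
| A minimal sketch of the approach described above, not the site's
| actual code. Assumptions: the link above is truncated, so cosine
| similarity is used because the scores are later described as
| cosine similarity scores; the function names, the tokenizer and
| the toy corpus are hypothetical; only the standard library is
| used.

```python
import math
import re
from collections import Counter


def word_ngrams(text, n):
    """Yield word n-grams as space-joined strings (hypothetical tokenizer)."""
    words = re.findall(r"[a-z']+", text.lower())
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def char_ngrams(text, n):
    """Yield character n-grams of the lowercased text."""
    text = text.lower()
    return (text[i:i + n] for i in range(len(text) - n + 1))


def count_features(text):
    """Count word and character n-grams of size 1, 2 and 3, keyed by (kind, n)."""
    return {
        (kind, n): Counter(fn(text, n))
        for kind, fn in (("word", word_ngrams), ("char", char_ngrams))
        for n in (1, 2, 3)
    }


def build_vocabulary(corpus, top_k=200):
    """Keep the top_k most common n-grams per (kind, n) across all users."""
    totals = {}
    for text in corpus.values():
        for key, counts in count_features(text).items():
            totals.setdefault(key, Counter()).update(counts)
    return [(key, gram) for key in sorted(totals)
            for gram, _ in totals[key].most_common(top_k)]


def profile(text, vocabulary):
    """Relative frequency of every vocabulary n-gram in one user's text."""
    feats = count_features(text)
    sizes = {key: sum(counts.values()) or 1 for key, counts in feats.items()}
    return [feats[key][gram] / sizes[key] for key, gram in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented toy corpus: one string of concatenated comments per user.
corpus = {
    "alice": "I think that the thing is, honestly, it works fine for me.",
    "alice_alt": "Honestly, I think that the thing is fine, and it works for me.",
    "bob": "lol nope!!! rust > go, fight me xD",
}
vocab = build_vocabulary(corpus)
profiles = {user: profile(text, vocab) for user, text in corpus.items()}
for other in ("alice_alt", "bob"):
    print(other, round(cosine(profiles["alice"], profiles[other]), 3))
```

| On this toy data the alt scores higher than the unrelated user;
| real comment histories are far noisier.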
| User23 wrote:
| Cool, I only skimmed the description maybe I needed to read
| it more carefully.
|
| Have you considered doing rune rather than word ngrams? I
| can imagine that might be prohibitively expensive, but I
| really don't know. I did something like that long long ago
| in C for automatic document language detection. It was
| quite accurate.
| throwaway5434q wrote:
| Wow. This is insane, it found my old accounts. So throwaway
| obviously (because I'm a bit of an asshole) but this really is
| amazing. It also highlighted another account that's not me, but
| looking through their comments i don't see any resemblance to me
| either.
| stavros wrote:
| Oh wow, it's really sure that I'm stavrosk, which I am:
|
| https://stylometry.net/user?username=stavros
|
| The next person is 30% less certain, that's huge! This would
| basically identify any alt I might have with near certainty.
| jvolkman wrote:
| stavrosk doesn't have any posts/comments? What's it using to
| match?
| stavros wrote:
| It's my old username.
| costco wrote:
| Huh... seems there are some inconsistencies between what's
| presented on news.ycombinator.com and the Firebase API.
| Glad it matches for you though :)
| stavros wrote:
| I guess they just didn't go back and reparse, not a big
| problem. I don't think people change their username
| frequently :P
| rogual wrote:
| Funny thing is, it thinks I'm you, but it doesn't think you're
| me!
|
| https://stylometry.net/user?username=rogual
|
| I'd have thought this stylometry thing would be commutative.
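|
| A small sketch of why this can happen, assuming the site ranks
| by cosine similarity and shows only the top k matches: the score
| itself is symmetric, but "is in the other's top k" is not. The
| vectors and names below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(user, vectors, k):
    """Names of the k users most similar to `user`, best match first."""
    scored = [(cosine(vectors[user], vec), name)
              for name, vec in vectors.items() if name != user]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def unit(degrees):
    """Unit vector at the given angle, used to place toy users in 2-D."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))


# "a" sits alone; "b" sits in a small cluster with "c1" and "c2".
vectors = {"a": unit(0), "b": unit(40), "c1": unit(45), "c2": unit(50)}

print(top_k("a", vectors, 2))  # "b" is a's closest match
print(top_k("b", vectors, 2))  # b's list is filled by its closer neighbours
```

| Here "b" is the best match for "a", yet "a" does not make the
| top 2 for "b", because "b" has closer neighbours of its own.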
| ed25519FUUU wrote:
| Very cool! And really a shame that you're not allowed to delete
| an old alt account or comments on HN! It follows you forever
| apparently.
| Arathorn wrote:
| It found my old account (ara4n; i lost the password) at 0.63.
| More amusingly it found my cofounder too, who hardly ever posts
| here (at 0.48)
| thot_experiment wrote:
| Maybe this is a good tool to find new friends. :P
| pkos98 wrote:
| Sorry dang, aka sctb: https://stylometry.net/user?username=dang
| Macha wrote:
| In this particular case, it seems to be picking up the stock
| moderation responses as it looks like sctb was a moderator
| account until 2019.
| alpacabag wrote:
| Semaphor wrote:
| My alt accounts (not really, all below 0.5) seem to also be
| European or German Firefox users. Good for us ;)
| nr2x wrote:
| Honeypot to see what accounts are tested in sequence?
|
| ;-)
| costco wrote:
| I turned off nginx logging if that makes you feel any better.
| Of course there's no way for you to verify that because I'm
| just a random guy on the internet but I will tell you that I am
| a civic minded citizen who is concerned about privacy and the
| Internet.
| nr2x wrote:
| Only half kidding, but if I were state intel it's what I'd
| be doing. :D
| atum47 wrote:
| at what threshold is it considering alt account?
| costco wrote:
| There is no threshold. This site does not make any call as to
| whether a user is an alt or not. It just gives the users with
| the most similar word choice and from there it is up to you to
| decide (is there a very specific detail that both accounts
| mention, do they post at similar times, etc). I will say bolded
| accounts are substantially more likely to be alts though. But
| obviously it is not guaranteed that every user has an alt.
| F_r_k wrote:
| Found my phone account; I'm quite impressed, really !
| ufmace wrote:
| I wonder what's a reasonable threshold for "probably the same
| person". I've never had an alt on HN, and when I searched myself,
| it found 3 other users above 0.6, none of whom I've ever heard of
| before.
| [deleted]
| costco wrote:
| If it's >0.9 you can almost guarantee it's an alt but I've
| seen certain matches at 0.6. The problem is writing styles
| change over time. Another idea I had was converting the scores
| which are just cosine similarity scores into percentiles (so
| 0.99 would be 99th percentile of certainty) to make them more
| human interpretable.
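|
| One way the percentile idea could look, as a sketch only: rank a
| raw score against a reference distribution of scores. The
| reference values and the function name are invented; a real
| reference set would come from many user pairs.

```python
from bisect import bisect_right


def percentile_rank(score, sorted_reference):
    """Percentage of reference scores that are <= score."""
    return 100.0 * bisect_right(sorted_reference, score) / len(sorted_reference)


# Invented reference scores; must be sorted for bisect to work.
reference = sorted([0.21, 0.25, 0.30, 0.33, 0.35, 0.41, 0.45, 0.52, 0.60, 0.91])

for raw in (0.10, 0.60, 0.95):
    print(raw, percentile_rank(raw, reference))
```

| A percentile says how unusual a score is relative to the
| reference set, not how likely two accounts are the same person.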
| forgotpwd16 wrote:
| >The problem is writing styles change over time.
|
| Will be interesting if we could plot the writing style
| divergence over time.
| throwdbaaway wrote:
| I got matched with my old account with a score of only 0.45
| bonzini wrote:
| The people at 0.4-0.6 with me do share some interests. That's
| cool on its own.
| throwup wrote:
| I make new accounts every so often and the accounts of mine
| that it found have a score of around 0.3. I'm not actively
| trying to defeat stylometry but it's possible I just have a
| particularly unremarkable writing style.
| xwolfi wrote:
| Well I must be stereotypical myself because it found me at
| 0.8 !
| MBCook wrote:
| I have no alts. The highest match for me is about 0.66.
| dotancohen wrote:
| Interesting. The highest non-me account is under 0.4 on my
| page. I do not believe that I have such a unique writing style
| - especially since half my posting is on mobile and therefore
| possibly slightly different than my desktop posts.
| dwringer wrote:
| My closest is 0.4879. I know I tend to be wordy but I thought
| I had a pretty generic style as well. This is definitely a
| fascinating demonstration.
| drdec wrote:
| Feeling better about my high of 0.49 now
| pyb wrote:
| 0.6 is not high enough to indicate an alt
| Yeahsureok wrote:
| On the how to avoid section: Isn't running comments through a
| randomised translator a few times then back considered a
| countermeasure also?
|
| Also think it's probably poor form to list users as examples
| without their permission.
| costco wrote:
| > On the how to avoid section: Isn't running comments through a
| randomised translator a few times then back considered a
| countermeasure also?
|
| Yes.
|
| > This may be out of line but isn't pg on here with a different
| username, Levenshtein distance of one that's not included? Or
| is that just a very motivated 13yo account who writes a lot of
| admin-esque comments.
|
| What other pg account are you referring to? I want to see it so
| I can see what my algorithm missed.
|
| > Also think it's probably poor form to list users as examples
| without their permission.
|
| You're right. I'll remove that - I just wanted some examples
| especially for people on phones who don't feel like typing.
| Thanks for the feedback.
| jacooper wrote:
| > However, using automated methods like machine translation
| services do not appear to be a viable method of circumvention.
|
| https://www.whonix.org/wiki/Stylometry
| Lichtso wrote:
| I wonder how much this can be improved if metadata is taken into
| account as well. Especially the distribution of common post dates
| and times modulo a week, which also exposes in which timezone
| somebody probably lives.
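|
| A rough sketch of the "post times modulo a week" feature,
| assuming each comment has a Unix timestamp in UTC. The function
| name and sample timestamps are hypothetical; the resulting
| 168-bin histogram could be compared between accounts like any
| other feature vector.

```python
from collections import Counter
from datetime import datetime, timezone

HOURS_PER_WEEK = 7 * 24


def hour_of_week_histogram(timestamps):
    """Share of posts in each hour of the week (bin 0 = Monday 00:00 UTC)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[dt.weekday() * 24 + dt.hour] += 1
    total = sum(counts.values())
    return [counts[h] / total if total else 0.0 for h in range(HOURS_PER_WEEK)]


# Invented sample: two Saturday-evening posts and one Monday-morning post.
sample = [
    datetime(2022, 11, 26, 18, 3, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 12, 3, 18, 40, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 11, 28, 9, 0, tzinfo=timezone.utc).timestamp(),
]
hist = hour_of_week_histogram(sample)
busiest = max(range(HOURS_PER_WEEK), key=hist.__getitem__)
print(busiest, hist[busiest])
```

| Long quiet stretches in such a histogram hint at sleeping hours
| and therefore at a plausible timezone, though only loosely.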
| 2OEH8eoCRo0 wrote:
| This found an old account that I forgot I even had but with a lot
| of false positives. Neat!
| bscphil wrote:
| The scary thing is that once you have this data, finding HN
| matches for individual targeted users on other sites becomes
| trivial, even if those sites are harder to scrape. I bet most
| people here have an anonymous Reddit account, for example. If you
| wanted to know who was behind a particular Reddit account, you
| could feed it into something like this and compare the results
| with HN, where accounts are less likely to be anonymous. Or build
| a database based on blogs, Github comments, etc.
|
| Also, since this uses only word frequency, there are probably
| relatively easy improvements to make that would make it even more
| powerful, like looking at particular runs of words that are
| unique. Some expressions or figurative language only show up in
| combinations of words, and tend to be highly style specific.
| faeriechangling wrote:
| Thus proving the only actually anonymous community in practice
| is 4chan, and that's why it's so toxic.
| sbierwagen wrote:
| If you define "toxic" as "people disagreeing with you", sure.
| That was what the entire internet was like until maybe 2005.
| philosopher1234 wrote:
| "People disagreeing with you" describes almost none of the
| conversation on 4chan
| ben_w wrote:
| I'm old enough to remember when 4chan was _self
| identifying_ as the Internet's hate machine, before xkcd
| referenced it as such: https://xkcd.com/591/
|
| Sometimes people insist that's all role-play and irony;
| others insist that if it ever was, it certainly isn't now.
|
| But regardless, I remember pre-2005, and it wasn't all like
| what I saw the two times I looked at 4chan. Bits were. Bits
| were _much worse_. But mostly, _mostly_, people were
| kinder... at least, unless political tribalism came up.
| costco wrote:
| I could have used a part of speech tagger, looked at time of
| day a user posts, capitalization, spelling errors, etc. From
| what I understand the state of the art is lightyears ahead of
| this, there are even companies with actual linguists who will
| act as expert witnesses in court to say stuff like "we can say
| with 95% certainty that xyz authored this email." Honestly it's
| kind of scary. There are papers that talk about cross platform
| authorship attribution, one I think did it with Twitter,
| Blogspot, G+ and had pretty good results.
| saurik wrote:
| It would be convenient if the usernames linked to the comment
| pages on Hacker News (to avoid having to copy/paste and URL hack,
| which is made even slightly more annoying because for some reason
| when I tap and hold the usernames to copy them your markup--I
| haven't looked at why yet--is causing an extra space character to
| get copied on the left).
| honkler wrote:
| Not today.
|
| You fail, I win.
| costco wrote:
| Nice. Just out of curiosity are you taking any countermeasures
| or varying your writing style across accounts in any way?
| psychphysic wrote:
| My second closest match was 0.35 but searching people where
| they have matches 0.5-0.75 I suspect that's mostly to do with
| number of posts leading to better statistics.
| soneca wrote:
| I have two accounts. This one, "soneca", that is my first one and
| most active by far, and another one that I use sometimes mostly
| for Show HN and few comments.
|
| When I searched the other one, "soneca" was the first guess, with
| 0.4.
|
| But when I searched "soneca", the other one was not in the top
| 20.
| 00F_ wrote:
| ive had maybe a hundred throwaway accounts on HN over the past
| ten years. generally, i make an account, say something that is
| apparently wildly offensive to someone else, get flagged and
| down-voted and then muted or hell-banned. then i make another
| account because i never did anything wrong and start the process
| over again. ive emailed the admins, tried to reason with the
| admins, it never does any good. the power is held by power-users
| who flag people -- most of the power of an admin at the end of
| the day but without any of the accountability. as long as they
| are following the mainstream dogma, its all good.
|
| anyway, this app was able to identify a lot of my accounts. but a
| lot of the matches werent me. bold matches were almost all me.
| but i know there are many more matches than those that were
| listed. it mainly showed my most recent accounts.
|
| i think most people would get a sick feeling in their stomach if
| they tried this app. i dont think people are prepared for a world
| where you can type someones name into an app like this and
| produce everything ever recorded online that was created by that
| person. not only this but everything highlighted and summarized
| to answer any question about that person. this is what advanced
| ai will bring us. an information implosion where the planet-sized
| ocean of data that is just floating all around us suddenly and
| violently coalesces into the objects of our new societal
| calculus. violent is a good word. and this is just the change
| that one can see coming with ai.
| costco wrote:
| You are definitely right. Part of the reason I chose the 10,000
| character minimum was so that people using throwaways in the
| true sense would be entirely excluded. I don't plan on keeping
| this up forever and I too would not feel comfortable if this
| was deployed at scale.
| ayewo wrote:
| Would you be open to open sourcing the code when you decide
| to shut down the service?
| stupendous_luck wrote:
| You really don't need advanced AI to do it. Just a bunch of
| scrapers and some run of the mill statistics. And guess what,
| it's been done by many companies already. They just don't care
| to create such a site.
| moneywoes wrote:
| What algorithm is being used?
| interroboink wrote:
| It's described here: https://stylometry.net/about
| rglover wrote:
| It's moments like this I'm proud to have my insanity on full
| display without obscurity. Was surprised to see a bunch of ~30%
| matches despite not having any alts.
| kfichter wrote:
| Does anyone here have a reasonably wide variety of similarity
| ratings? I'd love to see the difference between a 0.2 and a 0.8
| for the same account.
| peacelilly wrote:
| This is creepy.
| noncoml wrote:
| I think the word you are looking for is uncanny
| jallasprit wrote:
| Most likely candidates:
| pg: 1.0
| montrose: 0.604073065373204
| mattmaroon: 0.5900372458160795
| natsu: 0.5519832271289953
| rauljara: 0.5418566694533273
| waterlesscloud: 0.5378996309342633
| damoncali: 0.5292014150349463
| gruseom: 0.5290151637991445
| kemiller2002: 0.5254174524920762
| jfengel: 0.5231938496089998
| jamesaguilar: 0.5229081613163672
| houseabsolute: 0.5219738531025365
| danssig: 0.5195368367601849
| austenallred: 0.519343009683366
| loewenskind: 0.5177030083877397
| baguasquirrel: 0.5153841099708854
| asdfasgasdgasdg: 0.5146704002447524
| aptwebapps: 0.5144149629369845
| allenbrunson: 0.512802806408646
| danielweber: 0.5123620795710832
| [deleted]
| andsoitis wrote:
| we leave fingerprints everywhere
| throwawayhghcj wrote:
| I'd like to request the author takes this offline please until
| the implications can be thought through.
|
| This is breaking anonymity that people incorrectly thought would
| not be revealed.
|
| For some it might be awkward, others it might be quite
| problematic.
| s3000 wrote:
| This is nothing new, e.g:
|
| Analyzing stylistic similarity amongst authors
|
| https://news.ycombinator.com/item?id=10050603
|
| http://markallenthornton.com/blog/stylistic-similarity/
|
| 37 points by lingben on Aug 12, 2015
| kaba0 wrote:
| I would agree with you but the genie is out of the bottle
| already. Nigh everyone can and could have reproduced these
| results, especially that archive.org and similar things exist.
|
| So, I don't think it causes any new harm, if anything it gives
| you future risk aversion.
| silasdavis wrote:
| The top hit for me, though not a very high correlation (0.3 ish),
| is to my surprise someone I have met. I don't appear on their top
| 20 though.
| musicale wrote:
| > I made this site mostly to show how easy this is and how it can
| erode online privacy
|
| looks like it can indeed
|
| > Here are some frequent HN commenters: (EDIT: Removed due to
| privacy concerns)
|
| How surprising that someone might object to being included in a
| demonstration of the erosion of privacy!
|
| Is the site opt-in or opt-out?
| Aachen wrote:
| I doubt they asked 78k users for permission when there's no
| standardized way of reaching out if you're not a site admin.
| It's opt out if anything.
| bee_rider wrote:
| You opt into making your writing publicly available when
| making posts on this site. I'm not sure what Ycombinator's
| user agreement* says about this, but it is pretty obvious
| that they haven't done anything to prevent it (and it isn't
| clear what they could do).
|
| * and I mean the author of the tool is here making posts, so
| I guess they have agreed to the TOS, but clearly someone who
| hasn't agreed to it could also make this tool and scrape out
| publicly available posts without agreeing to anything.
| StrangeDoctor wrote:
| Have you done any data analysis on distributions of similarity?
| How similar you'd expect any 2 people to be given English focused
| around tech? Or any other interesting stats you'd like to share?
|
| Very nice clean site, great work.
| Ros2 wrote:
| I interviewed years ago with someone who let me know that they
| use a pseudonym as an employee and their chosen name even got
| posted as the author for articles they wrote for the company.
| They were very concerned about their privacy.
|
| I know their blog, which is their HN username, and this tool
| found their other account.
|
| Perhaps ironically, this person stood out a lot because of this
| and I didn't forget them.
| zxcvbn4038 wrote:
| How long until this becomes the algorithm for a dating site?
|
| "Find hot single women who write just like you"
| nrp wrote:
| This seems like a great way to hire freelance copywriters/ghost
| writers too. I would absolutely hire someone I knew could match
| my tone well for writing generic unattributed copy.
| forgotpwd16 wrote:
| Wouldn't be surprised if dating sites already used similar
| algorithms.
| bornfreddy wrote:
| Wouldn't be surprised if most of the women on a specific
| dating site had very high similarity scores.
| davebillyhock wrote:
| This found an alt that I created specifically to see if I could
| write artificially to defeat this kind of analysis. I have seen
| other tools like it posted to HN, but none before had found that
| account. I guess I need to up my game.
| [deleted]
| CharlesW wrote:
| If you don't mind sharing, are you "writing artificially"
| purely in your head, or are you using techniques like
| intermediate translations?
| davebillyhock wrote:
| No mechanical means, but I have referred to a thesaurus
| occasionally. Mostly I tried to change my sentence structure,
| not just words. It requires actually thinking differently, in
| a way. Which makes it difficult to know how well I'm
| communicating.
| crtified wrote:
| I imagine this would be quite difficult in practise, due to
| all the subliminal factors behind a person's writing
| choices.
|
| For example, as somewhat illustrated here, your personal
| vocabulary is a kind of fingerprint. As you mention, using
| a thesaurus can somewhat alleviate that, but if a thesaurus
| is only changing a small % of your words, then it will only
| have a suitably small % effect upon analysis.
|
| To go yet further might (I suspect!) entail methods such as
| directly lifting and using other people's sentences to
| convey your own thoughts. But even then, "your own thought
| patterns" are still informing the manner of the post, to
| some extent, so over time increasingly robust analysis may
| still find patterns to hook into.
| neodypsis wrote:
| I wonder if someone will come up with a Grammarly-like
| tool which you can feed with sample writings to help you
| increase/lower the similarity score of a new text you are
| writing.
| ruined wrote:
| didn't find a single one of my alts. nice
| costco wrote:
| I obviously don't expect you to help me but do they have at
| least >10000 characters written and are you varying your
| writing style in any way?
| paulpauper wrote:
| Inserting random Unicode blank, 1/4, 1/2, or zero space
| characters into your writing may help thwart it too, if you are
| paranoid
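|
| A caveat in sketch form: invisible or unusual space characters
| can be undone by a simple normalization pass before analysis.
| Whether this site does so is unknown; the function name and
| sample string below are hypothetical.

```python
import unicodedata


def normalize_spacing(text):
    """Drop format characters (category Cf) and map space separators (Zs) to ' '."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":  # zero-width space/joiner, BOM, soft hyphen, ...
            continue
        out.append(" " if category == "Zs" else ch)
    return "".join(out)


# U+2005 four-per-em space, U+2002 en space, U+200B and U+200D are zero-width.
obfuscated = "this\u2005is\u2002my\u200b usual st\u200dyle"
print(normalize_spacing(obfuscated))
```

| Dropping every Cf character is blunt: it would also break emoji
| sequences that rely on the zero-width joiner.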
| UncleEntity wrote:
| Huh, that's how I signal my KGB handler...
| lifeisstillgood wrote:
| How much should we fear de-anonymisation ?
|
| A lot of discussion on the thread is over "how can we prevent
| this". I would like to know why should we not embrace this and
| similar technologies?
|
| The benefits in my view are large - online behaviour tracks back
| to real life - and epidemiologically speaking the value of
| millions of test subjects across every question is invaluable -
| from traditional medicine to "mass psychology recommendations"
|
| I can guess some downsides (hiding from abusive exes) but am
| interested in studies, surveys, reports etc - any HN thoughts
| welcome
| headhasthoughts wrote:
| What could possibly be the harm in allowing people to harass
| others based on posts they made decades ago? What could
| possibly be harmful in making a person who for whatever reason
| has changed their online identity easier to track? What could
| be remotely harmful about allowing Marlboro to find the
| accounts of ex-smokers? What could be the harm in tracking
| underaged users site by site?
|
| I'm sure this is completely harmless and will not harm society.
| rejectfinite wrote:
| >online behaviour tracks back to real life
|
| This is good to you?
|
| Okay, let's just make it like China or SK where your login is
| your citizen ID and if you write bad things the bad word police
| will take you away.
|
| Also, no, I have no alts.
| lifeisstillgood wrote:
| So I am asking because my views are only challenged inside my
| own head, hence the need for external thoughts.
|
| But firstly the "governments will come and do bad things"
| argument - yes this is clearly and obviously a major problem
| - but not one solvable by technology in any way. Fixing
| violent dictatorships is an IRL problem - one that requires
| enormous effort and sacrifices (see Ukraine for obvious
| example). We cannot pretend that a browser extension or a
| ground up rewrite of Twitter will defeat Putin or would have
| stopped Hitler.
|
| As for "free" countries (something like 120+ have open free
| elections), we still have online abuse for voicing opinions
| that some people don't like (anything from pro/anti Trump to
| LGBT and bitcoin etc). Those are real consequences but rarely
| government inspired and honestly I suspect we need better
| support for police in prosecuting such things - I mean a
| death threat is a death threat.
|
| In general my view seems to be we should have the same
| protections online as we do offline - and if those
| protections are "in theory only" that requires us to use our
| voting and other political power to change it - not to
| obfuscate IP addresses or so on.
|
| The upside of tech is so great it is worth spending IRL to
| defend against the downsides
| rejectfinite wrote:
| I am of the generation and mindset that online abuse is not
| real. Straight up. Log out, turn off the screen and watch
| Netflix, take a walk and calm down, block the offending
| user. It's not real.
|
| >I suspect we need better support for police in prosecuting
| such things
|
| We do see that! But mostly people on Facebook. Here we have
| had judgements of people who posted threats on Facebook
| because it is tied to your real name.
|
| And yes, abuse is part of the "fun". Under your system, my
| 10-year-old League and CoD chats would have me locked up.
|
| >I mean a death threat is a death threat.
|
| Is it? I would find it more concerning if someone on the
| street tells me he is going to kill me than a kid on xbox
| live.
|
| NOW there is a difference in systematic stalking and
| harassment online if I would get bombarded with DMs and
| messages to kys. I don't know how to solve. But a one-off
| comment is NOT equivalent. Then it feels like I'm just old?
| At 31? Is it really so serious?
| femto113 wrote:
| Fear it happening or fear its consequences? Doxxing already
| happens all the time, but the main tools are things like
| account names or image search, this sort of tool could take it
| to a new level. A simple experiment would be to run this same
| algorithm against another site (say Twitter or Reddit) and see
| if it can reliably pick out the same peoples' accounts there.
| Once anyone on the internet can quickly/easily draw that sort
| of connection it would require incredible diligence to avoid
| de-anonymization while still maintaining any sort of "real
| self" presence on the internet. How much we should fear the
| consequences probably depends a lot on how marginalized you are
| within your society, but since just revealing your gender is
| enough to invite harassment in many forums I'm not optimistic.
| CrypticShift wrote:
| Ingenious idea. At the very least, this is just about finding
| people who write like us, the same way we seek those with similar
| tastes (music...)
|
| How long before large commercial indexers start offering an
| efficient (AI based ?) stylometry to agencies and states ?
|
| wait... do you think the NSA is already doing this?
| A4ET8a8uTh0 wrote:
| They would be silly not to ( apart from creepish profiling of
| an entire globe population you also get to potentially identify
| bots ). We all have mannerisms that can easily 'betray us'
| online. I honestly thought my writing style is more unique, but
| as it turns out it is somewhat common.
| CrypticShift wrote:
| > I honestly thought my writing style is more unique
|
| You just showed another possible use case for this kind of
| tools: "How unique is my writing style ?"
| sitkack wrote:
| It isn't writing style, but more of phrase selection. If you
| lean on the same phrases (n-grams), then you will be very
| very close in a high dimensional space. Colloquialisms are
| the biggest tell, you should eschew them.
| woodruffw wrote:
| Stylometry is an old hat technique; you can assume that
| intelligence services around the globe regularly apply it.
|
| (Statistical stylometry is a little newer and more rigorous
| than manual stylometry, which essentially involved a human
| being's judgement call around the similarity of documents.)
| CrypticShift wrote:
| What about "deep learning" stylometry ?
| woodruffw wrote:
| I don't know, but it wouldn't surprise me if someone has
| tried to apply ML to stylometry. Statistical stylometry is
| already pretty effective, as demonstrated by this site.
| weinzierl wrote:
| I played a little bit with it and it is baffling how well it
| finds accounts of people that know each other in real life. So
| it's not only good for finding alternate accounts but could be
| used to find peer groups.
| [deleted]
| the_cat_kittles wrote:
| pretty cool- i think there should be a term for two accounts that
| have each other as the top most similar account. kinda sad i dont
| have one :(
| layer8 wrote:
| Stylotwins?
| philosopher1234 wrote:
| We're pretty close me and you -- closer than my actual alts
| the_cat_kittles wrote:
| hello friend! but... id never use an m dash
| philosopher1234 wrote:
| Well... I would never use a lowercase word after an
| exclamation point!
|
| ...Because I'm on mobile
| balls187 wrote:
| No alt, and the highest match is 0.36
|
| And that account's last several comments were flagged as dead.
|
| I'm a native speaker, but my english succcccks.
| rcarr wrote:
| One way to get around this legitimately would be by posting a lot
| of quotes/lyrics/excerpts and the like thus fooling the algorithm
| unless it had a way to filter them out
| Fnoord wrote:
| Cool stuff, thank you for sharing your findings!
|
| I don't do throwaway. I either post or STFU. I also STFU on
| darknet. It's why I found it fun to read/lurk on things like I2P
| back when it was new. And I know that on a pseudonymous account
| it is only a matter of time until it can be linked to another
| pseudonymous account. It would not surprise me if stylometry was
| used on Dread Pirate Roberts or the people behind The Pirate Bay
| or the people behind Wikileaks (Assange's sockpuppet accounts).
| Such can also have been used to verify afterwards instead of
| beforehand. Though with TPB since it was on clearweb an advanced
| adversary could have used correlation/timing attack to figure who
| wrote what.
|
| I'm having fun times recognizing other Dutch people through their
| usage of English language. For example, a distinctive word I see
| Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a
| red flag the person is native Dutch. I wonder if there are
| stylometry tools available for figuring if someone used physical
| vs touchscreen keyboard (I used Glider to write this post,
| spellchecker unavailable).
|
| And yes, organizations like secret service and police should use
| such tools as well. It is a known tool, why not use it for good?
| As with any tool, it can be used for good and evil. On HN this
| could be useful for the mod team (AFAIK nowadays only dang) to
| find banned people's sockpuppets. Cross-community could also be a
| fun project: find a HN user's Twitter or Reddit account. And I
| hope this method is also used to find Russian trolls on social
| media.
| ghaff wrote:
| Most people greatly underestimate the power of linkage attacks
| on anonymity. And it doesn't even take fancy ML. In the context
| of healthcare records, I like to trot out this 25 year old
| example of an MIT grad student and the then-governor of MA.
|
| https://ischoolonline.berkeley.edu/blog/anonymous-data/
| dvh wrote:
| Make a fundraiser and start doing it for other sites.
| costco wrote:
| It would be possible for Reddit because Pushshift.io archives
| all the comments there and Reddit is still pretty small. I'd
| probably need to make things a lot faster. Doing it on a
| specific subreddit would be very feasible. I'll think about it
| but I don't actually know if I really want to do that because
| for instance I've been banned from subreddits before but I
| don't want a ban from when I was 12 years old to follow me
| around forever because my writing style hasn't changed.
| Moderation is the most obvious application of this kind of
| software.
| rand_user_100 wrote:
| > I'll think about it but I don't actually know if I really
| want to do that because for instance I've been banned from
| subreddits before but I don't want a ban from when I was 12
| years old to follow me around forever
|
| Insightful that your personal experience and impact on you
| personally affects your decision. I invite you to think about
| the impact of the products you build in your CS career by
| putting yourself in the shoes of other people as well.
|
| Some products should not be built, even though it's easy to
| build them.
| DenisM wrote:
| How about this for countermeasure:
|
| As you're typing out a comment the software gives you a list of
| accounts you're becoming similar to. That way you can adjust your
| writing as you type.
| bornfreddy wrote:
| Sounds great, except there are many different similarity
| measures. Which one does the algorithm use?
| wizzwizz4 wrote:
| Why not all of them? Which metrics are closer would tell you
| which aspects of your writing you need to focus on.
| kaba0 wrote:
| Someone linked it in the thread:
| https://github.com/psal/anonymouth
| pessimizer wrote:
| Forget countermeasures, go covert. Write a comment, have the
| comment be rewritten before submission in order to resemble a
| targeted account.
| Bhurn00985 wrote:
| Just a heads up that for everyone who doesn't like to link their
| alt accounts, maybe not use this tool to see if it works.
|
| Unless the author would run this against all HN user accounts, no
| need to flag the ones "of interest".
| jl6 wrote:
| Rebrand it as a soulmate-finder?
| tomrod wrote:
| Is it weird that my rating is very low compared to alternative
| options? I have no alts, but I'm curious how similar others might
| write to me.
| JKCalhoun wrote:
| The asymmetry is interesting. I have no alts but of course it
| nonetheless reported accounts similar to mine.
|
| But running it on the account most similar to mine did not put me
| in _their_ top 20.
| sitkack wrote:
| I believe this is the
| https://en.wikipedia.org/wiki/Friendship_paradox
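The asymmetry described above is a general property of nearest-neighbour lists and does not need any special explanation: "B is closest to A" does not imply "A is closest to B". A minimal sketch, assuming each account could be reduced to a single number on a line (the names `points` and `nearest` and the three toy accounts are hypothetical, chosen only for illustration):

```python
def nearest(name, points):
    """Return the other account closest to `name` on a 1-D line."""
    return min(
        (other for other in points if other != name),
        key=lambda other: abs(points[other] - points[name]),
    )


# Hypothetical accounts placed on a line; "b" sits between "a" and "c".
points = {"a": 0.0, "b": 3.0, "c": 4.0}

closest_to_a = nearest("a", points)  # "b": distance 3 beats distance 4
closest_to_b = nearest("b", points)  # "c": distance 1 beats distance 3
```

Here "a" lists "b" first, but "b" lists "c" first. The same effect carries over to top-20 lists in a high-dimensional feature space.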
| throwboi123 wrote:
| That's why I always use throwaway :) everywhere. Reddit. HN.
| Twitter. Everywhere. I'll spam every site with my throwaways.
|
| Long live throwaways.
| kaba0 wrote:
| That's the point of this post, that you are not safe by
| throwaways at all, because all of your throwaways can be linked
| together purely by your textual style.
| silasdavis wrote:
| > imagine what a company with millions of dollars and a couple
| dozen PhD linguists could do.
|
| Could they do much better?
| aaron695 wrote:
| [deleted]
| spaniard89277 wrote:
| I changed my nickname so my employer can't find me here. I'm not
| amused by this.
| bee_rider wrote:
| If this basic implementation can catch you, I'd consider it a
| friendly reminder that changing your account name is not a very
| effective means of adding privacy.
| googlryas wrote:
| New account, then translate your comments to Spanish and then
| back to English using Google translate.
| aryc19 wrote:
| So what are some good tools to obfuscate style?
| setr wrote:
| Forget the alternate accounts -- if two users are close in style,
| there's a decent chance they should be friends. This is an HN
| friendship machine.
| AviationAtom wrote:
| Now I can find my HN doppelganger
| sillysaurusx wrote:
| Ha, gruseom shows up for pg, which is dang's old account. A
| worthy successor.
|
| This is a fascinating way to find similar HN users who aren't the
| same person. It's a surprisingly great recommendation engine. "If
| you like pg, you might also like..."
|
| Sure, the privacy concerns are valid, but the cat's out of the
| boot. Might as well enjoy the benefits.
|
| montrose is almost definitely pg. Someone who talks about ancient
| history, Occam's razor, VCs and startups, uses the phrase "YC
| cos" (relatively uncommon), etc.
| https://news.ycombinator.com/item?id=17112567
|
| Nicely done. One of the best hacks I've seen in a long time.
| costco wrote:
| > montrose is almost definitely pg. Someone who talks about
| ancient history, Occam's razor, VCs and startups, uses the
| phrase "YC cos" (relatively uncommon), etc.
| https://news.ycombinator.com/item?id=17112567
|
| I had this hunch too. It's either pg or someone trying really
| hard to be pg.
| roughly wrote:
| I mean, this is HN -
|
| > someone trying really hard to be pg
|
| describes half the site.
| asveikau wrote:
| > Someone who talks about ancient history, Occam's razor, VCs
| and startups,
|
| I think these are all common topics among HN readers and
| commenters.
| pyb wrote:
| Why would montrose be pg ? The correlation is not that high.
| Looks like a few people have picked up pg's mannerisms.
| VyseofArcadia wrote:
| > but the cat's out of the boot
|
| It's my first time hearing that variant. Usually it's "the
| cat's out of the bag" where I'm from.
|
| Do you mean boot in the UK sense, what Americans would call the
| trunk of a car? Or do you mean a sturdy piece of footwear?
|
| Obligatory xkcd https://xkcd.com/2390/
| sillysaurusx wrote:
| It's a little writing trick I learned from (I think) Orwell.
| Any time you're about to use a common metaphor, try to tweak
| it. You'll catch readers off guard, which piques their
| curiosity.
|
| It's a fun game, too. I wish I'd used "the cat's out of the
| hat," but I didn't think of it till later.
| UncleEntity wrote:
| Yeah, it's like shooting ducks in a barrel it works so
| well.
|
| Easy to overuse then people just get annoyed though...kind
| of like commas, I suppose.
| esfandia wrote:
| I like mixing metaphors, in this case "the cat's out of the
| tube". ("the toothpaste's out of the bag" doesn't work as
| well though)
| InGoodFaith wrote:
| What you are describing is also known as an eggcorn.
|
| https://en.wikipedia.org/wiki/Eggcorn
| operator-name wrote:
| That's neeto!
|
| The 2nd example also loosely falls under the
| classification of malaphor.
|
| https://en.m.wiktionary.org/wiki/malaphor
| sillysaurusx wrote:
| Thank you! I was trying to find the original essay I
| learned it from. I'm now pretty sure it was by Poe, but
| all I can remember is the main advice: avoid common
| metaphors.
|
| I vaguely remember one of the metaphors in the essay was
| about a chicken coop melting, or something like that. It
| was vivid enough to leave a big impression.
| ewilden wrote:
| I remember this being from Politics and the English Language
| (https://www.orwellfoundation.com/the-orwell-foundation/orwel...):
|
| " Dying metaphors. A newly invented metaphor assists
| thought by evoking a visual image, while on the other
| hand a metaphor which is technically 'dead' (e. g. iron
| resolution) has in effect reverted to being an ordinary
| word and can generally be used without loss of vividness.
| But in between these two classes there is a huge dump of
| worn-out metaphors which have lost all evocative power
| and are merely used because they save people the trouble
| of inventing phrases for themselves."
| sillysaurusx wrote:
| Thank you so much! That's the one.
|
| (It's remarkable how often a vague description can yield
| an HN comment with an answer from a clever sleuth like
| yourself. Much appreciated.)
| sdwr wrote:
| I love doing this too, it's fun to write.
| [deleted]
| kevmo314 wrote:
| There's someone (michaelmior if you're around!) with a false
| positive 0.46 match to me.
|
| Maybe we could be friends :)
| drc500free wrote:
| This is a super interesting tool for self reflection. Looking at
| the top 10 similar accounts to mine, it gives me an arms-length
| view of how other people probably interpret my tone.
|
| I appear to be a well-educated, over-confident know-it-all.
| pavlov wrote:
| My #3 match is cstross, and now I'm convinced that my life-long
| secret dream of being a successful sci-fi novelist is basically
| a matter of typing. (Ideas? Character development? Ruthless
| editing? Developing an audience? Having a publisher? What do I
| need of those when the Computer told me I'm practically a
| genius...)
| bee_rider wrote:
| I also enjoyed reading one of my style-partner's posts.
|
| The most noticeable similarity is that we both clearly have
| strong opinions about some things, and like to share
| information, but also like to be clear about our unknowns or
| opinions. So, lots of "sounds likes," "probably," "could be"
| and so on.
|
| The downside is, I guess, this could be seen as a bit weasel-
| word-y or indirect.
| seydor wrote:
| we must be a good match
| bhaney wrote:
| > I appear to be a well-educated, over-confident know-it-all.
|
| Don't we all?
| sdwr wrote:
| I hate us insufferable nerds. !
| closeparen wrote:
| That's what we all come to HN for...
| interroboink wrote:
| This is one reason why I like legal doctrines such as "beyond a
| reasonable doubt." Even a 0.9 match in a tool like this could be
| a coincidence, if there are millions of users. But that won't
| stop people from casually believing "aha it must be an alt
| account", based on some anecdata.
|
| It's so easy for something like this to be turned into a tool for
| a witch hunt, targeting innocents.
| dsr_ wrote:
| I like the way some usernames are only 0.9999999 correlated with
| themselves.
|
| Perhaps 6 or 7 digits is enough?
| rcarr wrote:
| This is somewhat similar to how they ended up catching the
| Unabomber. The FBI were literally at a dead end. They ended up
| posting one of his letters/manifestos in the paper, somebody
| recognised a turn of phrase the Unabomber used that was unusual
| and reported it as possibly being their brother; the FBI
| investigated the lead and it led them straight to him.
|
| Excerpts from wiki:
|
| > Before the publication of Industrial Society and Its Future,
| Kaczynski's brother, David, was encouraged by his wife to follow
| up on suspicions that Ted was the Unabomber.[91] David was
| dismissive at first, but he took the likelihood more seriously
| after reading the manifesto a week after it was published in
| September 1995. He searched through old family papers and found
| letters dating to the 1970s that Ted had sent to newspapers to
| protest the abuses of technology using phrasing similar to that
| in the manifesto.[92]
|
| > In early 1996, an investigator working with Bisceglie contacted
| former FBI hostage negotiator and criminal profiler Clinton R.
| Van Zandt. Bisceglie asked him to compare the manifesto to
| typewritten copies of handwritten letters David had received from
| his brother. Van Zandt's initial analysis determined that there
| was better than a 60 percent chance that the same person had
| written the manifesto, which had been in public circulation for
| half a year. Van Zandt's second analytical team determined a
| higher likelihood. He recommended Bisceglie's client contact the
| FBI immediately.[96]
|
| > In February 1996, Bisceglie gave a copy of the 1971 essay
| written by Ted Kaczynski to Molly Flynn at the FBI.[87] She
| forwarded the essay to the San Francisco-based task force. FBI
| profiler James R. Fitzgerald[98][99] recognized similarities in
| the writings using linguistic analysis and determined that the
| author of the essays and the manifesto was almost certainly the
| same person. Combined with facts gleaned from the bombings and
| Kaczynski's life, the analysis provided the basis for an
| affidavit signed by Terry Turchie, the head of the entire
| investigation, in support of the application for a search
| warrant.[87]
|
| https://en.m.wikipedia.org/wiki/Ted_Kaczynski
| googlryas wrote:
| It was actually his brother.
| fbdab103 wrote:
| So is the lesson you should have GPT rewrite your manifesto so
| as to obscure your personal idioms?
| CharlesW wrote:
| Or something purpose-built like Anonymouth
| (https://github.com/psal/anonymouth), although it seems to be
| both unique and dead.
|
| Also interesting:
|
| > _Ross Ulbricht aka Dread Pirate Roberts, the mastermind
| behind the infamous Silk Road site which served as a black
| market for drugs, weapons and fake documents was also well
| aware of the potential danger of stylometry being used
| against him. At the time of his arrest in a San Francisco
| public library, the FBI captured images of his laptop screen
| as evidence. Guess what he had bookmarked -- "Science of
| Stylometry."_
|
| https://medium.com/svilenk/the-case-for-anonymity-12db114f0c...
| rejectfinite wrote:
| I mean he used a forum account with an email that had his
| name in it.
| fbdab103 wrote:
| That's the problem - it only takes a single slip and it
| is recorded forever. Perfect opsec is an impossibly high
| bar if you are maintaining an active online presence.
| elteto wrote:
| Incredible! There was a very active throwaway account here a
| while back that I always enjoyed interacting with. I suspected
| the person had more than one account and this found one that is
| incredibly close, down to the topics.
| DrStrangeLoop wrote:
| I tried dang's old account (gruseom) expecting to see his dang
| account listed. Nothing. Tried dang, sctb (a previous admin) was
| listed as closest match.
|
| I wouldn't rely on these results
|
| https://stylometry.net/user?username=gruseom
|
| https://stylometry.net/user?username=dang
| pvg wrote:
| _I wouldn't rely on these results_
|
| You picked a user who posts a massive volume of repeat,
| template-y comments and found their former colleague who also
| posted piles of repeat, template-y comments, that being part of
| both of their jobs.
| DrStrangeLoop wrote:
| There are a few close matches to dang's style of template-y
| comments in the results. Afaik none of the listed accounts
| are Daniel.
|
| I picked dang as he is the figurehead of hn, and didn't want
| to inadvertently reveal some other user's identity.
| dragonwriter wrote:
| > There are a few close matches to dang's style of
| template-y comments in the results.
|
| At least the #1 close match (sctb) was a comoderator with
| dang, so they were kind of alts as the official voice of
| HN.
| woodruffw wrote:
| Neat work!
|
| Out of curiosity: do you filter sentences that begin with '>',
| indicating a block quote from another user? That might improve
| the accuracy a little here, if you don't already.
| costco wrote:
| Yep!
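A quote filter of the kind asked about here could look roughly like this. It is a sketch under assumptions, not the site's code: the function name `strip_quotes` is hypothetical, and it assumes each quoted paragraph arrives as a single line starting with '>' (as in raw comment text, before any line wrapping).

```python
def strip_quotes(comment):
    """Drop lines starting with '>' so quoted text is not counted
    as the commenter's own writing."""
    kept = [
        line for line in comment.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept).strip()


reply = strip_quotes("> quoted text from someone else\n\nMy own reply.")
# reply == "My own reply."
```

Without a step like this, a user who quotes another account heavily would drift towards that account's style profile.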
| jsnell wrote:
| After a few tries on boring accounts, I thought to try the
| account of somebody who was notorious for an incident outside of
| HN, and had a (deservedly) bad time at HN for a couple of years
| before the account went dark.
|
| And yeah, there's a bunch of high confidence (.6-.8) hits for
| that account, and from a quick browse of the comments of the
| recently active ones, they look really likely to be alts. Like,
| all three that I looked at had comments that made it very clear
| it was this person writing pseudonymously. (E.g. writing on their
| signature issue, and saying they couldn't go into more detail due
| to fear of self-doxxing; or somebody literally saying that the
| alt's claims reminded them of the public writings of the
| notorious guy years ago).
|
| Obviously I'm not naming the account, but this functionality
| turned out way creepier than I thought the moment I tried it on
| the account of somebody who has a reason to disassociate from an
| existing public persona, but still wants to participate here.
| Animats wrote:
| 0.6 isn't much. I have 3 matches above 0.6, and they're not me.
| 20 or so over 0.5.
| input_sh wrote:
| I get one 0.68 match, which... fair enough. It is an account
| I've abandoned some years ago, no secrets there.
|
| No other hits above 0.5, so I guess that either makes me
| pretty unique as a commentator or my English is broken in a
| unique way.
| jsnell wrote:
| That's why you manually evaluate the matches. And like I
| wrote in that comment, I did that manual eval, and these
| clearly are alts of that main account, not spurious.
| Narrowing down the pool of accounts you'd need to do this
| kind of manual evals for by a factor of 100000 is a pretty
| significant change in capabilities.
| tqi wrote:
| > quick browse of the comments of the recently active ones,
| they look really likely to be alts.
|
| Hmm isn't a spot check of comments somewhat tautological, since
| that is how the tool identifies alts (rather than something
| like IP address or time of day)? If this had been promoted as
| "find accounts with similar writing style to yours" would
| people immediately assume alts?
| margalabargala wrote:
| I would presume that OP is referring to the actual content of
| the comments. This just does stylometric analysis, which
| looks at word choice, but not what the arrangement of the
| words _mean_.
|
| If some accounts are found to be stylometrically similar, and
| then a visual inspection also shows them all stating similar
| opinions, that latter piece of data is a strong signal.
| thesz wrote:
| I keep no alternate accounts, but this tool reports best
| matches for me that appear to be Slavic or just Russian - and I
| am Russian. Best match score in my list is just above 0.5.
| There are some clearly alternate accounts on the list, their
| match scores with this tool are well above 0.7.
|
| It is probable that people of the same cultural origin will have
| a similar writing style and vocabulary. It is also probable that
| people of the same cultural origin relate to the world as a whole
| in similar ways: they like the same things and dislike the same
| other things.
|
| So, in my opinion, it is possible that you have found not only
| alternate accounts (score above 0.7), but accounts of people
| with same cultural origin (ones that are around 0.6).
| ricardobayes wrote:
| My highest was 0.41 and the person writes nothing like me. I
| guess I'm a unique snowflake after all.
| jrumbut wrote:
| I have a few in the low 0.5's and, honestly, they seem cool
| and I want to meet them.
| gilleain wrote:
| my second highest hit (ie, third in the list) is gwern at
| 0.45 who i'm fairly sure is not me.
| scarmig wrote:
| I was actually just looking at near hits for gwern and
| found what's almost definitely a defunct alt for him.
| gilleain wrote:
| Well is certainly NOT me, that's for sure.
|
| On an unrelated topic, I'm starting a service to write
| comments in the style of others to provide plausible
| deniability for other alt accounts. Rates negotiable.
| vbezhenar wrote:
| There're 19 other accounts this tool finds similar to me.
| Those are not my accounts. The scores range from 0.46 to 0.56.
| bbarnett wrote:
| You are fools, one and all! This tool's only purpose, is to
| tag people who use it!
|
| Now they know just who cares about which alternate
| accounts. They _know_!
|
| They freaking know, man!
|
| You have all fallen for their ploy. Fools!
| thesz wrote:
| I have no alternate accounts and visited the site out of
| curiosity, because I used to work in a domain like this.
|
| What I found was worth the visit. A notable number of the
| accounts with (relatively) high similarity to mine share at
| least one of my personal traits.
|
| Which is fascinating, to me.
|
| And I think it is worth noting for others - what and how you
| write can disclose who you are.
| TheOtherHobbes wrote:
| It knows my IP now.
|
| (Or does it?)
| neodypsis wrote:
| It offers no privacy policy, so can't tell.
| csa wrote:
| Fwiw, and as gp mentioned, > 0.7 seems more likely to be
| alt territory.
| costco wrote:
| I think people are sort of confused at what this tool is
| supposed to be which I will concede is partially my fault.
| The results of this tool are by themselves not indicative
| of having an alternative account. It generates the 20 most
| similar users for every single user on the site, regardless
| of whether they have an alt or not (there's obviously no
| way for me to know that for every single user). In your
| case further investigation would reveal that none of those
| accounts are yours.
| thesz wrote:
| It is a fun tool, I can assure you. It is just that people
| have found a use case you hadn't foreseen yourself.
|
| I assume your tool has an internal embedding for each user.
| Most probably it also uses cosine similarity for the search.
|
| Thus, I would like to suggest a feature: recognize simple
| arithmetic operations over users' embeddings, such as
| "thesz - 2 * patio11". It would make things even more fun:
| this way we can find users who are like me and very unlike
| patio11. Even simple additions and subtractions would
| suffice.
|
| (an idea is taken from properties of word2vec embeddings)
|
| Your tool is thought provoking. What I discovered with it
| made me think about my use of language and what other
| languages (body, imagery, etc) I use differently because
| of who I am. Which made me think about my favorite
| underrated superhero Cypher [1] - would his innate
| ability to understand languages make him the best detective
| ever?
|
| [1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics)
|
| Thank you!
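The embedding arithmetic suggested in this comment can be sketched as below. Everything here is an assumption for illustration: the site is not known to expose per-user vectors, the names `combine` and `nearest` are made up, and the four users and their three-dimensional vectors are toy data rather than real accounts.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(terms, vectors):
    """Evaluate a weighted sum such as 1*alice - 2*bob over user vectors."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for weight, name in terms:
        for i, x in enumerate(vectors[name]):
            out[i] += weight * x
    return out


def nearest(query, vectors, exclude=(), k=3):
    """Rank users by cosine similarity to the query vector, best first."""
    scored = [
        (cosine(query, vec), name)
        for name, vec in vectors.items()
        if name not in exclude
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Hypothetical style vectors for four made-up users.
vectors = {
    "alice": [1.0, 1.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
    "carol": [1.0, 0.0, 0.0],
    "dave": [0.0, 1.0, 1.0],
}

# "alice - 2 * bob": like alice, but pushed away from bob.
query = combine([(1, "alice"), (-2, "bob")], vectors)
ranking = nearest(query, vectors, exclude=("alice", "bob"))
```

With these toy vectors "carol" ranks ahead of "dave", because "dave" shares the dimension that "bob" dominates. Whether such arithmetic is meaningful on n-gram frequency vectors, as opposed to learned embeddings like word2vec, is an open question.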
| phreeza wrote:
| MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The
| top one seems an odd one out in that case?
| Aachen wrote:
| Usernames aren't random enough to be safe as a simple MD5.
| Perhaps with a strong bcrypt, but similar to PIN codes, it
| might be better to give partial information like "is the
| second character an ...", assuming nobody else made similar
| statements. Or give the first ~two hex characters of the
| hash, so that it would match 1/(16^2) = 1/256 of the usernames. I'm
| sure there's also a clever way for a zero-knowledge proof
| here, probably something with diffie-hellman using the name
| as your random integer or something, but I'm too sick to
| think about this stuff right now. Privately sharing data
| publicly is hard.
| lzooz wrote:
| Good point - I've been running john on that md5 for a
| couple minutes :)
| wizzwizz4 wrote:
| Why use John? Just run down the list of Hacker News
| usernames; it'll take less time. (Or, better still,
| don't; just because the privacy's theoretically
| compromised doesn't mean we have to exploit that.)
| lzooz wrote:
| I don't think there's a public list of all HN usernames
| is there?
|
| Found this, it includes 250k usernames, but it's not
| there.
| https://www.kaggle.com/datasets/hacker-news/hacker-news-corp...
| [deleted]
| meta2023 wrote:
| The username in question isn't in this dataset but maybe
| it was created in the past 10 days, as the max(timestamp)
| is Nov 16th, 2022.
|
| https://console.cloud.google.com/marketplace/details/y-combi...
| ahmedalsudani wrote:
| Another problem is that it's a small set. If you had a list
| of all HN users, you could compute md5 for all of them in
| seconds.
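The point made in this sub-thread, that an unsalted MD5 of a value drawn from a small, enumerable set hides very little, can be illustrated with a short sketch. The candidate names below are invented placeholders, the helper names `md5_hex`, `unmask` and `prefix_bucket` are hypothetical, and the hash posted in the thread is deliberately not used.

```python
import hashlib


def md5_hex(name):
    """Hex MD5 digest of a username."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def unmask(target_hash, candidates):
    """Return the candidate whose MD5 equals target_hash, or None."""
    for name in candidates:
        if md5_hex(name) == target_hash:
            return name
    return None


def prefix_bucket(name, hex_chars=2):
    """First hex characters of the hash; 2 characters give 16**2 = 256 buckets."""
    return md5_hex(name)[:hex_chars]


# Placeholder usernames standing in for a full list of accounts.
candidates = ["example_user_1", "example_user_2", "example_user_3"]
target = md5_hex("example_user_2")
found = unmask(target, candidates)  # "example_user_2"
```

One hash per candidate is all it takes, so a list of a few hundred thousand names is a trivial amount of work. Publishing only `prefix_bucket(name)` reveals far less, since on average 1 in 256 usernames shares any given two-character prefix.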
| [deleted]
| kcarter80 wrote:
| Could you elaborate on why it's obvious why you won't name the
| account?
| notduncansmith wrote:
| Maybe to avoid attracting any extra attention to this user?
| Also, as someone who's read HN for a few years, it only took
| me 2 guesses to find an account that the above comment
| describes (and not necessarily the same person).
| [deleted]
| sillysaurusx wrote:
| It was a classy move by jsnell, too. Thank you.
|
| (I don't know who the comment is talking about, which is
| how it should be. There's no need to blow someone's cover
| in a highly visible way. Even if they were satan, they'd
| still be welcome on HN as long as they're writing
| substantive, interesting comments that follow the
| guidelines.)
| Normal_gaussian wrote:
| Such quality comments would track with most thorough
| Satan representations.
| Aachen wrote:
| They obviously don't want it to be known, seeing as they've
| got alts to post under and avoid going into too much detail.
| Being able to go out and do your own research is different
| than posting the information open for everyone to see at a
| glance.
|
| I would say it's obvious why one might respect that wish (do
| unto others...), but I'm also aware that my and my culture's
| sense of privacy goes further than many others'.
| tbrownaw wrote:
| > _but this functionality turned out way creepier than I
| thought the moment I tried it_
|
| Hopefully this raised awareness means that people who actually
| need anonymity will be more likely to know to take precautions.
| kaba0 wrote:
| Genuinely asking, what way is there to combat this? Is there
| a tool that takes out stylistic elements of your comment?
| thedragonline wrote:
| I wonder if gpt3 has a use case here?
| marbu wrote:
| One way would be to run such a tool before posting and then,
| based on the results, tweak the post and repeat until the
| similarities are not statistically significant. Or instead
| of tweaking, start posting under a new throwaway account.
| But this won't save you when some new way to analyze style
| appears in the future. Moreover, there are other types of
| metadata, such as timestamps, which can be taken into account
| to narrow down the search space a bit. And obviously the more
| you write, the harder it is to control these things.
| paulgb wrote:
| The site mentions a service called Quillbot which
| apparently does just that. https://stylometry.net/avoid
| birdyrooster wrote:
| UncleEntity wrote:
| You know everyone's going to put your username in that tool
| after a rant like that.
|
| If ever there were a good use for a throwaway account I'm
| thinking this is it...
| irrational wrote:
| .6 is high confidence? I did my own username, wondering what it
| would return, since I know I don't have any alt accounts. The
| top results are in the .6-.7 range. If they aren't alt
| accounts, is it just coincidence that we have similar writing
| styles?
| bee_rider wrote:
| I think so.
|
| A funny thought -- my "matches" cap out at around .56. Having
| false positives* in a tool like this might feel like a "bad
| result" but actually I think it just means that if someone
| were running this sort of tool across the whole internet, I'd
| be relatively easy to correlate, while your identity would be
| intermingled with your .6-.7 partners.
|
| *actually they aren't really even false positives because the
| tool doesn't promise to detect alts in the first place, just
| find similar styles.
| joxel wrote:
| ColinWright is Dang?
|
| Woah
| McDyver wrote:
| Would this work for Fernando Pessoa and all his heteronyms? :)
| jll29 wrote:
| The method used, i.e. to calculate the cosine of the two authors'
| word vectors, is poorly suited for stylometric analysis because
| it is based on a poster's lexicon and the frequency of each
| word, while ignoring stylistically relevant factors like word
| order.
|
| Also, the cosine of the vectors of word frequencies conflates
| author-specific vocabulary and topics; in other words, my account
| is grouped (with >51% similarity, according to the demo) with
| someone probably because we wrote about similar things. A strong
| stylometric matcher ought to be robust against topic shifts (our
| personal writing style is what stays constant when we move from
| writing about one topic to writing about another topic, just like
| our personality is what stays constant about our behavior over
| time - of course styles do change, but the premise then has to be
| that such changes happen very slowly).
|
| Stylometrics/authorship identification is interesting and has led
| to some surprising findings, e.g. in forensic linguistics
| (Malcolm Coulthard wrote several good books about the topic).
|
| This paper lists some other features that could be used and
| compares a bunch of techniques:
| https://research.ijcaonline.org/volume86/number12/pxc3893384...
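The word-order objection in this comment is easy to check with a toy example. The sketch below assumes plain bag-of-words counts and cosine similarity; the function names are hypothetical, and the short `FUNCTION_WORDS` set is only an illustrative stand-in for the much longer function-word lists used in classical authorship studies.

```python
from collections import Counter
from math import sqrt

# A deliberately tiny, illustrative list of topic-neutral words.
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "that", "it", "is", "but"}


def bag_of_words(text):
    """Word counts with all ordering information thrown away."""
    return Counter(text.lower().split())


def function_word_profile(text):
    """Counts restricted to function words, which depend less on topic."""
    return Counter(w for w in text.lower().split() if w in FUNCTION_WORDS)


def cosine(p, q):
    """Cosine similarity of two Counters (missing words count as 0)."""
    dot = sum(p[w] * q[w] for w in p)
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


# Two sentences with opposite meanings but identical word counts.
score = cosine(bag_of_words("the dog bit the man"),
               bag_of_words("the man bit the dog"))
```

The two sentences get a similarity of (up to rounding) 1.0, since only the order differs. Restricting the features to function words, as in `function_word_profile`, is one common way to reduce the influence of topic, though it does nothing about word order either.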
| agumonkey wrote:
| Oh god, that thing starts with direct focus on the search field,
| opening it showed a bunch of old nicknames, I thought it was the
| result of some study.
| rand_user_100 wrote:
| On one hand, thank you for showing us all how easy it is to make
| something like this. No doubt organizations with more resources
| already have more sophisticated systems in the same vein.
|
| On the other hand, can we agree that this product is unethical?
|
| In many cases, when a person uses an alt, it is a direct and
| strong signal that they do not wish their other posts to be
| associated.
|
| So this product is circumventing the explicit will of the person,
| and making it available to anyone with zero effort i.e. there is
| no barrier to getting this info.
|
| I met someone about 10 years ago who said they built this at a
| university. And their argument also was "actually this enhances
| privacy because it lets you know something something something".
| And yet their research grants were coming from one source only.
|
| It _can_ be used for good, but most often it won't.
| A4ET8a8uTh0 wrote:
| << On the other hand, can we agree that this product is
| unethical?
|
| It does create a high level of discomfort, because it
| illustrates well what privacy advocates try talking about to
| the population at large, but all that said.. how is it any
| different from regular scraping and analyzing it any other way?
|
| This is a real question.
| rand_user_100 wrote:
| It's different because you're removing all barriers to access
| and making it easy and convenient to stalk/dox people.
|
| Imagine you get the urge to track someone, but in order to do
| that you have to spend a week writing some new software.
| That's a barrier. And because of it you may change your mind
| because it's a lot of work with little payoff.
|
| But if that info is just one click away, it's a whole
| different ballgame.
| [deleted]
| dragonwriter wrote:
| > On the other hand, can we agree that this product is
| unethical?
|
| No.
| gus_massa wrote:
| It would be nice to make the names clickable.
|
| I don't think the list of pg alternate accounts is accurate. I
| checked a few. They have many one-liners, which is typical of pg,
| but the topics and style don't look similar.
|
| I searched a few more and got better results. :)
|
| I searched for myself (I know that I have no alternate
| accounts). I recognize a few users that are interested in similar
| topics, and I discuss/upvote them many times. But I didn't
| recognize most of the users on the list.
| costco wrote:
| > I searched for myself (I know that I have no alternate
| accounts). I recognize a few users that are interested in
| similar topics, and I discuss/upvote them many times. But I
| didn't recognize most of the users on the list.
|
| It's based purely off frequency of the 200 most common English
| 1 word phrases, 2 word phrases, 3 word phrases, 1 character
| sequences, 2 character sequences, and 3 character sequences.
| Topic does not really have anything to do with it. If I had
| more time I probably would've done a smarter model that
| accounted for things like that.
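The description above can be sketched roughly as follows. This is an illustration under assumptions, not the site's actual code: the function names are made up, the vocabulary is taken as the `k` most common features across whatever texts are supplied (rather than a fixed list of common English n-grams), and cosine similarity is assumed as the comparison because other comments in the thread suggest it.

```python
from collections import Counter
from math import sqrt


def features(text):
    """Count word and character 1-, 2- and 3-grams in one text."""
    words = text.lower().split()
    chars = text.lower()
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(("word", " ".join(words[i:i + n]))
                      for i in range(len(words) - n + 1))
        counts.update(("char", chars[i:i + n])
                      for i in range(len(chars) - n + 1))
    return counts


def vocabulary(texts, k=200):
    """Pick the k most common features across all texts (an assumption)."""
    total = Counter()
    for text in texts:
        total.update(features(text))
    return [feature for feature, _ in total.most_common(k)]


def vectorize(text, vocab):
    """Map a text to its counts for each vocabulary feature."""
    counts = features(text)
    return [float(counts[feature]) for feature in vocab]


def cosine(u, v):
    """Cosine similarity; insensitive to how much text each user wrote."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy "users", each represented by all of their comments joined together.
texts = ["the cat sat on the mat",
         "I think that the tool is neat",
         "the cat sat on the mat"]
vocab = vocabulary(texts)
a, b, c = (vectorize(t, vocab) for t in texts)
```

Identical texts (`a` and `c`) score 1.0 up to rounding, while `a` and `b` score lower. Note that content words still enter through the word n-grams whenever they are frequent enough to make the vocabulary, so topic is not excluded entirely by a scheme like this.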
| gus_massa wrote:
| One is also a mathematician. It's trivial that we overuse
| some technical words even if it's unnecessary.
|
| Another is from Argentina, so I guess the native language
| leaks, for example using words derived from Latin that are
| not idiomatic.
|
| And there are a few more that it is an honor to be "confused"
| with, but I have no clue why.
| gavinray wrote:
| I've complained a lot about Haskell and now it thinks I like
| Haskell =(
|
| Needs sentiment analysis IMO, otherwise you'll get "Here's a
| bunch of people who are JUST LIKE YOU", except they use a similar
| grammar style but hold opposite opinions on the same nouns.
| ahmedalsudani wrote:
| Serves you right for disparaging The One True Language!
|
| Ok, fine, we'll present Idris with a fig leaf.
| layer8 wrote:
| It just thinks you engage a lot with Haskell. These are people
| with whom you have something to talk about. :)
| chronogram wrote:
| Well done, it found my ancient old account.
| [deleted]
| scotty79 wrote:
| Funny thing would be to find most unique user account
| stylistically.
|
| Which user has lowest best match?
|
| Mine is 0.58 so I'm really not that unique.
| ggerganov wrote:
| I really liked the informative and straight-to-the-point about
| page - describing how the algorithm works in a way that is easy
| to understand. All the important details are summarised there.
| Well done!
|
| Edit: From the "How to avoid .." page, there is the following
| sentence:
|
| > Also, most authorship identification algorithms have poor
| accuracy when working with small amounts of words. This means the
| optimal strategy would be discarding an account either after
| every comment or after a small number of comments. Unfortunately,
| this is against HN rules and may result in a ban.
|
| Can you clarify what this means and why it would result in a ban?
| costco wrote:
| > Can you clarify what this means and why it would result in a
| ban?
|
| I have seen dang respond to users multiple times asking them to
| stop making new accounts especially but not always if it's to
| avoid rate limiting. I don't know if there's an official policy
| but it's definitely something I recall.
| krisoft wrote:
| > Can you clarify what this means
|
| Imagine that for every new comment you want to post you would
| create a brand new account which you would use precisely once
| and never again. Then the stylometry would have just a few
| words and wouldn't have enough corpus to get a reliable
| signature. If a lot of people do this, it would be hard to
| figure out which account belongs to which human. (Of course,
| if you alone do this, your messages will stick out like a sore
| thumb. See xkcd 1105.)
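|
| A rough simulation of why short samples give an unreliable
| signature. Everything here is an assumption for illustration: the
| vocabulary, the weights of the hypothetical author, and the use of
| cosine similarity. It draws two samples from the same word
| distribution and measures how alike they look at a given size.

```python
import math
import random
from collections import Counter

# Hypothetical author: a tiny vocabulary with made-up usage weights.
VOCAB = ["the", "of", "and", "to", "a", "in", "that", "is"]
WEIGHTS = [8, 5, 4, 4, 3, 3, 2, 1]


def sample_profile(rng, n_words):
    """Draw n_words from the author's distribution; return relative frequencies."""
    words = rng.choices(VOCAB, weights=WEIGHTS, k=n_words)
    counts = Counter(words)
    return [counts[w] / n_words for w in VOCAB]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_self_similarity(n_words, trials=200, seed=0):
    """Average similarity between two independent samples of the SAME author."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += cosine(sample_profile(rng, n_words), sample_profile(rng, n_words))
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 2000):
        print(n, round(mean_self_similarity(n), 3))
```

| With only a handful of words, two samples from the same author can
| look quite different, so a one-comment account gives a matcher
| very little to work with.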
|
| > why it would result in a ban?
|
| Because this practice is especially discouraged in the
| guidelines: "please don't create accounts routinely. HN is a
| community--users should have an identity that others can relate
| to."
| stupendous_luck wrote:
| At the same time, HN doesn't let you delete comments.
|
| Maybe with some GDPR magic.
| krisoft wrote:
| Not sure what your point is, or how it connects with
| my comment. Care to elaborate?
| stupendous_luck wrote:
| Your comment quotes an HN guideline, and my point relates
| to it. Some users may feel the need to create throwaway
| accounts in order to post comments that in an alternative
| reality they could post under their primary account and
| later delete if desired. It may not stop a scrupulous
| collector of data, but such a scenario may not be the
| object of their worry.
|
| Taking this to its logical conclusion, a user may opt
| to always post under a throwaway account, to avoid any
| possible tainting associated with a primary account.
| jaredsohn wrote:
| Amusingly can't run it on the author since not enough comments
| joshstrange wrote:
| Very interesting, .59 is my lowest, .64 is my highest match, none
| of these accounts are one of my alts. Though to be fair the
| handful of times I've used a throwaway I used it for a single
| comment so I didn't give it much to go off.
| sedatk wrote:
| I have no alternate accounts, and all my matches are below 0.4
| for whatever it's worth.
| SevenNation wrote:
| > ... This site works primarily by analyzing for each user the
| frequencies of the most common words and phrases in the English
| language. Accordingly, the easiest way to avoid being identified
| is to simply use different words than you ordinarily would when
| writing. More sophisticated models than the one I made can use
| punctuation, comma usage, and capitalization to identify you so
| try alternating those as well. Services like Quillbot can help
| you with this but depending on your circumstances you may not want
| to send your writings to a third party service.
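|
| For illustration, a minimal sketch of the word-frequency idea
| described in that quote. This is not the site's actual code: the
| word list is a short made-up sample, and cosine similarity is
| assumed as the comparison measure.

```python
import math
import re
from collections import Counter

# Assumed, illustrative list of common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "it", "for", "but", "not"]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Compare two users' concatenated comments by function-word usage."""
    return cosine(profile(text_a), profile(text_b))
```

| Swapping in different words changes the profile vector, which is
| why the quoted advice would lower a score under this kind of
| model.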
|
| HN offers many other threads which could be tied together,
| including:
|
| - time of posting
|
| - ratio of replies to top-level comments
|
| - comments being mainly upvoted or downvoted
|
| - sentiment (mostly angry, dismissive, questioning, etc.)
|
| - most common topics (keyword analysis of post being replied to)
|
| - ratio of new posting to post replies
|
| - first-to-comment on a post
|
| - lone comment on a post
|
| - etc...
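|
| As a sketch of the first item only (time of posting), assuming
| comment timestamps are available as Unix epoch seconds. The
| function names are hypothetical, and a real linker would combine
| many such signals rather than rely on one.

```python
import math
from datetime import datetime, timezone


def hour_histogram(timestamps):
    """Fraction of comments posted in each UTC hour (24 bins)."""
    bins = [0] * 24
    for ts in timestamps:
        bins[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    total = sum(bins) or 1  # avoid dividing by zero on empty input
    return [b / total for b in bins]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def posting_time_similarity(timestamps_a, timestamps_b):
    """Compare two accounts by when, during the day, they tend to post."""
    return cosine(hour_histogram(timestamps_a), hour_histogram(timestamps_b))
```

| Two accounts that both post only in the same few hours would score
| near 1 here, though so would any two people in the same time zone
| with similar habits.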
|
| It seems very likely that sooner or later every pseudonym for
| posting content will get discovered and linked. The lesson here
| is don't post anything that would cause you undue shame or harm
| if linked directly to your legal name.
___________________________________________________________________
(page generated 2022-11-26 23:00 UTC)