[HN Gopher] Show HN: Semantic clusters and embeddings for 500k H...
       ___________________________________________________________________
        
       Show HN: Semantic clusters and embeddings for 500k Hacker News
       comments
        
       Author : josh-sematic
       Score  : 71 points
       Date   : 2024-06-12 16:35 UTC (6 hours ago)
        
 (HTM) web link (app.airtrain.ai)
 (TXT) w3m dump (app.airtrain.ai)
        
       | calebheinzman wrote:
       | Interesting breakdown of the dataset. Are you guys manually
       | assigning labels to the clusters after the fact or is this using
       | some kind of LLM to create a cluster name?
        
         | josh-sematic wrote:
         | Everything is automatic, with no manual curation; we basically
         | just uploaded the data without doing anything dataset-specific
         | as far as the clusters are concerned. First we create an
         | embedding of the rows, then we run projection & clustering on
         | it. The first clusters we generate are narrow. After we
         | generate the first round of clusters we label them with an LLM,
         | and then cluster those to create the more generalized clusters.
         | Have an LLM label those, and then we're done.
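The two-round flow described in the comment above (embed rows, cluster narrowly, label each cluster, then cluster the labels into broader groups) can be sketched roughly as follows. This is not Airtrain's implementation: the bag-of-words `embed`, the greedy threshold `cluster`, and the word-frequency `label` are hypothetical stand-ins for a real embedding model, a real clustering algorithm, and an LLM call, and the projection step is left out entirely. The sample rows and thresholds are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold):
    """Greedy one-pass clustering (an assumption, not the real algorithm):
    a text joins the first cluster whose first member is at least
    `threshold` similar to it, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `texts`
    for i, text in enumerate(texts):
        for members in clusters:
            if cosine(embed(text), embed(texts[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def label(texts, n_words):
    """Stand-in for the LLM labelling step: the most frequent words."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return " ".join(word for word, _ in counts.most_common(n_words))


# Made-up rows standing in for comments.
rows = [
    "rust borrow checker lifetimes",
    "rust borrow checker errors",
    "rust async tokio runtime",
    "rust async tokio tasks",
    "python type hints mypy",
    "python type hints typing",
]

# Round 1: narrow clusters over the rows, then a label per cluster.
narrow = cluster(rows, threshold=0.7)
narrow_labels = [label([rows[i] for i in m], n_words=2) for m in narrow]

# Round 2: cluster the labels themselves to get broader groups, then label those.
broad = cluster(narrow_labels, threshold=0.4)
broad_labels = [label([narrow_labels[i] for i in m], n_words=1) for m in broad]

for members, name in zip(broad, broad_labels):
    print(name, "->", [narrow_labels[i] for i in members])
```

The point of the sketch is the shape of the pipeline: the second round never looks at the raw rows, only at the labels produced by the first round.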
        
       | GeoAtreides wrote:
       | I did not consent to my HN data being used by entities other than
       | HN[1]. Please remove all my comments and data from this dataset.
       | 
       | [1] As per HN license: https://www.ycombinator.com/legal/
        
         | austinjp wrote:
         | Yep indeed, same here.
         | 
         | I know "me too" comments aren't generally acceptable here, but
         | I think the wholesale pillaging of the web for LLMs etc is way
         | out of hand.
        
           | shadowgovt wrote:
           | Back in the day, we called that "indexing" and it was
           | fundamental to making the web in any way usable; without
           | search engines, the whole thing was data with no ability to
           | locate it.
           | 
           | I don't know precisely what changed that people decided that
           | analysis is a bridge too far.
        
         | neutralino1 wrote:
         | The original dataset is located at [1] (not our HF account). HN
         | data is directly available via the HN API [2]. The privacy
         | policy you point to does not cover HN posts.
         | 
         | [1] https://huggingface.co/datasets/OpenPipe/hacker-news
         | [2] https://github.com/HackerNews/API
        
           | GeoAtreides wrote:
           | Well, then the original dataset should also remove my data.
           | Just because it's there doesn't mean it's there legally.
           | 
           | It covers all User Content, which is comments and post titles
           | (I think) at a minimum.
        
             | vidarh wrote:
             | That policy explicitly excludes HN comments.
        
             | maleldil wrote:
             | From your link:
             | 
             | > Hacker News Information: If you create a Hacker News
             | account (ID and profile), we do not collect any Personal
             | Information unless you choose to provide your email address
             | and/or information in the "about" field ("HN Information").
             | Your submissions to, and comments you make on, the Hacker
             | News site are not Personal Information and are not "HN
             | Information" as defined in this Privacy Policy.
        
               | shadowgovt wrote:
               | Correct.
               | 
               | I'm a little surprised people don't know the ownership
               | story on HN. Didn't it raise questions when they realized
               | they can't delete their posts without mother-may-I'ing
               | the mods?
               | 
               | HN is pretty up-front that when you post here you are
               | providing them content for free to more-or-less consume
               | as they please.
        
               | GeoAtreides wrote:
               | I disagree on the up-front... front. Neither of the
               | things you mentioned are made clear when signing up for
               | an account.
        
         | thenewnewguy wrote:
         | You did, although I have no idea if OP is an affiliate of YC
         | (IANAL but maybe you could argue the public API is a form of
         | sublicensing the data?):
         | 
         | > By uploading any User Content you hereby grant and will grant
         | Y Combinator and its affiliated companies a nonexclusive,
         | worldwide, royalty free, fully paid up, transferable,
         | sublicensable, perpetual, irrevocable license to copy, display,
         | upload, perform, distribute, store, modify and otherwise use
         | your User Content for any Y Combinator-related purpose in any
         | form, medium or technology now known or later developed.
        
           | GeoAtreides wrote:
           | > Y Combinator and its affiliated companies
           | 
           | My data is licensed only to Y Combinator and its affiliated
           | companies (I would have preferred it to be licensed only for
           | news.ycombinator.com), not any other rando that can access it
           | via a browser or an API
        
             | daemonologist wrote:
             | Sematic (OP's employer; rebranded to Airtrain?) is a YC
             | company. Not a lawyer but I assume that would be included
             | in "affiliated" since YC presumably has some ownership of
             | them.
             | 
             | https://www.ycombinator.com/companies/airtrain-ai
        
               | GeoAtreides wrote:
               | Welp, there it is then.
               | 
               | If they are YC affiliated, nothing I can do.
               | 
               | But I do feel saddened and personally betrayed; I thought
               | the licence I gave to YC was just for
               | news.ycombinator.com to store and show my comments, not
               | for any other purposes.
               | 
               | Silly, silly me.
        
               | senordevnyc wrote:
               | All HN _did_ do was store and show your (public)
               | comments.
               | 
               | What this other company did with that (public) data seems
               | to me to be a separate issue that you should take up with
               | that company, just like the fact that your public
               | comments (which you explicitly gave permission to HN to
               | show) have been indexed by Google, Bing, and probably
               | thousands of other spiders, bots, scrapers, etc.
               | 
               | I'm curious how you expected this to work. Like if you
               | only give HN permission to store and show your comments
               | on the public web, then somehow no other entity out there
               | will be able to do anything with them?
        
               | GeoAtreides wrote:
               | Yes, I expect it to work in the same way instagram works,
               | for example. If a commercial entity started yoinking
               | photos from instagram and using them for commercial
               | purposes, shit will hit the fan.
               | 
               | Again, the fact my user data has been scrapped already
               | doesn't mean it was scrapped legally. I'm ok with HN
               | showing my comments, I'm not ok with anyone else than HN
               | using my user data.
        
               | averageRoyalty wrote:
               | > If a commercial entity started yoinking photos from
               | instagram and using them for commercial purposes, shit will
               | hit the fan.
               | 
               | Will it though? I would imagine Meta would block them and
               | then posture with a C&D or a frivolous lawsuit, but if
               | they share the photos you gave them on the public
               | internet, they're publicly consumable right? What law do
               | you feel is broken there?
        
               | GeoAtreides wrote:
               | It's not Meta that would sue them (although they would),
               | it's the copyright owners (the users) that will. Photos
               | or comments, the User retains copyright on their content,
               | and only licenses it to Meta or YC for specific purposes.
               | Yes, that means Meta/YC and their affiliated companies
               | can use the content for other purposes than displaying it
               | in a browser, but 3rd parties 100% can't.
        
               | shadowgovt wrote:
               | Shit hits the fan because Instagram considers user
               | content a golden goose, and _they_ have a vested interest
               | in not letting it get outside their control. Not because
               | they feel a particular obligation to protect user
               | privacy. That's generally been status quo for every
               | social network.
               | 
               | HN cares a lot less; they're a tech comment site and
               | don't actively discourage people using the dataset
               | gleanable from the contents of the site for novel
               | experimentation.
               | 
               | (Sidebar: I see "scrapped" coming up a lot in these
               | conversations these days. Where is that neologism coming
               | from? I'm familiar with people calling it "scraping" but
               | it seems like the term has drifted for some reason?)
        
           | lolinder wrote:
           | > IANAL but maybe you could argue the public API is a form of
           | sublicensing the data?
           | 
           | Given that the license is explicitly identified as both
           | sublicensable and transferable and includes the right to
           | distribute, I have a very hard time seeing how anyone could
           | argue that the recipient of data that YC explicitly exposes
           | through their "Official HN API" isn't allowed to use it.
        
         | donpark wrote:
         | That's a contract between users and HN. Airtrain is a 3rd-
         | party.
         | 
         | If HN API exposes personal information publicly through their
         | API then there is a problem.
         | 
         | And AFAICT the only way for HN to prevent user comments from
         | being used by 3rd-party is preventing access to those comments,
         | meaning a) sign-up will have to be more stringent and b)
         | visitors will have to sign-in just to read (or scrape)
         | comments.
        
           | senordevnyc wrote:
           | And that doesn't really prevent anything, it just (mildly)
           | slows it down.
        
         | renewiltord wrote:
         | This is pretty clever. Reminds me of Larry Philpot's time-bomb
         | CC-licensed images
         | https://commons.wikimedia.org/wiki/File:Flaming_Lips.jpg
         | 
         | You put things in a place with an expectation of a certain
         | standard of use and then go after people hammer and tongs with
         | a strict interpretation. Sometimes, the strict interpretation
         | need not be valid. You can just shake them down.
        
         | Lerc wrote:
         | If they are an entity other than HN, how can they act upon this
         | request when the request is itself data on HN?
        
         | olivierduval wrote:
         | Actually, another problem might be GDPR: I have found my
         | username... and that is clearly PII because it's directly
         | and univocally bound to me
         | 
         | I don't really care (for now) about this... but on
         | principle, I'm a bit fed up too with companies just crawling
         | anything to train any model without any care about the data,
         | the people that produced them, and the consequences on people's
         | life.
         | 
         | Maybe I could use the new European AI Act too
         | (https://artificialintelligenceact.eu/fr/high-level-summary/)
         | ... although I'm not sure because I didn't read it yet
        
           | shadowgovt wrote:
           | I look forward to the not-too-distant future where the EU
           | protections grow stronger and places like HN have to respond
           | by banning all European users lest they run afoul of a
           | draconian legal framework.
           | 
           | It'll kill a lot of experiments (Mastodon immediately comes
           | to mind; can't be pulling comments from other people's
           | servers if those comments are attached to personal data like
           | the commenter's username, right?).
        
             | olivierduval wrote:
             | Well, maybe you should think about the real
             | responsibilities: Europe makes laws in reaction to ABUSES.
             | So don't blame Europe for the legislation, but the abusers
             | that made this legislation necessary to defend European
             | citizens ;-)
             | 
             | Actually, Europe is so slow that a lot of experiments may
             | take place. And there won't be any legislation if there is
             | no abuse...
             | 
             | It took a loooonnnng time for Europe to react to Facebook,
             | Google & co abusing user data. Same for OpenAI using an
             | awful lot of copyrighted material without giving anything
             | back... So thank them for European legislation :-)
        
           | averageRoyalty wrote:
           | I'm not up on the nuance of the GDPR, but has it been tested
           | that your public profile name - which you set knowing it will
           | be displayed publicly - is PII?
           | 
           | I'd be very surprised if that were the case.
        
             | shadowgovt wrote:
             | It's not PII (an American term) but it is personal data (a
             | GDPR term).
             | 
             | Personal data is (broadly) considered to be data that could
             | be used to track or tie your behavior online together into
             | a profile. The UK's ICO calls out usernames specifically as
             | an example of such data. https://ico.org.uk/for-
             | organisations/uk-gdpr-guidance-and-re....
             | 
             | (For those of us who have been around on the Internet long
             | enough to remember the era where people intentionally chose
             | handles to remain pseudonymous and separate from their IRL
             | personas, this seems counter-intuitive and a little
             | preposterous, but the GDPR doesn't care what "netizens"
             | think about privacy; it's a broad attempt to impose a "non-
             | native" concept of privacy over the preexisting net
             | culture).
        
             | olivierduval wrote:
             | Well, you choose a username in a specific context, even if
             | it's public.
             | 
             | For example, you may agree to have your LinkedIn profile
             | name next to your HN username... maybe. But I'm not sure
             | that you would agree to have your LinkedIn profile name
             | next to your Tinder username.
             | 
             | And you sure don't want that to happen without your
             | agreement and even without you knowing about it (but
             | learning about it from a colleague for example).
             | 
             | That's why GDPR has some right to deletion or modification.
             | And why some day, Europe may go after data brokers.
             | 
             | (as a side note: not sure why my comments were downvoted. I
             | didn't say that I would go after anybody - and surely not
             | HN - I only said that uncontrolled use of any data without
             | any anonymization and without consent might be the source
             | of problems with regard to legislation decided BECAUSE too
             | many shady businesses abused it. You may not like it but
             | then... well... downvote the abusers)
        
         | eddd-ddde wrote:
         | I am an entity other than HN.
         | 
         | Am I not allowed to read your comments?
         | 
         | Am I not allowed to learn from reading your comments?
        
           | GeoAtreides wrote:
           | You are allowed all rights given to you by (copyright) law,
           | while respecting the YC license:
           | https://www.ycombinator.com/legal/
        
       | politelemon wrote:
       | I'm not sure how to view the embedding, I clicked on the graph,
       | narrowed down to a comment, but it only shows the row and not the
       | raw array? (Or I've misunderstood)
        
         | neutralino1 wrote:
         | Oh you cannot actually view the raw embedding vector, only the
         | corresponding row.
         | 
         | Any particular use-case to view the raw embedding?
        
           | politelemon wrote:
           | None at all, pure curiosity thanks!
        
         | Alifatisk wrote:
         | It would be very cool if one could view the whole 3D embedding
         | space
        
           | neutralino1 wrote:
           | We debated doing 2D vs 3D and 3D brought a bunch of usability
           | issues. We also noticed most SOTA embedding visualizations
           | were 2D and already yielded good insights.
        
       | costco wrote:
       | Some users comment frequently and uniquely enough that they get
       | their own cluster. I also like how there's "Deno vs Node" lol.
       | It'd be cool if you could click on a comment and get the most
       | similar ones by whatever distance metric you used.
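A "show me the most similar comments" lookup like the one suggested above could be sketched as a brute-force nearest-neighbour search. The thread does not say which distance metric the dashboard uses, so cosine similarity here is an assumption, and the comment IDs and three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_id, embeddings, k=3):
    """Return the IDs of the k comments closest to `query_id`,
    most similar first, by brute-force comparison against every other row."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), comment_id)
        for comment_id, vec in embeddings.items()
        if comment_id != query_id  # do not return the clicked comment itself
    ]
    scored.sort(reverse=True)
    return [comment_id for _, comment_id in scored[:k]]


# Hypothetical comment IDs mapped to made-up embedding vectors.
embeddings = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.9, 0.1, 0.0],
    "c3": [0.0, 1.0, 0.0],
    "c4": [0.7, 0.7, 0.0],
}

neighbours = most_similar("c1", embeddings, k=2)
print(neighbours)
```

A linear scan like this is fine for a small sample; at 500k rows an approximate nearest-neighbour index would be the more likely choice.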
        
         | josh-sematic wrote:
         | It's on our roadmap to be able to click on rows and see similar
         | ones; thanks for the feedback!
        
       | LordGrey wrote:
       | I'm not sure about others here, but I occasionally spend time
       | typing out a long reply to someone and then simply deleting the
       | reply without posting. Most of the time I conclude -- a little
       | too late -- that the effort was not worth it.
       | 
       | How nice it would be to have an LLM trained on all of my previous
       | writings and simply be able to click a button to indicate "reply
       | to this person, please." I know I don't have enough training data
       | from HN, and maybe not even from all of the sites I contribute to,
       | combined. It is still a nice thought, though.
       | 
       | But: Let's say I do acquire enough training data to have a local
       | LLM do exactly what I describe. My volume of "replies" would
       | certainly increase. Is that a good thing, on average? If the tool
       | became ubiquitous, would it be a good thing for the average
       | social media user? Or more pointedly, would it be a good thing
       | for consumers of that social media? The cynic in me thinks "no"
       | -- the effort required today surely weeds out _some_ idiots....
       | 
       | (Full disclosure, I nearly closed this window without clicking
       | the "add comment" button.)
        
       | giancarlostoro wrote:
       | Rough view on mobile: the navigation isn't auto-hiding and is in
       | the way.
        
         | neutralino1 wrote:
         | Yeah sorry about that, this is a data dashboard. Not optimized
         | for mobile viewing.
        
       ___________________________________________________________________
       (page generated 2024-06-12 23:02 UTC)