[HN Gopher] Show HN: Semantic clusters and embeddings for 500k H...
___________________________________________________________________
Show HN: Semantic clusters and embeddings for 500k Hacker News
comments
Author : josh-sematic
Score : 71 points
Date : 2024-06-12 16:35 UTC (6 hours ago)
(HTM) web link (app.airtrain.ai)
(TXT) w3m dump (app.airtrain.ai)
| calebheinzman wrote:
| Interesting breakdown of the dataset. Are you guys manually
| assigning labels to the clusters after the fact or is this using
| some kind of LLM to create a cluster name?
| josh-sematic wrote:
| Everything is automatic, with no manual curation; we basically
| just uploaded the data without doing anything dataset-specific
| as far as the clusters are concerned. First we create an
| embedding of the rows, then we run projection & clustering on
| it. The first clusters we generate are narrow. After we
| generate the first round of clusters we label them with an LLM,
| and then cluster those to create the more generalized clusters.
| Have an LLM label those, and then we're done.
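
A minimal sketch of the two-round flow described in the comment above (embed the rows, cluster narrowly, label each cluster, then cluster the labels and label again). Everything here is an assumption for illustration: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, `kmeans` is a plain k-means rather than whatever clustering Airtrain actually runs, `label_cluster` is a stub where the LLM call would go, and the projection step is left out entirely.

```python
import math
import random
import zlib
from collections import Counter

def embed(text, dim=32):
    """Toy embedding (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iters):
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def label_cluster(texts):
    """Stub for the LLM labelling step (assumption): most frequent longer word."""
    words = Counter(w for t in texts for w in t.lower().split() if len(w) > 3)
    return words.most_common(1)[0][0] if words else "misc"

def two_stage(comments, k_narrow, k_broad):
    """Return a (narrow_label, broad_label) pair for every comment."""
    # Round 1: narrow clusters over the comment embeddings, then label them.
    narrow = kmeans([embed(c) for c in comments], k_narrow)
    ids = sorted(set(narrow))
    narrow_labels = {
        i: label_cluster([c for c, a in zip(comments, narrow) if a == i]) for i in ids
    }
    # Round 2: cluster the narrow labels themselves, then label those groups.
    broad = kmeans([embed(narrow_labels[i]) for i in ids], min(k_broad, len(ids)))
    broad_of = dict(zip(ids, broad))
    broad_labels = {
        b: label_cluster([narrow_labels[i] for i in ids if broad_of[i] == b])
        for b in set(broad)
    }
    return [(narrow_labels[a], broad_labels[broad_of[a]]) for a in narrow]

comments = [
    "rust borrow checker lifetimes",
    "rust async runtime tokio",
    "python pandas dataframe",
    "python numpy arrays",
    "startup funding seed round",
    "startup hiring engineers",
]
for comment, (narrow, broad) in zip(comments, two_stage(comments, k_narrow=3, k_broad=2)):
    print(f"{broad:>10} / {narrow:<10} {comment}")
```

With a real embedding model and an LLM in place of the two stubs, the second round would group labels by meaning rather than by shared words, which is presumably what makes the broader clusters useful.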
| GeoAtreides wrote:
| I did not consent to my HN data being used by entities other than
| HN[1]. Please remove all my comments and data from this dataset.
|
| [1] As per HN license: https://www.ycombinator.com/legal/
| austinjp wrote:
| Yep indeed, same here.
|
| I know "me too" comments aren't generally acceptable here, but
| I think the wholesale pillaging of the web for LLMs etc is way
| out of hand.
| shadowgovt wrote:
| Back in the day, we called that "indexing" and it was
| fundamental to making the web in any way usable; without
| search engines, the whole thing was data with no ability to
| locate it.
|
| I don't know precisely what changed that people decided that
| analysis is a bridge too far.
| neutralino1 wrote:
| The original dataset is located at [1] (not our HF account). HN
| data is directly available via the HN API [2]. The privacy
| policy you point to does not cover HN posts.
|
| [1] https://huggingface.co/datasets/OpenPipe/hacker-news
| [2] https://github.com/HackerNews/API
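
For readers who have not used the HN API mentioned above, here is a small sketch of reading one item from it. The base URL and the `item/<id>.json` route come from the public API repository linked in [2]; the helper names (`item_url`, `fetch_item`, `summarize`) and the choice of fields to keep are assumptions made for this example.

```python
import json
import urllib.request

# Base URL of the official Hacker News API (served via Firebase).
API_ROOT = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """Build the URL for a single story or comment."""
    return f"{API_ROOT}/item/{item_id}.json"

def fetch_item(item_id, timeout=10):
    """Download one item and decode its JSON body (needs network access)."""
    with urllib.request.urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)

def summarize(item):
    """Keep a few fields; deleted items may lack 'by' or 'text'."""
    return {
        "id": item.get("id"),
        "by": item.get("by"),
        "type": item.get("type"),
        "text": item.get("text", ""),
    }

# Example call, left commented out because it hits the network:
# print(summarize(fetch_item(8863)))
```

Note that the username (`by`) travels with every item, which is the detail the personal-data discussion further down turns on.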
| GeoAtreides wrote:
| Well, then the original dataset should also remove my data.
| Just because it's there doesn't mean it's there legally.
|
| It covers all User Content, which is comments and post titles
| (I think) at a minimum.
| vidarh wrote:
| That policy explicitly excludes HN comments.
| maleldil wrote:
| From your link:
|
| > Hacker News Information: If you create a Hacker News
| account (ID and profile), we do not collect any Personal
| Information unless you choose to provide your email address
| and/or information in the "about" field ("HN Information").
| Your submissions to, and comments you make on, the Hacker
| News site are not Personal Information and are not "HN
| Information" as defined in this Privacy Policy.
| shadowgovt wrote:
| Correct.
|
| I'm a little surprised people don't know the ownership
| story on HN. Didn't it raise questions when they realized
| they can't delete their posts without mother-may-I'ing
| the mods?
|
| HN is pretty up-front that when you post here you are
| providing them content for free to more-or-less consume
| as they please.
| GeoAtreides wrote:
| I disagree on the up-front... front. Neither of the
| things you mentioned are made clear when signing up for
| an account.
| thenewnewguy wrote:
| You did, although I have no idea if OP is an affiliate of YC
| (IANAL but maybe you could argue the public API is a form of
| sublicensing the data?):
|
| > By uploading any User Content you hereby grant and will grant
| Y Combinator and its affiliated companies a nonexclusive,
| worldwide, royalty free, fully paid up, transferable,
| sublicensable, perpetual, irrevocable license to copy, display,
| upload, perform, distribute, store, modify and otherwise use
| your User Content for any Y Combinator-related purpose in any
| form, medium or technology now known or later developed.
| GeoAtreides wrote:
| > Y Combinator and its affiliated companies
|
| My data is licensed only to Y Combinator and its affiliated
| companies (I would have preferred it to be licensed only for
| news.ycombinator.com), not any other rando that can access it
| via a browser or an API
| daemonologist wrote:
| Sematic (OP's employer; rebranded to Airtrain?) is a YC
| company. Not a lawyer but I assume that would be included
| in "affiliated" since YC presumably has some ownership of
| them.
|
| https://www.ycombinator.com/companies/airtrain-ai
| GeoAtreides wrote:
| Welp, there it is then.
|
| If they are YC affiliated, nothing I can do.
|
| But I do feel saddened and personally betrayed; I thought
| the licence I gave to YC was just for
| news.ycombinator.com to store and show my comments, not
| for any other purposes.
|
| Silly, silly me.
| senordevnyc wrote:
| All HN _did_ do was store and show your (public)
| comments.
|
| What this other company did with that (public) data seems
| to me to be a separate issue that you should take up with
| that company, just like the fact that your public
| comments (which you explicitly gave permission to HN to
| show) have been indexed by Google, Bing, and probably
| thousands of other spiders, bots, scrapers, etc.
|
| I'm curious how you expected this to work. Like if you
| only give HN permission to store and show your comments
| on the public web, then somehow no other entity out there
| will be able to do anything with them?
| GeoAtreides wrote:
| Yes, I expect it to work in the same way instagram works,
| for example. If a commercial entity started yoinking
| photos from instagram and using them for commercial
| purposes, shit will hit the fan.
|
| Again, the fact my user data has been scrapped already
| doesn't mean it was scrapped legally. I'm ok with HN
| showing my comments, I'm not ok with anyone other than HN
| using my user data.
| averageRoyalty wrote:
| > If a commercial entity started yoinking photos from instagram
| and using them for commercial purposes, shit will hit the
| fan.
|
| Will it though? I would imagine Meta would block them and
| then posture with a C&D or a frivolous lawsuit, but if
| they share the photos you gave them on the public
| internet, they're publicly consumable right? What law do
| you feel is broken there?
| GeoAtreides wrote:
| It's not Meta that would sue them (although they would),
| it's the copyright owners (the users) that will. Photos
| or comments, the User retains copyright on their content,
| and only license it to Meta or YC for specific purposes.
| Yes, that means Meta/YC and their affiliated companies
| can use the content for other purposes than displaying it
| in a browser, but 3rd parties 100% can't.
| shadowgovt wrote:
| Shit hits the fan because Instagram considers user
| content a golden goose, and _they_ have a vested interest
| in not letting it get outside their control. Not because
| they feel a particular obligation to protect user
| privacy. That's generally been status quo for every
| social network.
|
| HN cares a lot less; they're a tech comment site and
| don't actively discourage people using the dataset
| gleanable from the contents of the site for novel
| experimentation.
|
| (Sidebar: I see "scrapped" coming up a lot in these
| conversations these days. Where is that neologism coming
| from? I'm familiar with people calling it "scraping" but
| it seems like the term has drifted for some reason?)
| lolinder wrote:
| > IANAL but maybe you could argue the public API is a form of
| sublicensing the data?
|
| Given that the license is explicitly identified as both
| sublicensable and transferable and includes the right to
| distribute, I have a very hard time seeing how anyone could
| argue that the recipient of data that YC explicitly exposes
| through their "Official HN API" isn't allowed to use it.
| donpark wrote:
| That's a contract between users and HN. Airtrain is a 3rd-
| party.
|
| If HN exposes personal information publicly through their
| API then there is a problem.
|
| And AFAICT the only way for HN to prevent user comments from
| being used by 3rd parties is preventing access to those comments,
| meaning a) sign-up will have to be more stringent and b)
| visitors will have to sign-in just to read (or scrape)
| comments.
| senordevnyc wrote:
| And that doesn't really prevent anything, it just (mildly)
| slows it down.
| renewiltord wrote:
| This is pretty clever. Reminds me of Larry Philpot's time-bomb
| CC-licensed images
| https://commons.wikimedia.org/wiki/File:Flaming_Lips.jpg
|
| You put things in a place with an expectation of a certain
| standard of use and then go after people hammer and tongs with
| a strict interpretation. Sometimes, the strict interpretation
| need not be valid. You can just shake them down.
| Lerc wrote:
| If they are an entity other than HN, how can they act upon this
| request when the request is itself data on HN?
| olivierduval wrote:
| Actually, another problem might be GDPR: I have found my
| username... and that is clearly PII because it's directly
| and unambiguously bound to me
|
| I don't really care (for now) about this... but on
| principle, I'm a bit fed up too with companies just crawling
| anything to train any model without any care about the data,
| the people that produced it, and the consequences for people's
| lives.
|
| Maybe I could use the new European AI Act too
| (https://artificialintelligenceact.eu/fr/high-level-summary/)
| ... although I'm not sure because I didn't read it yet
| shadowgovt wrote:
| I look forward to the not-too-distant future where the EU
| protections grow stronger and places like HN have to respond
| by banning all European users lest they run afoul of a
| draconian legal framework.
|
| It'll kill a lot of experiments (Mastodon immediately comes
| to mind; can't be pulling comments from other people's
| servers if those comments are attached to personal data like
| the commenter's username, right?).
| olivierduval wrote:
| Well, maybe you should think about the real
| responsibilities: Europe makes laws in reaction to ABUSES.
| So don't blame Europe for the legislation, but the abusers
| that made this legislation necessary to defend European
| citizens ;-)
|
| Actually, Europe is so slow that a lot of experiments may
| take place. And there won't be any legislation if there is no
| abuse...
|
| It took a loooonnnng time for Europe to react to Facebook,
| Google & co's abuse of users' data. Same for OpenAI using an
| awful lot of copyrighted material without giving anything
| back... So thank them for the European legislation :-)
| averageRoyalty wrote:
| I'm not up on the nuance of the GDPR, but has it been tested
| that your public profile name - which you set knowing it will
| be displayed publicly - is PII?
|
| I'd be very surprised if that were the case.
| shadowgovt wrote:
| It's not PII (an American term) but it is personal data (a
| GDPR term).
|
| Personal data is (broadly) considered to be data that could
| be used to track or tie your behavior online together into
| a profile. The UK's ICO calls out usernames specifically as
| an example of such data.
| https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-re....
|
| (For those of us who have been around on the Internet long
| enough to remember the era where people intentionally chose
| handles to remain pseudonymous and separate from their IRL
| personas, this seems counter-intuitive and a little
| preposterous, but the GDPR doesn't care what "netizens"
| think about privacy; it's a broad attempt to impose a "non-
| native" concept of privacy over the preexisting net
| culture).
| olivierduval wrote:
| Well, you choose a username in a specific context, even if
| it's public.
|
| For example, you may agree to have your LinkedIn profile
| name next to your HN username... maybe. But I'm not sure
| that you would agree to have your LinkedIn profile name
| next to your Tinder username.
|
| And you sure don't want that to happen without your
| agreement and even without you knowing about it (but
| learning about it from a colleague for example).
|
| That's why GDPR includes rights to deletion or modification.
| And why, some day, Europe may go after data brokers.
|
| (as a side note: not sure why my comments were downvoted. I
| didn't say that I would go after anybody - and surely not
| HN - I only said that uncontrolled use of any data without
| any anonymization and without consent might be the source
| of problems with regard to legislation decided BECAUSE too
| many shady businesses abused it. You may not like it but
| then... well... downvote the abusers)
| eddd-ddde wrote:
| I am an entity other than HN.
|
| Am I not allowed to read your comments?
|
| Am I not allowed to learn from reading your comments?
| GeoAtreides wrote:
| You are allowed all rights given to you by (copyright) law,
| while respecting the YC license:
| https://www.ycombinator.com/legal/
| politelemon wrote:
| I'm not sure how to view the embedding, I clicked on the graph,
| narrowed down to a comment, but it only shows the row and not the
| raw array? (Or I've misunderstood)
| neutralino1 wrote:
| Oh you cannot actually view the raw embedding vector, only the
| corresponding row.
|
| Any particular use-case to view the raw embedding?
| politelemon wrote:
| None at all, pure curiosity thanks!
| Alifatisk wrote:
| It would be very cool if one could view the whole 3D embedding
| space
| neutralino1 wrote:
| We debated doing 2D vs 3D and 3D brought a bunch of usability
| issues. We also noticed most SOTA embedding visualizations
| were 2D and already yielded good insights.
| costco wrote:
| Some users comment frequently and uniquely enough that they get
| their own cluster. I also like how there's "Deno vs Node" lol.
| It'd be cool if you could click on a comment and get the most
| similar ones by whatever distance metric you used.
| josh-sematic wrote:
| It's on our roadmap to be able to click on rows and see similar
| ones; thanks for the feedback!
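
The "click a comment, see the most similar ones" idea above can be sketched as a nearest-neighbour lookup over the stored embeddings. The thread does not say which distance metric Airtrain uses, so cosine similarity here is an assumption, and `most_similar` is a hypothetical helper doing a brute-force scan rather than a description of their roadmap feature.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_idx, embeddings, top_k=3):
    """Indices of the top_k rows closest to the query row, best first."""
    scores = [
        (cosine_similarity(embeddings[query_idx], emb), i)
        for i, emb in enumerate(embeddings)
        if i != query_idx  # never return the clicked row itself
    ]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]

# Tiny 2-D example: row 1 points almost the same way as row 0,
# row 2 is orthogonal to it, and row 3 points the opposite way.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(most_similar(0, embeddings, top_k=2))  # → [1, 2]
```

A linear scan like this is fine for a demo; at 500k rows an approximate nearest-neighbour index would likely be the more practical choice.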
| LordGrey wrote:
| I'm not sure about others here, but I occasionally spend time
| typing out a long reply to someone and then simply deleting the
| reply without posting. Most of the time I conclude -- a little
| too late -- that the effort was not worth it.
|
| How nice it would be to have an LLM trained on all of my previous
| writings and simply be able to click a button to indicate "reply
| to this person, please." I know I don't have enough training data
| from HN, and maybe not even from all of the sites I contribute to,
| combined. It is still a nice thought, though.
|
| But: Let's say I do acquire enough training data to have a local
| LLM do exactly what I describe. My volume of "replies" would
| certainly increase. Is that a good thing, on average? If the tool
| became ubiquitous, would it be a good thing for the average
| social media user? Or more pointedly, would it be a good thing
| for consumers of that social media? The cynic in me thinks "no"
| -- the effort required today surely weeds out _some_ idiots....
|
| (Full disclosure, I nearly closed this window without clicking
| the "add comment" button.)
| giancarlostoro wrote:
| Rough view on mobile: the navigation isn't auto-hiding and gets
| in the way.
| neutralino1 wrote:
| Yeah sorry about that, this is a data dashboard. Not optimized
| for mobile viewing.
___________________________________________________________________
(page generated 2024-06-12 23:02 UTC)