[HN Gopher] Show HN: Exploring HN by mapping and analyzing 40M p...
       ___________________________________________________________________
        
       Show HN: Exploring HN by mapping and analyzing 40M posts and
       comments for fun
        
       Author : wilsonzlin
       Score  : 286 points
       Date   : 2024-05-09 12:31 UTC (10 hours ago)
        
 (HTM) web link (blog.wilsonl.in)
 (TXT) w3m dump (blog.wilsonl.in)
        
       | thyrox wrote:
        | Very nice. Since HN data spawns so many such fun projects, there
        | should be a monthly or weekly updated zip file or torrent with
        | this data, which hackers could just download instead of writing
        | a scraper and starting from scratch every time.
        
         | average_r_user wrote:
          | That's a nice idea.
        
         | noman-land wrote:
         | I very much support this idea. Put them on ipfs and/or
         | torrents. Put them on HuggingFace.
        
           | pfarrell wrote:
           | I've had this same thought but was unsure what the licensing
           | for the data would be.
        
         | pfarrell wrote:
         | I have a daily updated dataset that has the HN data split out
         | by months. I've published it on my web page, but it's served
         | from my home server so I don't want to link to it directly.
         | Each month is about 30mb of compressed csv. I've wanted to
         | torrent it, but don't know how to get enough seeders since each
         | month will produce a new torrent file (unless I'm mistaken). If
         | you're interested, send me a message. My email is mrpatfarrell.
         | Use gmail for the domain.
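The monthly split described above can be sketched in a few lines of Python (a hypothetical illustration; the field names follow the HN API, and `items` is any iterable of item dicts):

```python
import csv
import gzip
from collections import defaultdict
from datetime import datetime, timezone

# Columns kept in the per-month CSVs (a subset of the HN API item fields).
FIELDS = ["id", "type", "by", "time", "title", "url", "score"]

def split_by_month(items):
    """Bucket HN items by UTC calendar month, keyed like '2024-05'."""
    buckets = defaultdict(list)
    for item in items:
        ts = datetime.fromtimestamp(item["time"], tz=timezone.utc)
        buckets[ts.strftime("%Y-%m")].append(item)
    return buckets

def write_monthly_csvs(items):
    """Write one gzipped CSV per month; returns the file names written."""
    names = []
    for month, rows in sorted(split_by_month(items).items()):
        name = f"hn-{month}.csv.gz"
        with gzip.open(name, "wt", newline="") as f:
            w = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
            w.writeheader()
            w.writerows(rows)
        names.append(name)
    return names
```

Keeping each month self-contained means a new month only adds one file rather than invalidating the earlier ones.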
        
         | minimaxir wrote:
         | There is a public dataset of Hacker News posts on BigQuery, but
         | it unfortunately has only been updated up to November 2022:
         | https://news.ycombinator.com/item?id=19304326
        
         | zX41ZdbW wrote:
          | It is very easy to get this dataset directly from the HN API.
          | Let me just post it here:
         | 
          | Table definition:
          | 
          |     CREATE TABLE hackernews_history
          |     (
          |         update_time DateTime DEFAULT now(),
          |         id UInt32,
          |         deleted UInt8,
          |         type Enum('story' = 1, 'comment' = 2, 'poll' = 3,
          |                   'pollopt' = 4, 'job' = 5),
          |         by LowCardinality(String),
          |         time DateTime,
          |         text String,
          |         dead UInt8,
          |         parent UInt32,
          |         poll UInt32,
          |         kids Array(UInt32),
          |         url String,
          |         score Int32,
          |         title String,
          |         parts Array(UInt32),
          |         descendants Int32
          |     )
          |     ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
          | 
          | A shell script:
          | 
          |     BATCH_SIZE=1000
          |     TWEAKS="--optimize_trivial_insert_select 0
          |         --http_skip_not_found_url_for_globs 1
          |         --http_make_head_request 0
          |         --engine_url_skip_empty_files 1
          |         --http_max_tries 10
          |         --max_download_threads 1
          |         --max_threads $BATCH_SIZE"
          | 
          |     rm -f maxitem.json
          |     wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json
          | 
          |     clickhouse-local --query "
          |         SELECT arrayStringConcat(groupArray(number), ',')
          |         FROM numbers(1, $(cat maxitem.json))
          |         GROUP BY number DIV ${BATCH_SIZE}
          |         ORDER BY any(number) DESC" |
          |     while read ITEMS
          |     do
          |         echo $ITEMS
          |         clickhouse-client $TWEAKS --query "
          |             INSERT INTO hackernews_history
          |             SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
          |     done
         | 
         | It takes a few hours to download the data and fill the table.
        
           | zX41ZdbW wrote:
            | Also, proof that it is updated in real time:
            | https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
        
       | graiz wrote:
        | Would be cool to see member similarity. Finding like-minded
        | commenters/posters may help discover content that would be of
        | interest.
        
         | vsnf wrote:
         | Reminds me of a similar project a few months ago whose purpose
         | was to unmask alt accounts. It wasn't well received as I
         | recall.
        
         | noman-land wrote:
         | Accidental dating app.
        
           | internetter wrote:
           | > Accidental dating app.
           | 
           | Possibly the greatest indicator of social startup success.
        
         | naveen99 wrote:
         | We implemented member similarity in our hacker read app:
         | https://apps.apple.com/in/app/hacker-read/id6479697844
         | 
         | Once you register on ios, you can also login through webapp:
         | https://hn.garglet.com
         | 
         | probably not ready for a hacker news hug of death yet, but you
         | can try.
        
       | ed_db wrote:
       | This is amazing, the amount of skill and knowledge involved is
       | very impressive.
        
         | wilsonzlin wrote:
         | Thank you for the kind words!
        
       | seanlinehan wrote:
       | It was not obvious at first glance to me, but the actual app is
       | here: https://hn.wilsonl.in/
        
         | uncertainrhymes wrote:
         | I'm curious if the link to the landing page was intentionally
         | near the end. Only the people who actually read it would go to
         | the site.
         | 
         | (That's not a dig, I think it's a good idea.)
        
         | bravura wrote:
          | 1) It doesn't appear that search links are shareable, or that
          | the query terms are in the URL.
          | 
          | 2) Are you embedding the search phrases word by word? And
          | using the same model the documents used? Because I searched
          | for "lead generation", which any decent non-unigram embedding
          | should understand, but I got results for lead poisoning.
        
         | oschvr wrote:
          | I found myself and my post there! Nice
        
       | freediver wrote:
       | If you have a blog, add an RSS feed :)
        
         | breck wrote:
         | I tried to fetch his RSS too! :)
         | 
         | Turns out, there's only 1 post so far on his blog.
         | 
         | Hoping for more! This one is great.
        
       | CuriouslyC wrote:
       | Good example of data engineering/MLops for people who aren't
       | familiar.
       | 
       | I'd suggest using HDBScan to generate hierarchical clusters for
       | the points, then use a model to generate names for interior
       | clusters. That'll make it easy to explore topics out to the
       | leaves, as you can just pop up refinements based on the
       | connectivity to the current node using the summary names.
       | 
       | The groups need more distinct coloring, which I think having
       | clusters could help with. The individual article text size should
       | depend on how important or relevant the article is, either in
       | general or based on the current search. If you had more interior
       | cluster summaries that'd also help cut down on some of the text
       | clutter, as you could replace multiple posts with a group summary
       | until more zoomed in.
        
         | wilsonzlin wrote:
          | Thanks for the great pointers! I didn't get time to look into
          | hierarchical clustering unfortunately, but it's on my TODO
          | list. Your comment about making the map clearer is great, and
          | I think there are a lot of low-hanging approaches for
          | improving it. Another thing for the TODO list :)
        
       | NeroVanbierv wrote:
       | Really love the island map! But the automatic zooming on the map
       | doesn't seem very relevant. E.g. try typing "openai" - I can't
       | see anything related to that query in that part of the map
        
         | NeroVanbierv wrote:
         | Ok I just noticed there is a region "OpenAI" in the north-west,
         | but for some reason it zooms in somewhere close to "Apple"
         | (middle of the island) when I type the query
        
         | oersted wrote:
          | Indeed, I've long been intrigued by the idea of rendering such
         | clustering maps more like geographic maps for better
         | readability.
         | 
         | It would be cool to have analogous continents, countries, sub-
         | regions, roads, different-sized settlements, and significant
         | landmarks... This version looks great at the highest zoom
         | level, but rapidly becomes hard to interpret as you zoom in,
         | same as most similar large embedding or graph visualizations.
        
         | wilsonzlin wrote:
         | Thanks! Yeah sometimes there are one or two "far" away results
         | which make the auto zoom seem strange. It's something I'd like
         | to tune, perhaps zooming to where most but not all results are.
        
           | luke-stanley wrote:
           | Often embeddings are not so good for comparing similarity of
           | text. A cross-encoder might be a good alternative, perhaps as
           | a second-pass, since you already have the embeddings.
           | https://www.sbert.net/docs/pretrained_cross-encoders.html
           | Pairwise, this can be quite slow, but as a second pass, it
            | might be much higher quality. Obviously this gets into LLM
            | territory, but the language models for this can be small
            | and more reliable than cosine similarity on embeddings.
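The two-pass scheme suggested above might look like this in outline; `rerank_fn` stands in for a cross-encoder such as sentence-transformers' `CrossEncoder.predict`, and all names here are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_search(query_vec, query_text, docs, rerank_fn, k=50, final_k=10):
    """Stage 1: cheap cosine over precomputed embeddings narrows to k
    candidates. Stage 2: a slower pairwise scorer (e.g. a cross-encoder)
    reranks just those k. `docs` is a list of (text, embedding) pairs and
    `rerank_fn(query, text)` returns a relevance score."""
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]),
                        reverse=True)[:k]
    reranked = sorted(candidates, key=lambda d: rerank_fn(query_text, d[0]),
                      reverse=True)
    return [text for text, _ in reranked[:final_k]]
```

Because the expensive scorer only sees k candidates, the pairwise cost stays bounded no matter how large the corpus is.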
        
       | paddycap wrote:
       | Adding a subscribe feature to get an email with the most recent
       | posts in a topic/community would be really cool. One of my
       | favorite parts of HN is the weekly digest I get in my inbox; it
       | would be awesome if that were tailored to me.
       | 
       | What you've built is really impressive. I'm excited to see where
       | this goes!
        
         | wilsonzlin wrote:
          | Thanks! Yeah, if there are enough interested users I'd love
          | to turn this into a live service. Would an email subscription
          | to a set of communities you pick be something you'd be
          | interested in?
        
       | oersted wrote:
       | This is a surprisingly big endeavour for what looks like an
       | exploratory hobby project. Not to minimize the achievement, very
       | cool, I'm just surprised by how much was invested into it.
       | 
        | They used 150 GPUs and developed two custom systems (db-rpc and
        | queued) for inter-server communication, and this was just to
        | compute the embeddings; there's a lot of other work and
        | computation surrounding it.
       | 
       | I'm curious about the context of the project, and how someone
       | gets this kind of funding and time for such research.
       | 
       | PS: Having done a lot of similar work professionally (mapping
       | academic paper and patent landscapes), I'm not sure if 150 GPUs
       | were really needed. If you end up just projecting to 2D and
       | clustering, I think that traditional methods like bag-of-words
       | and/or topic modelling would be much easier and cheaper, and the
       | difference in quality would be unnoticeable. You can also use
       | author and comment-thread graphs for similar results.
        
         | alchemist1e9 wrote:
         | The author is definitely very skilled. I find it interesting
         | they submit posts on HN but haven't commented since 2018! And
         | then embarked on this project.
         | 
          | As far as funding/time, one possibility is that they are
          | between endeavors/employment and it's self-funded, as they
          | have had a financially successful career or business. They
          | were very efficient with GPU utilization, so it probably
          | didn't cost that much.
        
           | wilsonzlin wrote:
           | Thanks! Haha yeah I'm trying to get into the habit of writing
           | about and sharing the random projects I do more often. And
           | yeah the cost was surprisingly low (in the hundreds of
           | dollars), so it was pretty accessible as a hobby project.
        
         | wilsonzlin wrote:
         | Hey, thanks for the kind words. I wasn't able to mention the
         | costs in the post (might follow up in the future) but it was in
         | the hundreds of dollars, so was reasonably accessible as a
          | hobby project. The GPUs were surprisingly cheap, and I only
          | scaled up because I was impatient :) --- the entire cluster
          | only ran for a few hours.
         | 
         | Do you have any links to your work? They sound interesting and
         | I'd like to read more about them.
        
           | oersted wrote:
           | "Hundreds of dollars" sounds a bit painful as an EU engineer
           | and entrepreneur :), but I guess it's all relative. We would
           | think twice about investing this much manpower and compute
           | for such an exploratory project even in a commercial setting
           | if it was not directly funded by a client.
           | 
           | But your technical skill is obvious and very impressive.
           | 
           | If you want to read more, my old bachelor's thesis is
           | somewhat related, from when we only had word embeddings and
           | document embeddings were quite experimental still:
            | https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...
           | 
            | I've done a lot of follow-up work in my startup Scitodate,
            | which
           | includes large-scale graph and embedding analysis, but we
           | haven't published most of it for now.
        
             | wilsonzlin wrote:
             | Thanks for sharing, I'll have a read, looks very relevant
             | and interesting!
        
             | b800h wrote:
             | As an EU-based engineer, you wouldn't do this, it's a
             | massive GDPR violation (failure to notify data subjects of
             | data processing), which does actually have
             | extraterritoriality, although I somehow doubt that the
             | information commissioners are going to be coming after OP.
        
         | PaulHoule wrote:
         | (1) Definitely you could use a cheaper embedding and still get
         | pretty good results
         | 
          | (2) I apply classical ML (say, a probability-calibrated SVM)
          | to embeddings like that and get good results for
          | classification and clustering at over 100x the speed of fine-
          | tuning an LLM.
        
       | ashu1461 wrote:
       | This is pretty great.
       | 
        | Feature request: Is it possible to show in the graph how
        | popular the topic/subtopic/article is?
        | 
        | So that we can do an educated exploration of the graph around
        | what was upvoted and what was not?
        
         | wilsonzlin wrote:
         | Thanks! Do you mean within the sentiment/popularity analysis
         | graph? Or the points and topics within the map?
        
       | oersted wrote:
       | Here's a great tool that does almost exactly the same thing for
       | any dataset: https://github.com/enjalot/latent-scope
       | 
       | Obviously the scale of OP's project adds a lot of interesting
       | complexity, this tool cannot handle that, but it's great for
       | medium-sized datasets.
        
       | xnx wrote:
        | As a novice, is there a benefit to using a custom Node.js
        | downloader? When I did my download of the 40 million Hacker News
        | API items, I used "curl --parallel".
       | 
       | What I would like to figure out is the easiest way to go from the
       | API straight into a parquet file.
        
         | wilsonzlin wrote:
          | I think your curl approach would work just as well, if not
          | better. My instinct was to reach for Node.js out of
          | familiarity, but curl is fast and, given the IDs are
          | sequential, something like `seq 0 $MAX_ID | parallel curl -sO
          | https://hacker-news.firebaseio.com/v0/item/{}.json` would be
          | pretty simple and fast. I did end up needing more logic
          | though, so Node.js ultimately came in handy.
         | 
         | As for the Arrow file, I'm not sure unfortunately. I imagine
         | there are some difficulties because the format is columnar, so
         | it probably wants a batch of rows (when writing) instead of one
         | item at a time.
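The batching point above is the crux: columnar writers such as pyarrow's `ParquetWriter` consume column arrays per row group, not one item at a time. A minimal sketch of the buffering step (illustrative names, with no pyarrow dependency shown):

```python
def rows_to_columns(batch, fields):
    """Transpose a batch of row dicts into per-field columns, the shape a
    columnar writer (e.g. a Parquet row group) expects."""
    return {f: [row.get(f) for row in batch] for f in fields}

def batched(iterable, size):
    """Yield lists of up to `size` items; columnar formats like getting
    their rows in chunks like this rather than one at a time."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

In a real pipeline, each yielded batch would be turned into columns and handed to the Parquet writer as one row group.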
        
       | password4321 wrote:
       | Related a month ago:
       | 
       |  _A Peek inside HN: Analyzing ~40M stories and comments_
       | 
       | https://news.ycombinator.com/item?id=39910600
        
       | chossenger wrote:
       | Awesome visualisation, and great write-up. On mobile (in
       | portrait), a lot of longer titles get culled as their origin
       | scrolls off, with half of it still off the other side of the
       | screen - wonder if it'd be worth keeping on rendering them until
       | the entire text field is off screen (especially since you've
       | already got a bounding box for them).
       | 
        | While using it, I stumbled upon [1], which reflects your
        | comments on comment sentiment.
       | 
       | This also reminded me of [2] (for which the site itself had
       | rotted away, incidentally) - analysing HN users' similarity by
       | writing style.
       | 
       | [1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2]
       | https://news.ycombinator.com/item?id=33755016
        
         | wilsonzlin wrote:
         | Thanks for the kind words, and raising that problem --- I've
         | added it as an issue to fix.
         | 
         | Thanks for sharing that article, it was an interesting read. It
         | was cool how deep the analysis went with a few simple
         | statistical methods.
        
       | abe94 wrote:
       | This is impressive work, especially for a one man show!
       | 
       | One thing that stood out to me was the graph of the sentiment
       | analysis over time, I hadn't seen something like that before and
       | it was interesting to see it for Rust. What were the most
       | positive topics over time? And were there topics that saw very
       | sudden drops?
       | 
       | I also found this sentence interesting, as it rings true to me
       | about social media "there seems to be a lot of negative sentiment
       | on HN in general." It would be cool to see a comparison of
       | sentiment across social media platforms and across time!
        
         | wilsonzlin wrote:
         | Thanks! Yeah I'd like to dive deeper into the sentiment aspect.
         | As you say it'd be interesting to see some overview, instead of
         | specific queries.
         | 
         | The negative sentiment stood out to me mostly because I was
         | expecting a more "clear-cut" sentiment graph: largely neutral-
         | positive, with spikes in the positive direction around positive
         | posts and negative around negative posts. However, for almost
         | all my queries, the sentiment was almost always negative. Even
         | positive posts apparently attracted a lot of negativity
         | (according to the model and my approach, both of which could be
         | wrong). It's something I'd like to dive deeper into, perhaps in
         | a future blog post.
        
           | walterbell wrote:
           | Great work! Would you consider adding support for search-via-
           | url, e.g. https://hn.wilsonl.in/?q=sentiment+analysis. It
           | would enable sharing and bookmarks of stable queries.
        
             | wilsonzlin wrote:
             | Thanks for the suggestion, I've just added the feature:
             | 
             | https://hn.wilsonl.in/s/sentiment%20analysis
        
           | luke-stanley wrote:
            | I did something related for my ChillTranslator project,
            | which translates spicy HN comments into calm variations. It
            | has a GGUF model that runs easily and quickly, but it's
            | early days. I did it with a much smaller set of data: I used
            | LLMs to make calm variations and an algorithm to pick the
            | closest, least spicy one to make the synthetic training
            | data, then used Phi 2. I used Detoxify, and since OpenAI's
            | sentiment analysis is free, I use that to verify Detoxify
            | has correctly identified spicy comments, then generate a
            | calm pair. I do worry that HN could implode/degrade if there
            | isn't a good balance in the comments and posts that people
            | come here for. Maybe I can use your sentiment data to mine
            | faster and generate more pairs. I've only done an initial
            | end-to-end test so far (which works!). The model so far is
            | not as high quality as I'd like, but I've not used Phi 3 on
            | it yet and I've only used a very small fine-tune dataset.
            | The file is here though:
            | https://huggingface.co/lukestanley/ChillTranslator
            | I've had no feedback from anyone on it, though I did have a
            | 404 in my Show HN post!
        
           | deadbabe wrote:
           | Anecdotally, I think anyone who reads HN for a while will
           | realize it to be a negative, cynical place.
           | 
           | Posts written in sweet syrupy tones wouldn't do well here,
           | and jokes are in short supply or outright banned. Most people
           | here also seem to be men. There's always someone shooting you
           | down. And after a while, you start to shoot back.
        
             | xanderlewis wrote:
             | (Without wanting to sound negative or cynical) I don't
             | think it is, but maybe I haven't been here long enough to
             | notice. It skews towards technical and science and
             | technology-minded people, which makes it automatically a
             | bit 'cynical', but I feel like 95% of commenters are doing
             | so at least in good faith. The same cannot be said of many
             | comparable discussion forums or social media websites.
             | 
             | Jokes are also not banned; I see plenty on here. Low-effort
             | ones and chains of unfunny wordplay or banter seem to be
             | frowned upon though. And that makes it cleaner.
        
               | sethammons wrote:
               | I've been here a hot minute and I agree with you. Lots of
               | good faith. Lots of personal anecdotes presumably
               | anchored in experience. Some jokes are really funny, just
               | not reddit-style. Similarly, no slashdot quips generally,
               | such as "first post" or "i, for one, welcome our new HN
               | sentiment mapping robot overlords." Sometimes things get
                | downvoted that shouldn't be, but most of the flags I
                | see are well deserved, and I vouch for ones that I
                | think are not flag-worthy.
        
               | goles wrote:
                | I wonder how much of a person's impression of this is
                | formed by their browsing habits.
                | 
                | As a parent comment mentions, big threads can be a bit
                | of a mess, but usually only for the first couple of
                | hours. Comments made in the spirit of HN tend to bubble
                | up, and off-topic, rude comments and bad jokes tend to
                | percolate down over the course of hours. A number of
                | threads that tend to spiral also get manually detached,
                | which takes time to clean up.
                | 
                | Someone who isn't familiar with how HN works and is
                | consistently early to stories that attract a lot of
                | comments is reading an almost entirely different site
                | than someone who just catches up at the end of the day.
        
               | fragmede wrote:
                | Some of the more negative threads will get flagged and
                | detached, and by the end of the day a casual browse
                | through the comments isn't even going to come across
                | them. E.g. something about the situation in the Middle
                | East is going to attract a lot of attention.
        
             | darby_eight wrote:
             | > Anecdotally, I think anyone who reads HN for a while will
             | realize it to be a negative, cynical place.
             | 
              | I don't think this is particularly unique to HN.
              | Anonymous forums tend to attract contrarian assholes.
              | Perhaps this place is more, erm, poorly socially adapted
              | than the general population, but I don't see it as very
              | far outside the norm, aside from the average wealth of
              | the posters.
        
             | holoduke wrote:
              | Really? Hmm, I think HN is a place with, on average,
              | people of above-average intelligence: people who
              | understand that their opinion is not the only one. I
              | rarely have issues with people here. Might also be
              | because we are all in the same bubble here.
        
             | flir wrote:
             | I think it's the engineering mindset. You're always trying
             | to figure out what's wrong with an idea, because you might
             | be the poor bastard that ends up having to build it. Less
             | costly all round if you can identify the flaw now, not
             | halfway through sprint 7. After a while it bleeds into
             | everything you do.
        
           | abakker wrote:
            | It's so interesting that in Likert-scale surveys I tend to
            | see huge positivity/agreement bias, but comments tend to be
            | critical/negative. I think something about the format of
            | the feedback skews the graph in general.
           | 
           | On HN, my theory is that positivity is the upvotes, and
           | negativity/criticality is the discussion.
           | 
           | Personally, my contribution to your effort is that I would
           | love to see a tool that could do this analysis for me over a
           | dataset/corpus of my choosing. The code is nice, but it is a
           | bit beyond me to follow in your footsteps.
        
           | dylan604 wrote:
            | The sentiment issue is a curious one to me. For example, a
            | lot of humans I interact with who are not devs take my
            | direct questioning or critical responses to be "negative"
            | when there is no negative intent at all. Pointing out that
            | something doesn't work, or anything else the dev community
            | encounters on a daily basis, isn't an inherently negative
            | sentiment; it's just pointing out the issues. Is the meme-
            | like helicopter parent constantly doling out praise the
            | baseline for "positive", so that anything different reads
            | as negative? Not every piece of art needs to be hung on the
            | fridge door, and providing constructive criticism for
            | improvement is oh so often framed as negative. That does
            | the world no favors.
           | 
           | Essentially, I'm not familiar with HuggingFace or any models
           | in this regard. But if they are trained from the socials,
           | then it seems skewed from the start to me.
           | 
           | Also, fully aware that this comment will probably be viewed
           | as negative based on stated assumptions.
           | 
           | edit: reading further down the comments, clearly I'm not the
           | first with these sentiments.
        
         | walterbell wrote:
         | _> sentiment across social media platforms and across time!_
         | 
         | Also time zones and weekday/weekend.
        
         | kcorbitt wrote:
         | I actually did a blog post a few months ago where I analyzed HN
         | commenter sentiment across AI, blockchain, remote work and
         | Rust. The final graph at the very end of the post is the
         | relevant one on this topic!
         | 
         | https://openpipe.ai/blog/hn-ai-crypto
        
         | necovek wrote:
         | It's really unfortunate the HN API does not provide votes on
         | comments: I wonder if and how sentiment analysis would change
         | if they were weighted by votes/downvotes?
         | 
          | My unsupported take is that engineers are mostly critical,
          | but will +1 positive feedback instead of repeating it, as
          | they might for criticism :)
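If vote counts were available, the weighting would be a one-liner; a sketch of the idea, with a hypothetical data shape of `(sentiment, votes)` pairs:

```python
def weighted_sentiment(comments):
    """Vote-weighted mean sentiment. Each comment is a (sentiment, votes)
    pair; votes act as implicit +1s that plain comment-counting misses,
    and every comment counts at least once."""
    total = sum(max(votes, 1) for _, votes in comments)
    if total == 0:
        return 0.0
    return sum(s * max(v, 1) for s, v in comments) / total
```

A highly upvoted positive comment would then outweigh several unvoted critical ones, which is exactly the shift the comment above speculates about.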
        
       | gaauch wrote:
       | A long term side project of mine is to try to build a
       | recommendation algorithm trained on HN data.
       | 
        | I trained a model to predict whether a given post will reach
        | the front page, get flagged, etc. I collected over 1,000 RSS
        | feeds and rank the RSS entries with my ranking models.
       | 
       | I submit the high ranking entries on HN to test out my models and
       | I can reach the front page consistently sometimes having multiple
       | entries on the front page at a given time.
       | 
       | I also experiment with user->content recommendation, for that I
       | use comment data for modeling interactions between users and
       | entries, which seems to work fine.
       | 
        | The only problem I have is that I get a lot of 'out of
        | distribution' content in my RSS feeds, which causes my ranking
        | models to get 'confused'. For this I trained models to predict
        | whether a given entry belongs on HN or not. On top of that, I
        | have some tagging models trained on data I scraped from
        | lobste.rs and hand-annotated.
       | 
       | I had been working on this on and off for the last 2 years or so,
       | this account is not my main, and just one I created for testing.
       | 
       | AMA
        
         | saganus wrote:
          | Did you find whether submitted entries are more likely to
          | reach the front page depending on the title or the content?
          | 
          | i.e. do HN users upvote more based on the title of the
          | article, or on actually reading it?
        
           | gaauch wrote:
           | I tried making an LLM generate different titles for a given
           | article and compared their ranking scores. There seems to be
           | a lot of variation in the ranking scores based on the way the
            | title is worded. Titles that are more likely to generate
            | 'outrage' seem to get ranked higher, but at the same time
            | that increases the is_hn_flagged score, which tries to
            | predict whether an entry will get flagged.
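The trade-off described above, ranking score versus flag probability, suggests a constrained pick; a toy sketch with both models stubbed out (all names illustrative):

```python
def best_title(variants, score_fn, flag_fn, flag_limit=0.5):
    """Pick the title variant with the highest predicted ranking score
    whose predicted flag probability stays under a limit. `score_fn` and
    `flag_fn` stand in for the front-page and is_hn_flagged models."""
    ok = [t for t in variants if flag_fn(t) < flag_limit]
    if not ok:
        return None
    return max(ok, key=score_fn)
```

This filters out the outrage-bait variants before maximizing the ranking score, rather than letting the flag risk ride along with the upside.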
        
       | swozey wrote:
       | I'm.. shocked there's been 40 million posts. Wow.
       | 
       | Really neat work
       | 
       | edit: Also had no idea HN went back to 2006.
       | https://news.ycombinator.com/item?id=1
       | 
       | edit2: PG wrote this? https://news.ycombinator.com/item?id=487171
        
       | fancy_pantser wrote:
       | HN submissions and comments are very different on weekends (and
       | US holidays). Your data could explore and quantify this in some
       | very interesting ways!
        
       | callalex wrote:
       | "Cloud Computing" "us-east-1 down"
       | 
       | This gave me a belly laugh.
        
       | replete wrote:
       | I think this is easily the coolest post I've seen on HN this year
        
       | minimaxir wrote:
       | A modern recommendation for UMAP is Parametric UMAP
       | (https://umap-
       | learn.readthedocs.io/en/latest/parametric_umap....), which
       | instead trains a small Keras MLP to perform the dimensionality
       | reduction down to 2D by minimizing the UMAP loss. The advantage
       | is that this model is small and can be saved and reused to
       | predict on unknown new data (a traditionally trained UMAP model
       | is large), and training is theoretically much faster because GPUs
       | are GPUs.
       | 
       | The downside is that the implementation in the Python UMAP
       | package isn't great and creates/pushes the whole expanded
       | node/edge dataset to the GPU, which means you can only train it
       | on about 100k embeddings before going OOM.
       | 
       | The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all
       | unsupervised is so useful that I'm tempted to figure out a more
       | scalable implementation of Parametric UMAP.
        
         | bravura wrote:
         | From a quick glance, it appears that it's because the
         | implementation pushes the entire graph (all edges) to the GPU.
         | Sampling of edges during training could alleviate this.
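The edge-sampling idea can be sketched in numpy: rather than materializing the whole UMAP graph on the GPU, draw a minibatch of edges per optimization step and only move those endpoints to the device. All names and sizes below are illustrative, not the actual umap-learn internals.

```python
# Sketch: per-step edge minibatching for a UMAP-style optimizer,
# so the full edge set never needs to live on the GPU at once.
import numpy as np

rng = np.random.default_rng(0)
n_points, n_edges, dim = 10_000, 200_000, 64
embeddings = rng.normal(size=(n_points, dim)).astype(np.float32)

# Weighted edge list (head, tail), as UMAP builds from the kNN graph.
edges = rng.integers(0, n_points, size=(n_edges, 2))
weights = rng.random(n_edges).astype(np.float32)
probs = weights / weights.sum()  # sample edges proportional to weight

def sample_edge_batch(batch_size=4096):
    """Draw one minibatch of edges; only these endpoints would be
    transferred to the GPU for this training step."""
    idx = rng.choice(n_edges, size=batch_size, p=probs)
    batch = edges[idx]
    heads = embeddings[batch[:, 0]]
    tails = embeddings[batch[:, 1]]
    return heads, tails

heads, tails = sample_edge_batch()
print(heads.shape, tails.shape)  # (4096, 64) (4096, 64)
```

This is essentially what a DataLoader-style PyTorch port would do, trading per-step transfer overhead for bounded GPU memory.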
        
           | minimaxir wrote:
           | Indeed, TensorFlow likes pushing everything to the GPU by
           | default whereas many PyTorch DL implementations encourage
           | feeding data from the CPU to the GPU as needed with a
           | DataLoader.
           | 
           | There have been attempts at a PyTorch port of Parametric UMAP
           | (https://github.com/lmcinnes/umap/issues/580) but nothing as
           | good.
        
             | bravura wrote:
             | Looks like there is a little motion on this topic:
             | 
             | https://github.com/lmcinnes/umap/pull/1103
        
         | Der_Einzige wrote:
          | It exists in cuML with a fast GPU implementation. Not sure
          | why cuML is so poorly known though...
        
           | minimaxir wrote:
           | I'll give that a look: the feature set of GPU-accelerated ops
           | seems just up my alley for this pipeline:
           | https://github.com/rapidsai/cuml
           | 
            | EDIT: looking through the docs, it's just GPU-accelerated
            | UMAP, not a parametric UMAP which trains a NN model. That's
            | easy to work around, though, by training a new NN model to
            | predict the reduced-dimensionality values, minimizing
            | RMSE.
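The workaround described above can be sketched as fitting a small MLP to regress from high-dimensional embeddings to 2-D coordinates by minimizing mean squared error. This is a pure-numpy stand-in with synthetic data; a real version would use PyTorch or Keras and the actual cuML UMAP output as targets.

```python
# Sketch: train a tiny one-hidden-layer MLP to predict 2-D coordinates
# from embeddings by minimizing MSE (Y here is synthetic, standing in
# for UMAP's 2-D output).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 32)).astype(np.float32)   # "embeddings"
proj = rng.normal(size=(32, 2)).astype(np.float32)
Y = np.tanh(X @ proj)                                # stand-in 2-D targets

W1 = rng.normal(scale=0.1, size=(32, 64)).astype(np.float32)
b1 = np.zeros(64, dtype=np.float32)
W2 = rng.normal(scale=0.1, size=(64, 2)).astype(np.float32)
b2 = np.zeros(2, dtype=np.float32)
lr = 0.05

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred0 = forward(X)
initial_mse = float(((pred0 - Y) ** 2).mean())

for _ in range(300):
    h, pred = forward(X)
    grad_out = 2 * (pred - Y) / len(X)        # dMSE/dpred
    gW2 = h.T @ grad_out
    gb2 = grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T * (1 - h ** 2)   # backprop through tanh
    gW1 = X.T @ grad_h
    gb1 = grad_h.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
final_mse = float(((pred - Y) ** 2).mean())
print(initial_mse, "->", final_mse)  # MSE drops substantially
```

Once trained, such a model can project unseen embeddings into the existing 2-D map cheaply, which is the main thing Parametric UMAP buys you.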
        
             | minimaxir wrote:
             | Tested it out and the UMAP implementation with this library
              | is very, very fast compared to Parametric UMAP: running it
              | on 100k embeddings took about 7 seconds, whereas the same
              | pipeline on the same GPU took about half an hour. I will
             | definitely be playing around with it more.
        
               | lmeyerov wrote:
               | Yeah we advise Graphistry users to keep umap training
               | sets to < 100k rows, and instead focus on doing careful
               | sampling within that, and multiple models for going
               | beyond that. It'd be more accessible for teams if we
               | could raise the limit, but quality wise, it's generally
               | fine. Security logs, customer activity, genomes, etc.
               | 
                | RAPIDS umap is darn impressive, though. It did the
                | job, so instead of improving it further, our focus
                | shifted to optimizing the ingest pipeline that feeds
                | umap: we released cu_cat, a GPU-accelerated automated
                | feature-engineering library, to get all that data into
                | umap. RAPIDS cudf helps take care of the intermediate
                | IO and wrangling in between.
               | 
               | The result is we can now umap and interactively visualize
               | most real-world large datasets, database query results,
               | and LLM embeddings that pygraphistry & louie.ai users
               | encounter in seconds. Many years to get here, and now it
               | is so easy!
        
       | dfworks wrote:
       | If anybody found this interesting and would like some further
       | reading, the paper below employed a similar strategy to analyse
       | inauthentic content/disinformation on Twitter.
       | 
       | https://files.casmconsulting.co.uk/message-based-community-d...
       | 
       | If you would like to read about my largely unsuccessful
       | recreation of the paper, you can do so here -
       | https://dfworks.xyz/blog/partygate/
        
       | Lerc wrote:
       | A suggestion for analysis:
       | 
       | Compare topics/sentiment etc. by number of users and by number of
       | posts.
       | 
       | Are some topics dominated by a few prolific posters, positively
       | or negatively?
       | 
       | Also, how does one separate negative/positive sentiment from
       | criticism/advocacy?
       | 
       | How hard is it to detect positive criticism, or enthusiastic
       | endorsement of an acknowledged bad thing?
        
       | nojvek wrote:
       | I'm impressed with the map component in canvas. It's very
       | smooth, with dynamic zoom, and Google Maps-like.
       | 
       | Gonna dig more into it.
       | 
       | Exemplary Show HN! We need more of this.
        
       | datguyfromAT wrote:
       | What a great read! Thanks for taking the time and effort to
       | provide the insight into your process.
        
       | gsuuon wrote:
       | This is super cool! Both the writeup and the app. It'd be great
       | if the search results linked to the HN story so we can check out
       | the comments.
        
       | jxy wrote:
       | > We can see that in this case, where perhaps the X axis
       | represents "more cat" and Y axis "more dog", using the euclidean
       | distance (i.e. physical distance length), a pitbull is somehow
       | more similar to a Siamese cat than a "dog", whereas intuitively
       | we'd expect the opposite. The fact that a pitbull is "very dog"
       | somehow makes it closer to a "very cat". Instead, if we take the
       | angle distance between lines (i.e. cosine distance, or 1 minus
       | angle), the world makes sense again.
       | 
       | Typically the vectors are normalized, unlike what's shown in
       | this demonstration.
       | 
       | With normalized vectors, the euclidean distance measures the
       | distance between the two end points of the respective vectors,
       | while the cosine similarity measures the length of one vector
       | projected onto the other.
        
         | GeneralMayhem wrote:
         | The issue with normalization is that you lose a degree of
         | freedom - which when you're visualizing, effectively means
         | losing a dimension. Normalized 2d vectors are really just 1d
         | vectors; if you want to show a 2d relationship, now you have to
         | use 3d vectors (so that you have 2 degrees of freedom again).
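The quoted cat/dog example and the normalization point above can be made concrete with a few lines of numpy. The 2-D vectors here are made up for illustration (x = "more cat", y = "more dog").

```python
# Numeric version of the cat/dog example: euclidean vs. cosine distance,
# and what L2 normalization changes. Vectors are illustrative only.
import numpy as np

dog     = np.array([0.0, 1.0])   # modest magnitude, purely dog-directed
pitbull = np.array([0.0, 8.0])   # "very dog": same direction, much larger
siamese = np.array([6.0, 5.0])   # strongly cat-leaning, also large

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_dist(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def norm(v):
    return v / np.linalg.norm(v)

# Raw euclidean: the pitbull looks closer to the siamese than to "dog".
print(euclidean(pitbull, siamese) < euclidean(pitbull, dog))          # True

# Cosine distance restores the intuitive ordering.
print(cosine_dist(pitbull, dog) < cosine_dist(pitbull, siamese))      # True

# After L2 normalization, euclidean agrees with cosine, since
# ||a_n - b_n||^2 = 2 - 2*cos(a, b) is monotone in cosine distance.
print(euclidean(norm(pitbull), norm(dog))
      < euclidean(norm(pitbull), norm(siamese)))                      # True
```

This also illustrates the degree-of-freedom point: once normalized, these 2-D vectors are fully described by a single angle.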
        
       | cyclecount wrote:
       | I can't tell from the documentation on GitHub: does the API
       | expose the flagged/dead posts? It would be interesting to see
       | statistics on what's been censored lately.
        
       | coolspot wrote:
       | Absolutely wonderful project and even more so the writeup!
       | 
       | Feedback: on my iOS phone, once you select a dot on the map,
       | there is no way to unselect it. The preview card for some
       | articles takes the full screen, so I can't even tap another dot.
       | Maybe add a "cross" icon to the preview card, or make it so that
       | tapping outside the card hides the whole card strip?
        
       | Igor_Wiwi wrote:
       | How much did you pay to generate those embeddings?
        
       | sourcepluck wrote:
       | Where is Lisp?! I thought it was a verifiable (urban) legend
       | around these parts that this forum is obsessed with Lisp..?
        
         | pinkmuffinere wrote:
         | Maybe lisp is so niche that even a rather small interest makes
         | HN relatively lispy?
        
       | gitgud wrote:
       | Very cool! I was hoping to be able to navigate to the HN post
       | from the map, though. Is that possible?
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:00 UTC)