[HN Gopher] Show HN: Exploring HN by mapping and analyzing 40M p...
___________________________________________________________________
Show HN: Exploring HN by mapping and analyzing 40M posts and
comments for fun
Author : wilsonzlin
Score : 286 points
Date : 2024-05-09 12:31 UTC (10 hours ago)
(HTM) web link (blog.wilsonl.in)
(TXT) w3m dump (blog.wilsonl.in)
| thyrox wrote:
| Very nice. Since HN data spawns so many such fun projects, there
| should be a monthly or weekly updates zip file or torrent with
| this data, which hackers can just download instead of writing a
| scraper and starting from scratch all the time.
| average_r_user wrote:
| that's a nice idea
| noman-land wrote:
| I very much support this idea. Put them on ipfs and/or
| torrents. Put them on HuggingFace.
| pfarrell wrote:
| I've had this same thought but was unsure what the licensing
| for the data would be.
| pfarrell wrote:
| I have a daily updated dataset that has the HN data split out
| by months. I've published it on my web page, but it's served
| from my home server so I don't want to link to it directly.
| Each month is about 30 MB of compressed CSV. I've wanted to
| torrent it, but don't know how to get enough seeders since each
| month will produce a new torrent file (unless I'm mistaken). If
| you're interested, send me a message. My email is mrpatfarrell.
| Use gmail for the domain.
| minimaxir wrote:
| There is a public dataset of Hacker News posts on BigQuery, but
| it unfortunately has only been updated up to November 2022:
| https://news.ycombinator.com/item?id=19304326
| zX41ZdbW wrote:
| It is very easy to get this dataset directly from HN API. Let
| me just post it here:
|
| Table definition:
|
|     CREATE TABLE hackernews_history
|     (
|         update_time DateTime DEFAULT now(),
|         id UInt32,
|         deleted UInt8,
|         type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
|         by LowCardinality(String),
|         time DateTime,
|         text String,
|         dead UInt8,
|         parent UInt32,
|         poll UInt32,
|         kids Array(UInt32),
|         url String,
|         score Int32,
|         title String,
|         parts Array(UInt32),
|         descendants Int32
|     )
|     ENGINE = MergeTree(update_time) ORDER BY id;
|
| A shell script:
|
|     BATCH_SIZE=1000
|     TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"
|
|     rm -f maxitem.json
|     wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json
|
|     clickhouse-local --query "
|         SELECT arrayStringConcat(groupArray(number), ',')
|         FROM numbers(1, $(cat maxitem.json))
|         GROUP BY number DIV ${BATCH_SIZE}
|         ORDER BY any(number) DESC" |
|     while read ITEMS
|     do
|         echo $ITEMS
|         clickhouse-client $TWEAKS --query "
|             INSERT INTO hackernews_history
|             SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
|     done
|
| It takes a few hours to download the data and fill the table.
| zX41ZdbW wrote:
| Also, a proof that it is updated in real-time:
| https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
| graiz wrote:
| Would be cool to see member similarity. Finding like-minded
| commentors/posters may help discover content that would be of
| interest.
| vsnf wrote:
| Reminds me of a similar project a few months ago whose purpose
| was to unmask alt accounts. It wasn't well received as I
| recall.
| noman-land wrote:
| Accidental dating app.
| internetter wrote:
| > Accidental dating app.
|
| Possibly the greatest indicator of social startup success.
| naveen99 wrote:
| We implemented member similarity in our hacker read app:
| https://apps.apple.com/in/app/hacker-read/id6479697844
|
| Once you register on ios, you can also login through webapp:
| https://hn.garglet.com
|
| probably not ready for a hacker news hug of death yet, but you
| can try.
| ed_db wrote:
| This is amazing, the amount of skill and knowledge involved is
| very impressive.
| wilsonzlin wrote:
| Thank you for the kind words!
| seanlinehan wrote:
| It was not obvious at first glance to me, but the actual app is
| here: https://hn.wilsonl.in/
| uncertainrhymes wrote:
| I'm curious if the link to the landing page was intentionally
| near the end. Only the people who actually read it would go to
| the site.
|
| (That's not a dig, I think it's a good idea.)
| bravura wrote:
| 1) It doesn't appear that search links are shareable or that the
| query terms are included in the URL.
|
| 2) Are you embedding the search phrases word by word? And using
| the same model as the documents used? Because I searched for
| "lead generation", which any decent non-unigram embedding
| should understand, but I got results for lead poisoning.
| oschvr wrote:
| I found me and my post there ! Nice
| freediver wrote:
| If you have a blog, add an RSS feed :)
| breck wrote:
| I tried to fetch his RSS too! :)
|
| Turns out, there's only 1 post so far on his blog.
|
| Hoping for more! This one is great.
| CuriouslyC wrote:
| Good example of data engineering/MLops for people who aren't
| familiar.
|
| I'd suggest using HDBScan to generate hierarchical clusters for
| the points, then use a model to generate names for interior
| clusters. That'll make it easy to explore topics out to the
| leaves, as you can just pop up refinements based on the
| connectivity to the current node using the summary names.
|
| The groups need more distinct coloring, which I think having
| clusters could help with. The individual article text size should
| depend on how important or relevant the article is, either in
| general or based on the current search. If you had more interior
| cluster summaries that'd also help cut down on some of the text
| clutter, as you could replace multiple posts with a group summary
| until more zoomed in.
| wilsonzlin wrote:
| Thanks for the great pointers! I didn't get the time to look
| into hierarchical clustering unfortunately but it's on my TODO
| list. Your comment about making the map clearer is great, and
| something with a lot of low-hanging approaches for improvement.
| Another thing for the TODO list :)
| NeroVanbierv wrote:
| Really love the island map! But the automatic zooming on the map
| doesn't seem very relevant. E.g. try typing "openai" - I can't
| see anything related to that query in that part of the map
| NeroVanbierv wrote:
| Ok I just noticed there is a region "OpenAI" in the north-west,
| but for some reason it zooms in somewhere close to "Apple"
| (middle of the island) when I type the query
| oersted wrote:
| Indeed, I've long been intrigued by the idea of rendering such
| clustering maps more like geographic maps for better
| readability.
|
| It would be cool to have analogous continents, countries, sub-
| regions, roads, different-sized settlements, and significant
| landmarks... This version looks great at the highest zoom
| level, but rapidly becomes hard to interpret as you zoom in,
| same as most similar large embedding or graph visualizations.
| wilsonzlin wrote:
| Thanks! Yeah sometimes there are one or two "far" away results
| which make the auto zoom seem strange. It's something I'd like
| to tune, perhaps zooming to where most but not all results are.
| luke-stanley wrote:
| Often embeddings are not so good for comparing similarity of
| text. A cross-encoder might be a good alternative, perhaps as
| a second-pass, since you already have the embeddings.
| https://www.sbert.net/docs/pretrained_cross-encoders.html
| Pairwise, this can be quite slow, but as a second pass, it
| might be much higher quality. Obviously this gets into LLM's
| territory, but the language models for this can be small and
| more reliable than cosine on embeddings.
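The two-pass retrieval idea above can be sketched in plain Python. This is illustrative only: `pair_score` is a stand-in for a real cross-encoder relevance model (such as the sbert.net ones linked in the comment), and the toy vectors and scores are invented for the example.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, query_text, docs, doc_vecs, pair_score, k=10):
    # First pass: cheap cosine ranking over precomputed embeddings.
    first = sorted(range(len(docs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)[:k]
    # Second pass: slower pairwise scorer (e.g. a cross-encoder),
    # applied only to the top-k candidates.
    return sorted(first, key=lambda i: pair_score(query_text, docs[i]),
                  reverse=True)

# Toy data; pair_score stands in for a cross-encoder's output.
docs = ["lead poisoning", "lead generation", "dog breeds"]
doc_vecs = [[1.0, 0.1], [0.9, 0.4], [0.0, 1.0]]
toy_scores = {"lead poisoning": 0.2, "lead generation": 0.9, "dog breeds": 0.0}
pair_score = lambda q, d: toy_scores[d]

ranked = search([1.0, 0.2], "lead generation", docs, doc_vecs, pair_score, k=2)
# First pass keeps docs 0 and 1; the second pass promotes "lead generation".
```

Because the expensive pairwise scorer only sees the k candidates that survive the embedding pass, the second pass stays tractable even over millions of documents.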
| paddycap wrote:
| Adding a subscribe feature to get an email with the most recent
| posts in a topic/community would be really cool. One of my
| favorite parts of HN is the weekly digest I get in my inbox; it
| would be awesome if that were tailored to me.
|
| What you've built is really impressive. I'm excited to see where
| this goes!
| wilsonzlin wrote:
| Thanks! Yeah if there's enough interested users I'd love to
| turn this into a live service. Would an email subscription to a
| set of communities you pick be something you'd be interested
| in?
| oersted wrote:
| This is a surprisingly big endeavour for what looks like an
| exploratory hobby project. Not to minimize the achievement, very
| cool, I'm just surprised by how much was invested into it.
|
| They used 150 GPUs and developed two custom systems (db-rpc and
| queued) for inter-server communication, and this was just to
| compute the embeddings, there's a lot of other work and
| computation surrounding it.
|
| I'm curious about the context of the project, and how someone
| gets this kind of funding and time for such research.
|
| PS: Having done a lot of similar work professionally (mapping
| academic paper and patent landscapes), I'm not sure if 150 GPUs
| were really needed. If you end up just projecting to 2D and
| clustering, I think that traditional methods like bag-of-words
| and/or topic modelling would be much easier and cheaper, and the
| difference in quality would be unnoticeable. You can also use
| author and comment-thread graphs for similar results.
| alchemist1e9 wrote:
| The author is definitely very skilled. I find it interesting
| they submit posts on HN but haven't commented since 2018! And
| then embarked on this project.
|
| As far as funding/time, one possibility is they are between
| endeavors/employment and it's self funded as they have had a
| successful career or business financially. They were very
| efficient at the GPU utilization so it probably didn't cost
| that much.
| wilsonzlin wrote:
| Thanks! Haha yeah I'm trying to get into the habit of writing
| about and sharing the random projects I do more often. And
| yeah the cost was surprisingly low (in the hundreds of
| dollars), so it was pretty accessible as a hobby project.
| wilsonzlin wrote:
| Hey, thanks for the kind words. I wasn't able to mention the
| costs in the post (might follow up in the future) but it was in
| the hundreds of dollars, so was reasonably accessible as a
| hobby project. The GPUs were surprisingly cheap, and I only
| scaled up because I was impatient :) --- the entire cluster
| ran for just a few hours.
|
| Do you have any links to your work? They sound interesting and
| I'd like to read more about them.
| oersted wrote:
| "Hundreds of dollars" sounds a bit painful as an EU engineer
| and entrepreneur :), but I guess it's all relative. We would
| think twice about investing this much manpower and compute
| for such an exploratory project even in a commercial setting
| if it was not directly funded by a client.
|
| But your technical skill is obvious and very impressive.
|
| If you want to read more, my old bachelor's thesis is
| somewhat related, from when we only had word embeddings and
| document embeddings were quite experimental still:
| https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...
|
| I've done a lot of follow-up work in my startup Scitodate, which
| includes large-scale graph and embedding analysis, but we
| haven't published most of it for now.
| wilsonzlin wrote:
| Thanks for sharing, I'll have a read, looks very relevant
| and interesting!
| b800h wrote:
| As an EU-based engineer, you wouldn't do this, it's a
| massive GDPR violation (failure to notify data subjects of
| data processing), which does actually have
| extraterritoriality, although I somehow doubt that the
| information commissioners are going to be coming after OP.
| PaulHoule wrote:
| (1) Definitely you could use a cheaper embedding and still get
| pretty good results
|
| (2) I apply classical ML (say probability calibrated SVM) to
| embeddings like that and get good results for classification
| and clustering at speeds over 100x fine-tuning an LLM.
| ashu1461 wrote:
| This is pretty great.
|
| Feature request : Is it possible to show in the graph how famous
| the topic / sub topic / article is ?
|
| So that we can do an educated exploration in the graph around
| what was upvoted and what was not ?
| wilsonzlin wrote:
| Thanks! Do you mean within the sentiment/popularity analysis
| graph? Or the points and topics within the map?
| oersted wrote:
| Here's a great tool that does almost exactly the same thing for
| any dataset: https://github.com/enjalot/latent-scope
|
| Obviously the scale of OP's project adds a lot of interesting
| complexity, this tool cannot handle that, but it's great for
| medium-sized datasets.
| xnx wrote:
| As a novice, is there a benefit to using custom Node as the
| downloader? When I did my download of the 40 million Hacker News
| API items I used "curl --parallel".
|
| What I would like to figure out is the easiest way to go from the
| API straight into a parquet file.
| wilsonzlin wrote:
| I think your curl approach would work just as fine if not
| better. My instinct was to reach for Node.js out of
| familiarity, but curl is fast and, given the IDs are
| sequential, something like `parallel curl ::: $(seq 0 $max_id)`
| would be pretty simple and fast. I did end up needing more
| logic though so Node.js did ultimately come in handy.
|
| As for the Arrow file, I'm not sure unfortunately. I imagine
| there are some difficulties because the format is columnar, so
| it probably wants a batch of rows (when writing) instead of one
| item at a time.
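The row-batching idea mentioned above can be sketched in a few lines. `BatchedWriter` is a hypothetical helper invented for this example; in practice the flush callback would hand each batch to something like pyarrow's Parquet writer rather than a plain list.

```python
# Minimal sketch of buffering rows for a columnar (Parquet/Arrow-style)
# writer: columnar formats want batches of rows, not one row at a time.

class BatchedWriter:
    def __init__(self, flush_fn, batch_size=10_000):
        self.flush_fn = flush_fn      # called with a list of buffered rows
        self.batch_size = batch_size
        self.buffer = []

    def write_row(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(list(self.buffer))
            self.buffer.clear()

batches = []
writer = BatchedWriter(batches.append, batch_size=3)
for item_id in range(7):
    writer.write_row({"id": item_id})
writer.flush()  # don't forget the final partial batch
# 7 rows with batch_size=3 -> batches of sizes 3, 3, 1
```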
| password4321 wrote:
| Related a month ago:
|
| _A Peek inside HN: Analyzing ~40M stories and comments_
|
| https://news.ycombinator.com/item?id=39910600
| chossenger wrote:
| Awesome visualisation, and great write-up. On mobile (in
| portrait), a lot of longer titles get culled as their origin
| scrolls off, with half of it still off the other side of the
| screen - wonder if it'd be worth keeping on rendering them until
| the entire text field is off screen (especially since you've
| already got a bounding box for them).
|
| While using it, I stumbled upon [1], which reflects your
| comments on comment sentiment.
|
| This also reminded me of [2] (for which the site itself had
| rotted away, incidentally) - analysing HN users' similarity by
| writing style.
|
| [1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2]
| https://news.ycombinator.com/item?id=33755016
| wilsonzlin wrote:
| Thanks for the kind words, and raising that problem --- I've
| added it as an issue to fix.
|
| Thanks for sharing that article, it was an interesting read. It
| was cool how deep the analysis went with a few simple
| statistical methods.
| abe94 wrote:
| This is impressive work, especially for a one man show!
|
| One thing that stood out to me was the graph of the sentiment
| analysis over time, I hadn't seen something like that before and
| it was interesting to see it for Rust. What were the most
| positive topics over time? And were there topics that saw very
| sudden drops?
|
| I also found this sentence interesting, as it rings true to me
| about social media "there seems to be a lot of negative sentiment
| on HN in general." It would be cool to see a comparison of
| sentiment across social media platforms and across time!
| wilsonzlin wrote:
| Thanks! Yeah I'd like to dive deeper into the sentiment aspect.
| As you say it'd be interesting to see some overview, instead of
| specific queries.
|
| The negative sentiment stood out to me mostly because I was
| expecting a more "clear-cut" sentiment graph: largely neutral-
| positive, with spikes in the positive direction around positive
| posts and negative around negative posts. However, for almost
| all my queries, the sentiment was almost always negative. Even
| positive posts apparently attracted a lot of negativity
| (according to the model and my approach, both of which could be
| wrong). It's something I'd like to dive deeper into, perhaps in
| a future blog post.
| walterbell wrote:
| Great work! Would you consider adding support for search-via-
| url, e.g. https://hn.wilsonl.in/?q=sentiment+analysis. It
| would enable sharing and bookmarks of stable queries.
| wilsonzlin wrote:
| Thanks for the suggestion, I've just added the feature:
|
| https://hn.wilsonl.in/s/sentiment%20analysis
| luke-stanley wrote:
| I did something related for my ChillTranslator project for
| translating spicy HN comments to calm variations which has a
| GGUF model that runs easily and quickly but it's early days.
| I did it with a much smaller set of data, using LLMs to make
| calm variations and an algorithm to pick the closest, least
| spicy one as synthetic training data, then fine-tuned Phi 2.
| I used Detoxify, and since OpenAI's sentiment analysis is
| free, I used that to verify Detoxify had correctly identified
| spicy comments before generating a calm pair. I do worry
| that HN could implode / degrade if a good balance isn't kept
| in the comments and posts that people come here for. Maybe I
| can use your sentiment data to mine faster and generate more
| pairs. I've only done an initial end-to-end test so far
| (which works!). The model, so far is not as high quality as
| I'd like but I've not used Phi 3 on it yet and I've only used
| a very small fine-tune dataset so far. File is here though:
| https://huggingface.co/lukestanley/ChillTranslator I've had
| no feedback from anyone on it though I did have a 404 in my
| Show HN post!
| deadbabe wrote:
| Anecdotally, I think anyone who reads HN for a while will
| realize it to be a negative, cynical place.
|
| Posts written in sweet syrupy tones wouldn't do well here,
| and jokes are in short supply or outright banned. Most people
| here also seem to be men. There's always someone shooting you
| down. And after a while, you start to shoot back.
| xanderlewis wrote:
| (Without wanting to sound negative or cynical) I don't
| think it is, but maybe I haven't been here long enough to
| notice. It skews towards technical and science and
| technology-minded people, which makes it automatically a
| bit 'cynical', but I feel like 95% of commenters are doing
| so at least in good faith. The same cannot be said of many
| comparable discussion forums or social media websites.
|
| Jokes are also not banned; I see plenty on here. Low-effort
| ones and chains of unfunny wordplay or banter seem to be
| frowned upon though. And that makes it cleaner.
| sethammons wrote:
| I've been here a hot minute and I agree with you. Lots of
| good faith. Lots of personal anecdotes presumably
| anchored in experience. Some jokes are really funny, just
| not reddit-style. Similarly, no slashdot quips generally,
| such as "first post" or "i, for one, welcome our new HN
| sentiment mapping robot overlords." Sometimes things get
| downvoted that shouldn't, but most of the flags I see are
| well deserved, and I vouch for ones that I think are not
| flag-worthy
| goles wrote:
| I wonder how much of a persons impression of this is
| formed by their browsing habits.
|
| As a parent comment mentions big threads can be a bit of
| a mess but usually only for the first couple of hours.
| Comments made in the spirit of HN tend to bubble up and
| off-topic, rude comments and bad jokes tend to percolate
| down over the course of hours. Also a number of threads
| that tend to spiral get manually detached which takes
| time to go clean up.
|
| Someone who isn't somewhat familiar with how HN works
| that is consistently early to stories that attract a lot
| of comments is reading an almost entirely different site
| than someone who just catches up at the end of the day.
| fragmede wrote:
| some of the more negative threads will get flagged and
| detached and by the end of the day a casual browse
| through the comments isn't even going to come across
| them. eg something about the situation in the middle east
| is going to attract a lot of attention.
| darby_eight wrote:
| > Anecdotally, I think anyone who reads HN for a while will
| realize it to be a negative, cynical place.
|
| I don't think this is particularly unique to HN. Anonymous
| forums tend to attract contrarian assholes. Perhaps this
| place is more, erm, poorly socially-adapted to the general
| population, but I don't see it as very far outside the norm
| outside of the average wealth of the posters.
| holoduke wrote:
| Really? Mmm, I think HN is a place with, on average, highly
| intelligent people. People who understand that their
| opinion is not the only one. I rarely have issues with
| people here. Might also be because we are all in the same
| bubble here.
| flir wrote:
| I think it's the engineering mindset. You're always trying
| to figure out what's wrong with an idea, because you might
| be the poor bastard that ends up having to build it. Less
| costly all round if you can identify the flaw now, not
| halfway through sprint 7. After a while it bleeds into
| everything you do.
| abakker wrote:
| It's so interesting that in Likert scale surveys, I tend to
| see huge positivity bias/agreement bias, but comments tend to
| be critical/negative. I think there is something related to
| the format of feedback that skews the graph in general.
|
| On HN, my theory is that positivity is the upvotes, and
| negativity/criticality is the discussion.
|
| Personally, my contribution to your effort is that I would
| love to see a tool that could do this analysis for me over a
| dataset/corpus of my choosing. The code is nice, but it is a
| bit beyond me to follow in your footsteps.
| dylan604 wrote:
| The sentiment issue is a curious one to me. For example, a
| lot of humans I interact with that are not devs take my
| direct questioning or critical responses to be "negative"
| when there is no negative intent at all. Pointing out
| something doesn't work or anything that the dev community
| encounters on a daily basis isn't an immediate negative
| sentiment but just pointing out the issues. Is it a meme-like
| helicopter parent constantly doling out praise positive so
| that anything differing shows negativity? Not every piece of
| art needs to be hung on the fridge door, and providing
| constructive criticism for improvement is oh so often framed
| as negative. That does the world no favors.
|
| Essentially, I'm not familiar with HuggingFace or any models
| in this regard. But if they are trained from the socials,
| then it seems skewed from the start to me.
|
| Also, fully aware that this comment will probably be viewed
| as negative based on stated assumptions.
|
| edit: reading further down the comments, clearly I'm not the
| first with these sentiments.
| walterbell wrote:
| _> sentiment across social media platforms and across time!_
|
| Also time zones and weekday/weekend.
| kcorbitt wrote:
| I actually did a blog post a few months ago where I analyzed HN
| commenter sentiment across AI, blockchain, remote work and
| Rust. The final graph at the very end of the post is the
| relevant one on this topic!
|
| https://openpipe.ai/blog/hn-ai-crypto
| necovek wrote:
| It's really unfortunate the HN API does not provide votes on
| comments: I wonder if and how sentiment analysis would change
| if they were weighted by votes/downvotes?
|
| My unsupported take is that engineers are mostly critical, but
| will +1 positive feedback instead of repeating it, as they
| might for criticism :)
| gaauch wrote:
| A long term side project of mine is to try to build a
| recommendation algorithm trained on HN data.
|
| I trained a model to predict if a given post will reach the front
| page, get flagged, etc. I collected over 1000 RSS feeds and rank
| the RSS entries with my ranking models.
|
| I submit the high ranking entries on HN to test out my models and
| I can reach the front page consistently sometimes having multiple
| entries on the front page at a given time.
|
| I also experiment with user->content recommendation, for that I
| use comment data for modeling interactions between users and
| entries, which seems to work fine.
|
| The only problem I have is that I get a lot of 'out of
| distribution' content in my RSS feeds, which causes my ranking
| models to get 'confused'. For this I trained models to predict
| whether a given entry belongs on HN or not. On top of that I
| have some tagging models trained on data I scraped from
| lobste.rs and hand annotated.
|
| I had been working on this on and off for the last 2 years or so,
| this account is not my main, and just one I created for testing.
|
| AMA
| saganus wrote:
| did you find if submitted entries are more likely to reach the
| frontpage depending on the title or the content?
|
| i.e. do HN users upvote more based on the title of the article
| or on actually reading them?
| gaauch wrote:
| I tried making an LLM generate different titles for a given
| article and compared their ranking scores. There seems to be
| a lot of variation in the ranking scores based on the way the
| title is worded. Titles that are more likely to generate
| 'outrage' seem to get ranked higher, but at the same time
| that increases the is_hn_flagged score, which tries to
| predict if an entry will get flagged.
| swozey wrote:
| I'm.. shocked there's been 40 million posts. Wow.
|
| Really neat work
|
| edit: Also had no idea HN went back to 2006.
| https://news.ycombinator.com/item?id=1
|
| edit2: PG wrote this? https://news.ycombinator.com/item?id=487171
| fancy_pantser wrote:
| HN submissions and comments are very different on weekends (and
| US holidays). Your data could explore and quantify this in some
| very interesting ways!
| callalex wrote:
| "Cloud Computing" "us-east-1 down"
|
| This gave me a belly laugh.
| replete wrote:
| I think this is easily the coolest post I've seen on HN this year
| minimaxir wrote:
| A modern recommendation for UMAP is Parametric UMAP
| (https://umap-learn.readthedocs.io/en/latest/parametric_umap....),
| which instead trains a small Keras MLP to perform the
| dimensionality reduction down to 2D by minimizing the UMAP loss.
| The advantage is that this model is small and can be saved and
| reused to predict on unknown new data (a traditionally trained
| UMAP model is large), and training is theoretically much faster
| because GPUs are GPUs.
|
| The downside is that the implementation in the Python UMAP
| package isn't great and creates/pushes the whole expanded
| node/edge dataset to the GPU, which means you can only train it
| on about 100k embeddings before going OOM.
|
| The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all
| unsupervised is so useful that I'm tempted to figure out a more
| scalable implementation of Parametric UMAP.
| bravura wrote:
| From a quick glance, it appears that it's because the
| implementation pushes the entire graph (all edges) to the GPU.
| Sampling of edges during training could alleviate this.
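The edge-sampling fix suggested here amounts to feeding the graph to the GPU in minibatches rather than all at once. A minimal sketch of the batching side (the names are invented for illustration; a real implementation would run the UMAP loss on each batch):

```python
import random

def edge_minibatches(edges, batch_size, seed=0):
    # Shuffle edge indices once per epoch, then yield fixed-size
    # minibatches so only one batch needs to reside on the GPU
    # at a time instead of the whole expanded edge set.
    rng = random.Random(seed)
    order = list(range(len(edges)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [edges[i] for i in order[start:start + batch_size]]

edges = [(i, i + 1) for i in range(10)]
batches = list(edge_minibatches(edges, batch_size=4))
# 10 edges with batch_size=4 -> batch sizes 4, 4, 2;
# every edge appears exactly once per epoch.
```

Memory then scales with the batch size rather than the number of edges, which is what makes training past ~100k embeddings feasible.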
| minimaxir wrote:
| Indeed, TensorFlow likes pushing everything to the GPU by
| default whereas many PyTorch DL implementations encourage
| feeding data from the CPU to the GPU as needed with a
| DataLoader.
|
| There have been attempts at a PyTorch port of Parametric UMAP
| (https://github.com/lmcinnes/umap/issues/580) but nothing as
| good.
| bravura wrote:
| Looks like there is a little motion on this topic:
|
| https://github.com/lmcinnes/umap/pull/1103
| Der_Einzige wrote:
| It exists in cuML with a fast GPU implementation. Not sure why
| cuML is so poorly known though...
| minimaxir wrote:
| I'll give that a look: the feature set of GPU-accelerated ops
| seems just up my alley for this pipeline:
| https://github.com/rapidsai/cuml
|
| EDIT: looking through the docs it's just GPU-accelerated
| UMAP, not a parametric UMAP which trains a NN model. That's
| easy to work around though by training a new NN model to
| predict the reduced dimensionality values and minimizing
| RMSE.
| minimaxir wrote:
| Tested it out and the UMAP implementation with this library
| is very very fast compared to Parametric UMAP: running it
| on 100k embeddings took about 7 seconds when the same
| pipeline on the same GPU took about a half-hour. I will
| definitely be playing around with it more.
| lmeyerov wrote:
| Yeah we advise Graphistry users to keep umap training
| sets to < 100k rows, and instead focus on doing careful
| sampling within that, and multiple models for going
| beyond that. It'd be more accessible for teams if we
| could raise the limit, but quality wise, it's generally
| fine. Security logs, customer activity, genomes, etc.
|
| RAPIDS umap is darn impressive tho. Instead of focusing
| on improving further, it did the job: our bottleneck
| shifted to optimizing the ingest pipeline to feed umap,
| so we released cu_cat as a GPU-accelerated automated
| feature engineering library to get all that data into
| umap. RAPIDS cudf helps take care of the intermediate IO
| and wrangling in-between.
|
| The result is we can now umap and interactively visualize
| most real-world large datasets, database query results,
| and LLM embeddings that pygraphistry & louie.ai users
| encounter in seconds. Many years to get here, and now it
| is so easy!
| dfworks wrote:
| If anybody found this interesting and would like some further
| reading, the paper below employed a similar strategy to analyse
| inauthentic content/disinformation on Twitter.
|
| https://files.casmconsulting.co.uk/message-based-community-d...
|
| If you would like to read about my largely unsuccessful
| recreation of the paper, you can do so here -
| https://dfworks.xyz/blog/partygate/
| Lerc wrote:
| A suggestion for analysis:
|
| Compare topics/sentiment etc. by number of users and by number of
| posts.
|
| Are some topics dominated by a few prolific posters? Positively
| or negatively.
|
| Also, how does one separate negative/positive sentiment from
| criticism/advocacy?
|
| How hard is it to detect positive criticism, or enthusiastic
| endorsement of an acknowledged bad thing?
| nojvek wrote:
| I'm impressed with the map component in canvas. It's very smooth,
| with dynamic, Google Maps-like zoom.
|
| Gonna dig more into it.
|
| Exemplary Show HN! We need more of this.
| datguyfromAT wrote:
| What a great read! Thanks for taking the time and effort to
| provide the insight into your process.
| gsuuon wrote:
| This is super cool! Both the writeup and the app. It'd be great
| if the search results linked to the HN story so we can check out
| the comments.
| jxy wrote:
| > We can see that in this case, where perhaps the X axis
| represents "more cat" and Y axis "more dog", using the euclidean
| distance (i.e. physical distance length), a pitbull is somehow
| more similar to a Siamese cat than a "dog", whereas intuitively
| we'd expect the opposite. The fact that a pitbull is "very dog"
| somehow makes it closer to a "very cat". Instead, if we take the
| angle distance between lines (i.e. cosine distance, or 1 minus
| angle), the world makes sense again.
|
| Typically the vectors are normalized, instead of what's shown in
| this demonstration.
|
| When using normalized vectors, the euclidean distance measures
| the distance between the two end points of the respective
| vectors. While the cosine distance measures the length of one
| vector projected onto the other.
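The normalization point can be checked numerically. A minimal sketch (the cat/dog axis values are invented, following the post's example): on unit vectors, squared euclidean distance equals 2 * (1 - cosine similarity), so the two metrics agree on rankings.

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    return sum(x * y for x, y in zip(a, b))  # unit vectors: dot = cosine

pitbull = normalize([3.0, 1.0])   # "very dog", a little "cat"
siamese = normalize([1.0, 3.0])   # "very cat", a little "dog"
dog = normalize([1.0, 0.2])

# Identity: |a - b|^2 = 2 * (1 - cos(a, b)) for unit vectors a, b.
d2 = euclidean(pitbull, siamese) ** 2
assert abs(d2 - 2 * (1 - cosine_sim(pitbull, siamese))) < 1e-12

# After normalization, the pitbull is euclidean-closer to "dog"
# than to the Siamese cat, matching intuition.
assert euclidean(pitbull, dog) < euclidean(pitbull, siamese)
```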
| GeneralMayhem wrote:
| The issue with normalization is that you lose a degree of
| freedom - which when you're visualizing, effectively means
| losing a dimension. Normalized 2d vectors are really just 1d
| vectors; if you want to show a 2d relationship, now you have to
| use 3d vectors (so that you have 2 degrees of freedom again).
| cyclecount wrote:
| I can't tell from the documentation on GitHub: does the API
| expose the flagged/dead posts? It would be interesting to see
| statistics on what's been censored lately.
| coolspot wrote:
| Absolutely wonderful project and even more so the writeup!
|
| Feedback: on my iOS phone, once you select a dot on the map,
| there is no way to unselect it. Preview card of some articles
| takes full screen, so I can't even click to another dot. Maybe
| add a "cross" icon for the preview card, or make it so that
| tapping outside of a card hides the whole card strip?
| Igor_Wiwi wrote:
| How much did you pay to generate those embeddings?
| sourcepluck wrote:
| Where is lisp?! I thought it was a veritable (urban) legend
| around these parts that this forum is obsessed with lisp..?
| pinkmuffinere wrote:
| Maybe lisp is so niche that even a rather small interest makes
| HN relatively lispy?
| gitgud wrote:
| Very cool! I was hoping to be able to navigate to the HN post
| from the map though? Is that possible?
___________________________________________________________________
(page generated 2024-05-09 23:00 UTC)