[HN Gopher] Show HN: Hacker Search - A semantic search engine fo...
___________________________________________________________________
Show HN: Hacker Search - A semantic search engine for Hacker News
Hi HN! I'm Jonathan and I built Hacker Search
(https://hackersearch.net), a semantic search engine for Hacker
News. Type a keyword or a description of what you're interested in,
and you'll get top links from HN surfaced to you along with brief
summaries. Unlike HN's otherwise very valuable search feature,
Hacker Search doesn't require you to get your keywords exactly
right. That's achieved by leveraging OpenAI's latest embedding
models alongside more traditional indexes extracted from the
scraped and cleaned up contents of the links. I think there are
many more interesting things one could build atop the HN dataset in
the age of LLMs (e.g. more explicitly searching for technical
opinions, recommending stories to you based on your interests, and
making the core search feature more useful). If any of those sound
interesting to you, head over to https://hackersearch.net/signup to
get notified when I launch them! Note: at least one person has
built something similar before
(https://news.ycombinator.com/item?id=36391655). Funnily enough, I
only found out about it after building my own implementation. Based
on my testing, I think Hacker Search generally performs better when
doing keyword/sentence searches (vs. whole-document similarity
lookup), thanks to the way the data is indexed.
Author : jnnnthnn
Score : 145 points
Date : 2024-05-02 17:01 UTC (6 hours ago)
(HTM) web link (hackersearch.net)
(TXT) w3m dump (hackersearch.net)
| levkk wrote:
| Pretty cool. A little slow for a search engine, though; have you
| tried in-database embeddings for semantic search, like PostgresML?
| jnnnthnn wrote:
| Thanks for trying it out!
|
| Agreed it could be faster for uncached queries. The embeddings
| retrieval itself is actually pretty fast (uses pgvector).
| However, I found that having an LLM rerank results + generate
| summaries related to the search query made results more useful,
| which is what accounts for much of the latency.
|
| Maybe I should make that a user-customizable setting!
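|
| For reference, here's a minimal sketch of the shape of that flow
| (in Python for brevity; the actual site is TypeScript, and the
| prompt/model names here are just illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def rerank_and_summarize(query: str, candidates: list[dict]) -> list[dict]:
|         """Rerank pgvector candidates with an LLM and produce a
|         short summary of each result relative to the user's query."""
|         out = []
|         for item in candidates:
|             resp = client.chat.completions.create(
|                 model="gpt-3.5-turbo",
|                 messages=[
|                     {"role": "system",
|                      "content": "Score 0-10 how well the story matches "
|                                 "the query, then give a one-sentence "
|                                 "summary. Reply as: <score>|<summary>"},
|                     {"role": "user",
|                      "content": f"Query: {query}\nStory: {item['summary']}"},
|                 ],
|             )
|             score, blurb = resp.choices[0].message.content.split("|", 1)
|             out.append({**item,
|                         "llm_score": float(score),
|                         "blurb": blurb.strip()})
|         # These LLM round-trips are what account for most of the latency.
|         return sorted(out, key=lambda r: r["llm_score"], reverse=True)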
| montanalow wrote:
| You can do all of that in a single SQL query with pgml.embed(),
| then pgml.train() a custom reranker with XGBoost so you can
| pgml.predict() the conversion score of a search result based on
| click-through rate or another objective.
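|
| Roughly (a sketch with placeholder table/model names, not a
| tested setup; a trained reranker via pgml.train()/pgml.predict()
| would then reorder the rows this query returns):
|
|     # In-database embedding: the query text is embedded by Postgres
|     # itself via pgml.embed(), so no extra round-trip to an external
|     # embedding API is needed.
|     import psycopg
|
|     QUERY = """
|     SELECT title, url,
|            1 - (embedding <=>
|                 pgml.embed('intfloat/e5-small-v2', %s)::vector)
|              AS similarity
|     FROM stories
|     ORDER BY similarity DESC
|     LIMIT 10;
|     """
|
|     with psycopg.connect("postgresql://localhost/hn") as conn:
|         rows = conn.execute(QUERY, ("postgres clustering",)).fetchall()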
|
| If you'd like free hosting, feel free to reach out. I'm one
| of the founders at postgresml.org.
| jnnnthnn wrote:
| Sweet. I'll follow-up off HN. Thank you!
| Scene_Cast2 wrote:
| What's your stack like?
| jnnnthnn wrote:
| - TypeScript
|
| - Next.js
|
| - OpenAI's embeddings and GPT endpoints
|
| - Postgres with pgvector (on neon.tech)
|
| - Tailwind
|
| - tRPC
|
| - Vercel for web hosting
|
| - Google Cloud products for data pipelines (GCS, Cloud Tasks)
| v1sea wrote:
| Can you go into more detail on how it works or provide some
| references for articles/papers on implementing a system like
| this? Is it just RAG?
|
| Testing it out, I'd say the results for "graph visualization" are
| focused if a bit incomplete. So to me it has high precision, but
| lower recall.
|
| I don't see this searching comments. That could be a nice
| extension. Thanks for sharing.
| jnnnthnn wrote:
| It is mostly RAG, although I suppose that doesn't say much
| about the system: one thing I've found is that the way you
| clean and process the data substantially changes the quality of
| the results. I'll write a little blog post sharing some of the
| learnings!
|
| If you feel up for it, you should share your email in the
| right-hand "Unhappy with your results?" widget. My plan is to
| manually look into the disappointing searches and follow up
| with better results for folks, in addition to fixing whatever
| can be fixed.
|
| Agreed re: searching comments (which it indeed currently
| doesn't do).
| lpetting wrote:
| I am not surprised that 'the way you clean and process the
| data substantially changes the quality of the results.' Can
| you share anything about your approach here?
| jnnnthnn wrote:
| I'll write up a little blog post once the traffic dies down
| a bit!
|
| In the meantime, one thing that comes to mind is that
| simply embedding the whole contents of the webpages after
| scraping them didn't yield very good search results. As an
| example, an article about Python might only mention Python
| by name once. I found that trimming extraneous strings
| (e.g. menus, share links), and then extracting key themes +
| embedding those directly yielded much, much better results.
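|
| Roughly (an illustrative sketch, not the actual pipeline; the
| prompt and model names are placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def index_page(cleaned_text: str) -> list[float]:
|         # 1. Distill the (already de-boilerplated) page into its key
|         #    themes rather than embedding the raw scrape.
|         themes = client.chat.completions.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user",
|                        "content": "List the key themes and technologies "
|                                   "discussed in this article, one per "
|                                   "line:\n\n" + cleaned_text[:8000]}],
|         ).choices[0].message.content
|         # 2. Embed the distilled themes; they match short user queries
|         #    far better than the full page text does.
|         return client.embeddings.create(
|             model="text-embedding-3-small",
|             input=themes,
|         ).data[0].embedding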
| zarathustreal wrote:
| In our RAG pipeline we found that implementing HyDE also
| made a huge difference. Maybe generating and embedding
| hypothetical user search queries (per document) would
| help here.
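|
| Something along these lines (a made-up sketch; the prompt and
| model names are just examples):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def hypothetical_query_embeddings(doc_summary: str) -> list[list[float]]:
|         # Generate a few queries a reader might type to find this
|         # document, then index their embeddings alongside the doc's.
|         queries = client.chat.completions.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user",
|                        "content": "Write 5 short search queries a reader "
|                                   "might type to find this article:\n\n"
|                                   + doc_summary}],
|         ).choices[0].message.content.splitlines()
|         resp = client.embeddings.create(
|             model="text-embedding-3-small",
|             input=[q for q in queries if q.strip()])
|         return [d.embedding for d in resp.data]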
| tinyhouse wrote:
| Nice work! I'm sure you know that you can also search Google with
| site:https://news.ycombinator.com. If I were you I would probably
| think of a niche where one can get better results by searching HN
| and other relevant data sources. Another suggestion is not about
| the search so much but about the UI. One of the worst things
| about sites like HN is that it's really hard to follow a long
| thread. If you can fix that by doing some data transformation and
| building a nice UI for search results, it'd be pretty useful. Good
| luck!
| jnnnthnn wrote:
| Thanks for the feedback!
|
| One big distinction with the
| "site:https://news.ycombinator.com" hack is that the search on
| Hacker Search directly runs against the underlying link's
| contents, rather than whatever happens to be on HN. We also
| more directly leverage HN's curation by factoring in scores.
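|
| For illustration, blending embedding similarity with the HN score
| could look something like this (made-up schema; not the actual
| ranking query):
|
|     import psycopg
|
|     BLENDED_RANKING = """
|     SELECT title, url, score,
|            (1 - (embedding <=> %s::vector)) * ln(score + 1)
|              AS blended_rank
|     FROM stories
|     ORDER BY blended_rank DESC
|     LIMIT 20;
|     """
|
|     def search(conn: psycopg.Connection, query_embedding: list[float]):
|         # pgvector accepts the '[x,y,...]' text form for vectors.
|         vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
|         return conn.execute(BLENDED_RANKING, (vec,)).fetchall()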
|
| Appreciate your suggestions; will look into building those!
| hubraumhugo wrote:
| Great to see cool stuff being built on top of HN data! Many of us
| rely on this platform as one of the primary sources of
| information.
|
| > recommending stories to you based on your interests
|
| I built this as a service that monitors and classifies HN stories
| based on your interests (solved my FOMO):
| https://www.kadoa.com/hacksnack
| jnnnthnn wrote:
| Nice!
| manca wrote:
| I love projects like this. It shows the true potential of what
| LLMs and RAG can unlock. Imagine applying the same method to the
| actual content within the threads to extract the sentiment, as
| well as summarize the key points of a particular thread -- the
| options are limitless.
|
| My only piece of advice, though: try to do the reranking using
| some other rerankers instead of an LLM -- you'll save both on the
| latency AND the cost.
|
| Other than that, good job.
| jnnnthnn wrote:
| Thanks! I tried a few other approaches and found the LLM
| results were overall better (latency and cost aside). Maybe
| that should be an option made available to users though...
| bardiapour wrote:
| I think not; better results >>> better latency + cost
| robrenaud wrote:
| Maybe a combined approach beats either? Let some non-LLM
| reranker quickly spit out two results, and fill in the rest
| with the LLM.
| isoprophlex wrote:
| Cohere has a very cheap, fast and effective reranking API!
|
| https://cohere.com/rerank
| pjot wrote:
| I've done something very similar! But with DuckDB as a vector
| store/engine.
|
| https://github.com/patricktrainer/duckdb-embedding-search
| jnnnthnn wrote:
| Sweet! I'll try running it tonight.
| isoprophlex wrote:
| There's at least three of us ;) I built https://searchhacker.news
|
| Loving your LLM-generated summaries! Very nice user experience to
| see at a glance what a hit is about. Also your back button
| actually works, haha.
|
| Well done!
| jnnnthnn wrote:
| Thank you, and likewise!
| curious_cat_163 wrote:
| Four. :)
| jnnnthnn wrote:
| Welp, I'm so glad I wrote "at least one person", and not
| "one person" in my note in the OP :D
| curious_cat_163 wrote:
| Heh. But, it's great that you put your stuff out there.
|
| Kudos to you!
|
| I will play with it some more.
| stevenicr wrote:
| Give me a button to remove a story from the search results and
| you'd have saved me many clicks and minutes yesterday.
|
| search 'ssh'.. select comments not stories.. omg the thing I am
| looking for is only a few days ago but I can't get through all
| the ones from the one story.. page to page.. meh..
|
| anyways I love the privacy terms statement page, I almost used it
| to check something.
| jnnnthnn wrote:
| Ack! If you're willing to share that example with me over email
| (jonathan@unikowski.net), I'd love to see what we can do. Maybe
| good enough semantic search over comments would remove the need
| for filtering on post type?
| awendland wrote:
| Following @isoprophlex, I'll be the fourth comment to say I also
| built a variant of this: https://hnss.alexwendland.com/
|
| I built mine on top of an RSS feed I generate from Hacker News
| which filters out any posts linking to the top 1 million domains
| [1] and creates a readable version of the content. I use it to
| surface articles on smaller blogs/personal websites--it's become
| my main content source. It's generated via GitHub Actions every 4
| hours and stored in a detached branch on GitHub (~2 GB of data
| from the past 4 years). Here's an example for posts with >= 10
| upvotes [2].
|
| It only took several hours to build the semantic search on top.
| And that included time for me to try out and learn several
| different vector DBs, embedding models, data pipelines, and UI
| frameworks! The current state of AI tooling is wonderfully
| simple.
|
| In the end I landed on (selected in haste optimizing for
| developer ergonomics, so only a partial endorsement):
|
| - BAAI/bge-small-en as an embedding model
|
| - Python with:
|   - HuggingFaceBgeEmbeddings from langchain_community for
|     creating embeddings
|   - SentenceSplitter from llama_index for chunking documents
|   - ChromaDB as a vector DB + chroma-ops to prune the DB
|   - sqlite3 for metadata
|   - FastAPI, Pydantic, Jinja2, Tailwind for API and
|     server-rendered webpages
|
| - jsdom and mozilla-readability for article extraction
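|
| The core of the search side is roughly the following (a
| trimmed-down sketch; collection/path names are placeholders, and
| chunking, metadata, and error handling are omitted):
|
|     import chromadb
|     from langchain_community.embeddings import HuggingFaceBgeEmbeddings
|
|     embedder = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en")
|     db = chromadb.PersistentClient(path="./index")
|     articles = db.get_or_create_collection("hn-small-sites")
|
|     def add_article(item_id: str, text: str, url: str) -> None:
|         articles.add(ids=[item_id],
|                      embeddings=[embedder.embed_documents([text])[0]],
|                      metadatas=[{"url": url}],
|                      documents=[text])
|
|     def search(query: str, k: int = 10):
|         return articles.query(
|             query_embeddings=[embedder.embed_query(query)],
|             n_results=k)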
|
| I generated the index locally on my M2 Mac which ripped through
| the ~70k articles in ~12 hours to generate all the embeddings.
|
| I run the search site with Podman on a VM from Hetzner--along
| with other projects--for ~$8 / month. All requests are handled on
| CPU w/o calls to external AI providers. Query times are <200 ms,
| which includes embedding generation, vector DB lookup, metadata
| retrieval, and page rendering. The server source code is here [3].
|
| Nice work @jnnnthnn! What you built is fast, the rankings were
| solid, and the summaries are convenient.
|
| [1] https://majestic.com/reports/majestic-million
|
| [2] https://github.com/awendland/hacker-news-small-
| sites/blob/ge...
|
| [3] https://github.com/awendland/hacker-news-small-sites-
| website...
| jnnnthnn wrote:
| Fun! Thanks for sharing! It's super fast and the bias toward
| smaller websites really does surface interesting things. Very
| reminiscent of the Kagi small web site, also!
| avereveard wrote:
| Doesn't seem like it's semantic:
| https://hackersearch.net/search?q=solutions+for+postgres+clu...
| Sure, the results are about Postgres, but none were relevant to
| clustering solutions.
| jnnnthnn wrote:
| I suspect this is probably because of a bit of a bias in the
| indexed dataset: at present, the indexed stories skew toward
| high-scoring ones, and at a glance I don't see that many
| stories about Postgres clustering in that distribution.
| avereveard wrote:
| Yeah, there are only three stories coming up from the site
| search, and none pick up things like Citus etc.
|
| https://hn.algolia.com/?q=postgres+clustering
|
| Only one is semantically correct; the others pick up the wrong
| sense of clustering (i.e. k-means instead of multi-master
| writes).
|
| But yeah, if one doesn't test the hard cases, how does one
| know it preserves semantics :D
| jnnnthnn wrote:
| In fairness, it's probably impossible to unambiguously
| determine what the intended/desired interpretation is
| (though intuitively it seems like k-means should be lower
| likelihood)!
| avereveard wrote:
| I've tried HyDE and it seems to work better; I had to do it
| client-side though. I asked ChatGPT: "write one sentence
| explanation about this topic: solutions for postgres
| clustering", which returned "Solutions for PostgreSQL
| clustering involve implementing methods such as streaming
| replication or third-party tools like Patroni to manage
| and distribute database workloads across multiple servers
| for enhanced performance and fault tolerance." Then I
| searched for that:
|
| https://hackersearch.net/search?q=Solutions+for+PostgreSQ
| L+c...
|
| and results are much better:
|
| 1. An overview of distributed Postgres architectures
| 2. A Technical Dive into PostgreSQL's replication mechanisms
| 3. Ways to capture changes in Postgres
|
| The HyDE paper is here: https://arxiv.org/abs/2212.10496
|
| It's possible that OpenAI embeddings are symmetric; if
| that's the case, you need to hallucinate some content and
| use that as the basis for the embedding distance calculation.
| Or you can move to asymmetric embeddings, or you can try
| prompting their embeddings.
|
| Edit: prompting the embedding seems to work. I tried searching
| for "write an article about: solutions for postgres
| clustering" and the results are much better: https://hackersea
| rch.net/search?q=write+an+article+about%3A+...
|
| You can try prepending "write an article about: " to all
| user searches :D
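|
| In code, the two query rewrites above are basically (a sketch;
| the model name is just an example):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def hyde_query(user_query: str) -> str:
|         # HyDE: embed a hallucinated one-sentence answer instead
|         # of the raw query.
|         return client.chat.completions.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user",
|                        "content": "write one sentence explanation "
|                                   "about this topic: " + user_query}],
|         ).choices[0].message.content
|
|     def prompted_query(user_query: str) -> str:
|         # Cheaper variant: just prefix the query before embedding it.
|         return "write an article about: " + user_query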
| jnnnthnn wrote:
| Sweet! Thanks for sharing. A prior implementation had
| HyDE running on user searches, but I found the results to
| be hit-or-miss depending on the query type.
|
| I definitely want to re-explore that though; I think it
| should be possible to do so a lot more rigorously now
| that I have a better sense for what people want to search
| for.
| jnnnthnn wrote:
| Relatedly, the "Follow up queries" in the right-hand
| floating insert are an attempt at finding a decent
| balance between recovering from failures & giving users
| enough control over the queries themselves :) See https://h
| ackersearch.net/search?q=postgres+clustering&period...
| avereveard wrote:
| You should really use this visibility to get a thumbs up
| / down near each result and use that as a validation set
| :D
| jnnnthnn wrote:
| Totally! I've come to deeply dislike thumbs up/down UXes,
| so I'm collecting that by recording clicks on results!
| simonw wrote:
| It looks like you generated an LLM summary of every page you
| indexed. What model did you use for that, and how much did it
| cost?
| jnnnthnn wrote:
| Hi Simon! Big fan of your blog.
|
| I actually generate two summaries: one is part of the ingestion
| pipeline and used for indexing and embedding, and another is
| generated on-the-fly based on user queries (the goal there is
| to "reconcile" the user query with each individual item being
| suggested).
|
| I use GPT-3.5 Turbo, which works well enough for that purpose.
| Cost of generating the original summaries from raw page
| contents came down to about $0.01 per item. That could add up
| quickly, but I was lucky enough to have some OpenAI credits
| lying around so I didn't have to think much about this or
| explore alternative options.
|
| GPT-4 would produce nicer summaries for the user-facing
| portion, but the latency and costs are too high for production.
| With GPT-3.5, however, those are super cheap since they require
| very few tokens (they operate off of the original summaries
| mentioned above).
|
| Worth noting that I've processed stories by score descending,
| and didn't process anything under 50 points, which substantially
| reduced the number of tokens to process.
| rdli wrote:
| This is cool! I've been trying out bits & pieces of the RAG
| ecosystem, too, exploring this space.
|
| Here's a question for this crowd: Do we see domain/personalized
| RAG as the future of search? In other words, instead of Google,
| you go to your own personal LLM, which has indexed all of the
| content you care about (whether it's everything from HN, or an
| extra informative blog post, or ...)? I personally think this
| would be great. I would still use Google for general-purpose
| search, but a lot of my search needs are trying to remember that
| really interesting article someone posted to HN a year ago that
| is germane to what I'm doing now.
| jnnnthnn wrote:
| I definitely think there are opportunities to provide more
| useful & personalized search than what Google offers for at
| least some queries.
|
| Quality aside, I think the primary challenge is in figuring out
| the right UX for delivering that at scale. One of the really
| great advantages of Google is that it is right there in your
| URL bar, and that for many of the searches you might do, it
| works just fine. Figuring out when it doesn't, and how to
| provide better results _then_, seems like the big unsolved UX
| component of personalized search.
| codethief wrote:
| Nice work!
|
| Unfortunately, though, it didn't find what I was looking for in
| the following real-world test case: The other day I tried to
| remember the name of a SaaS to pin/cache/back up my apt/apk/pip
| dependencies, which I think I had read about either here[0] or
| here[1]. After quite a bit of time and some elaborate Google-fu,
| I did end up finding those HN threads again. However, they did
| not show up on hackersearch.net for me, neither when entering the
| service's name nor when I searched for "deterministic Docker
| builds" or "cache apt apk pip dependencies".
|
| [0]: https://news.ycombinator.com/item?id=39684416
|
| [1]: https://news.ycombinator.com/item?id=39723888
| jnnnthnn wrote:
| Thanks for the feedback! I made a cutoff at stories with <50
| points when ingesting content, and it looks like that cut off
| what you were looking for. Definitely not great.
|
| I'm planning to fix that in short order, feel free to sign up
| at https://hackersearch.net/signup if you care to receive an
| update when that goes live!
| codethief wrote:
| Ah, I see. Cool that you're already working on a fix!
| Shouldn't the second link have gotten picked up, though?
| jnnnthnn wrote:
| Only indexing stories for now, not yet comments. That too
| will get fixed!
| beefman wrote:
| Can anyone exhibit a search that works better here than on
| Algolia?
| jnnnthnn wrote:
| Try any of the examples listed on the landing page. You can
| easily access the HN/Algolia search equivalent by clicking on
| "Try Hacker News search instead" to the right (on desktop).
|
| Keyword searches on the Algolia engine will generally result in
| better recall -- at least when identifying the right keyword is
| easy, e.g. the name of a company. They likely will require more
| sifting through results & keyword "engineering" however.
|
| In my mind the two approaches are complementary. I suppose
| there's an argument for working more directly towards blending
| them :)
| beefman wrote:
| Thanks; don't know how I missed that.
|
| Trying the examples now, semantic search usually works
| better. But if I trim extra phrasing (e.g. how do diffusion
| models work -> diffusion models) they're about the same (but
| Algolia is much faster).
| curious_cat_163 wrote:
| So, I played with it some more. I think that this is a good
| starting point. You can tune various parameters for what you have
| indexed and it will get better. I am sure it will evolve in
| interesting directions from here.
|
| > e.g. more explicitly searching for technical opinions...
|
| Yes, please! I would love to be able to search for strongly held
| opinions by folks who _know_ what they are talking about.
|
| > recommending stories to you based on your interests...
|
| I am curious how, in principle, you would do that? Where do
| you think the signal that indicates my "interest" lies?
| jnnnthnn wrote:
| Thank you!
|
| To learn your interests we'd at a minimum need to know what HN
| stories you tend to click or comment on, e.g. via a separate
| reader view or a browser extension. Presumably your
| comments and submissions could provide useful signal as well :)
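|
| One naive way to do it (illustrative only): average the
| embeddings of the stories you clicked on, then rank new stories
| by similarity to that centroid.
|
|     import numpy as np
|
|     def interest_profile(clicked: list[list[float]]) -> np.ndarray:
|         # Centroid of the embeddings of previously clicked stories.
|         return np.mean(np.array(clicked), axis=0)
|
|     def recommend(profile: np.ndarray, candidates: list[dict], k: int = 10):
|         def cos(a, b):
|             return float(np.dot(a, b) /
|                          (np.linalg.norm(a) * np.linalg.norm(b)))
|         return sorted(candidates,
|                       key=lambda c: cos(profile, np.array(c["embedding"])),
|                       reverse=True)[:k]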
| HanClinto wrote:
| Nice work! Love seeing RAG work get developed!
|
| What about using the embeddings for nearest-neighbor search for
| similar articles? I.e., for any given article, can you use that
| article's embedding to run a search, rather than encoding my
| query? That would let me find similar / related articles much
| more easily.
| jnnnthnn wrote:
| Thanks!
|
| Yup, totally feasible. I might add that!
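|
| With pgvector it's essentially a one-query feature (made-up
| schema, not the site's actual code):
|
|     import psycopg
|
|     SIMILAR = """
|     SELECT title, url
|     FROM stories
|     WHERE id <> %(id)s
|     ORDER BY embedding <=>
|              (SELECT embedding FROM stories WHERE id = %(id)s)
|     LIMIT 10;
|     """
|
|     def similar_articles(conn: psycopg.Connection, story_id: int):
|         return conn.execute(SIMILAR, {"id": story_id}).fetchall()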
| robrenaud wrote:
| It feels pretty good. I did some reading on some high quality
| posts about chess that I found through it.
|
| What was the biggest thing you learned while implementing this?
| Was anything surprisingly difficult? Was there anything that
| worked better than you expected?
| jnnnthnn wrote:
| Thanks for trying it out!
|
| > What was the biggest thing you learned while implementing
| this?
|
| How much the quality of the data and resulting indexes matter.
| My impression based on this experience is that "RAG" might be a
| cohesive set of techniques, but applying them well is likely very
| domain-specific.
|
| > Was anything surprisingly difficult?
|
| Evaluating results is very tedious, almost by definition: you
| need to figure out ground truth by some mechanism and build
| evaluation datasets from there. To be honest, a lot of this
| beta was built on "vibes" only.
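|
| Even a tiny hand-labeled set would go a long way, e.g. (a sketch;
| the labeled pair below is made up):
|
|     # Map a query to the set of URLs a "good" search should surface,
|     # then measure recall@k against whatever search_fn returns.
|     LABELED = {
|         "how do diffusion models work":
|             {"https://example.com/diffusion-explainer"},  # made-up URL
|     }
|
|     def recall_at_k(search_fn, k: int = 10) -> float:
|         hits = 0
|         for query, expected in LABELED.items():
|             urls = {r["url"] for r in search_fn(query)[:k]}
|             hits += bool(urls & expected)
|         return hits / len(LABELED)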
|
| > Was there anything that worked better than you expected?
|
| Modern embeddings are really magical. I'd previously worked
| with TF-IDF (a decade or so ago) and Doc2Vec (6-7 years ago),
| and while those were surprisingly useful, they really pale
| compared to what LLM embeddings can encode in very dense
| representations.
___________________________________________________________________
(page generated 2024-05-02 23:01 UTC)