[HN Gopher] Long Read: Lessons from Building Semantic Search for...
___________________________________________________________________
Long Read: Lessons from Building Semantic Search for GitHub and Why
I Failed
Author : zxt_tzx
Score : 90 points
Date : 2025-03-08 12:23 UTC (10 hours ago)
(HTM) web link (tzx.notion.site)
(TXT) w3m dump (tzx.notion.site)
| zxt_tzx wrote:
| Author here. Over the last few months, I have built and launched
| a free semantic search tool for GitHub called SemHub
| (https://semhub.dev/). In this blog post, I share what I've
| learned and why I've failed, so that other builders can learn
| from my experience. This blog post runs long and I have signposted
| each section. I have marked the sections that I consider
| particularly insightful with an asterisk (*).
|
| I have also summarized my key lessons here:
|
| 1. Default to pgvector, avoid premature optimization.
|
| 2. You probably can get away with shorter embeddings if you're
| using Matryoshka embedding models.
|
| 3. Filtering with vector search may be harder than you expect.
|
| 4. If you love full stack TypeScript and use AWS, you'll love
| SST. One day, I hope I can recommend Cloudflare in equally strong
| terms too.
|
| 5. Building is only half the battle. You have to solve a big
| enough problem and meet your users where they're at.
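Lesson 2 can be made concrete with a minimal sketch. Matryoshka-style models are trained so that the leading dimensions of an embedding carry most of the signal, which is why a vector can be shortened by keeping its first few components and re-normalizing. The vectors below are made up for illustration and no real embedding model is called; `truncate_embedding` and `cosine` are hypothetical helper names, not part of SemHub or any library.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional "embeddings" (assumed values, not model output).
full_a = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.0]
full_b = [0.8, 0.4, 0.1, 0.02, 0.03, 0.0, 0.02, 0.01]

# Halve the storage per vector by keeping only the first 4 dimensions.
short_a = truncate_embedding(full_a, 4)
short_b = truncate_embedding(full_b, 4)

print("full  similarity:", cosine(full_a, full_b))
print("short similarity:", cosine(short_a, short_b))
```

Whether the shortened vectors rank results well enough is an empirical question for each model and dataset; the sketch only shows the mechanics of truncation.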
| cynicalsecurity wrote:
| With 5 you mean promoting the app? It is by far the biggest
| problem, yes. In many cases even bigger than building the app
| itself.
| vaidhy wrote:
| Having built a failed semantic search engine for life sciences
| (bioask when it existed), I think the last point should be the
| first. Not getting a product market fit very quickly killed
| mine.
| fulafel wrote:
| SST: https://github.com/sst/sst - vaguely similar to CDK but
| can also manage some non-AWS resources and seems TypeScript-
| only
| niel wrote:
| Thanks for writing this up!
|
| > Filtering with vector search may be harder than you expect.
|
| I've only ever used it for a small proof of concept, but Qdrant
| is great at _categorical_ filtering with HNSW.
|
| https://qdrant.tech/articles/filtrable-hnsw/
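One reason filtering is harder than it looks can be shown with a toy brute-force sketch. This is not Qdrant's API and not HNSW; it is plain Python over made-up data, and all names are hypothetical. If the filter runs after the top-k nearest neighbours are chosen, matching items can be crowded out entirely, whereas filtering first always returns the best matching candidates.

```python
def dot(a, b):
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Exact nearest neighbours by brute force (stand-in for a vector index)."""
    return sorted(items, key=lambda it: dot(query, it["vec"]), reverse=True)[:k]


def post_filter(query, items, k, predicate):
    """Search first, then drop results that fail the filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]


def pre_filter(query, items, k, predicate):
    """Apply the filter first, then search only the matching items."""
    return top_k(query, [it for it in items if predicate(it)], k)


# Made-up issues from two repos with 2-dimensional toy vectors.
items = [
    {"id": 1, "repo": "a", "vec": [1.0, 0.0]},
    {"id": 2, "repo": "a", "vec": [0.9, 0.1]},
    {"id": 3, "repo": "b", "vec": [0.5, 0.5]},
    {"id": 4, "repo": "b", "vec": [0.1, 0.9]},
]
query = [1.0, 0.0]


def is_repo_b(item):
    return item["repo"] == "b"


post = post_filter(query, items, 2, is_repo_b)
pre = pre_filter(query, items, 2, is_repo_b)

print("post-filter ids:", [it["id"] for it in post])
print("pre-filter ids: ", [it["id"] for it in pre])
```

With an approximate index, pre-filtering is not this simple, because the index structure is built over all points rather than over the filtered subset. That is the problem the linked article addresses.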
| wrs wrote:
| Fantastic writeup -- thank you for taking the time to do this!
| smarx007 wrote:
| Hi, thanks for building a great tool and a great write-up! I
| was trying to add a number of repos under the oslc/*, oslc-op/*,
| and eclipse-lyo/* orgs but no joy - internal server error.
| Hopefully, you will reconsider shutting down the project (just
| heard about it and am quite excited)!
|
| I think a project like yours is going to be helpful to OSS
| library maintainers to see which features are used in
| downstream projects and which have issues. Especially, as in my
| case, when the project attempts to advance an open standard and
| just checking issues in the main repo will not give you the
| full picture. For this use case, I deployed my own instance to
| index all OSS repos implementing OSLC REST or using our Lyo SDK
| - https://oslc-sourcebot.berezovskyi.me/ . I think your tool is
| great in complementing the code search.
| johnfn wrote:
| That was a great write up.
|
| If you don't mind me giving you some unsolicited product
| feedback: I think SemHub didn't do well because it's unclear what
| problem it's actually solving. Who actually wants your product?
| What's the use case? I use GitHub issues all the time, and I
| can't think of a reason I'd want semhub. If I need to find a
| particular issue on, say, TypeScript, I'll just google "github
| typescript issue [description]" and pull up the correct thing 9
| times out of 10. And that's already a pretty small fraction of
| the time I spend on GitHub.
| kevmo314 wrote:
| It's somewhat ironic that the author advocates for keeping it
| simple and using pgvector but then buries a ton of complexity
| with an API server, auth server, Cloudflare workers, and durable
| objects. Especially given
|
| > Supabase easily the most expensive part of my stack (at
| $200/month, if we ran it in XL, i.e. the lowest tier with 4-core
| CPU)
|
| That could get you a pretty decent VPS and allow you to
| coassemble everything with less complexity. This is exemplified
| in some of the gotchas, like
|
| > Cloudflare Workers demand an entirely different pattern, even
| compared to other serverless runtimes like Lambda
|
| If I'm hacking something together, learning an entirely different
| pattern for some third-party service is the last thing I want to
| do.
|
| All that being said though, maybe all it would've done is prolong
| the inevitable death due to the product gap the author concludes
| with.
| franky47 wrote:
| I started a quick weekend project to do just that today: index my
| OSS project's [1] issues & discussions, so I can RAG-ask it to
| find references when I feel like I'm repeating myself (in "see
| issue/PR/discussion #123", finding the 123 is the hardest part).
|
| This article might be super helpful, thanks! I don't intend to
| make a product out of it though, so I can cut a lot of corners,
| like using a PAT for auth and running everything locally.
|
| [1] https://github.com/47ng/nuqs
| nosefrog wrote:
| > When using Cloudflare Workers as an API server, I have
| experienced requests that would "fail silently" and leave a
| "hanging connection", with no error thrown, no log emitted, and a
| frontend that is just loading. Honestly, no idea what's up with
| this.
|
| Yikes, these sorts of errors are so hard to debug. Especially if
| you don't have a real server to log into to get pcaps.
| viraptor wrote:
| Cloudflare workers are not amazing in terms of communicating
| problems. The errors you get can also be out of sync with the
| docs and the support doesn't have access to poke at your issues
| directly. Together with the custom runtime and outdated TS
| types... it can be a very frustrating DX.
| sebmellen wrote:
| We've tried, but it's hard to imagine any real production
| system using Cloudflare Workers.
| serjester wrote:
| Great write up, especially agree on pgvector with small (ideally
| fine tuned) embeddings. There's so much complexity that comes
| with keeping your vector db in sync with your main db (especially
| once you start filtering with metadata). 90% of gen AI apps don't
| need it.
| brian-armstrong wrote:
| Am I misunderstanding what is meant by semantic code search? I
| thought the idea was that you run something like a parser on the
| repo to extract function/class/variable names and then allow
| searching on a more rich set of data, rather than tokenizing it
| like English.
|
| I know github kind of added this but their version falls apart
| still even in common languages like C++. It's not unusual for it
| to just completely miss cross references, even in smaller repos.
| A proper compiler's eye view of symbolic data would be super
| useful, and Github's halfway attempt can be frustratingly daft
| about it.
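The parser-based approach described in this comment can be sketched with Python's standard `ast` module on a made-up snippet. This is only an illustration of symbol extraction, not how GitHub or SemHub work, and `extract_symbols` is a hypothetical helper. It collects definitions only; real cross-referencing needs name resolution across files, which is not shown.

```python
import ast

# Made-up source file to index.
SOURCE = '''
class Index:
    def add(self, doc):
        self.count = 1

def search(query):
    results = []
    return results
'''


def extract_symbols(source):
    """Return (kind, name, line) tuples for classes, functions and plain variables."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # Only plain name assignments; attribute targets such as
            # `self.count` are ast.Attribute nodes and are skipped here.
            symbols.append(("variable", node.id, node.lineno))
    # Sort by line number so the output follows the source order.
    return sorted(symbols, key=lambda s: (s[2], s[0]))


symbols = extract_symbols(SOURCE)
for kind, name, line in symbols:
    print(f"{line:>3}  {kind:<8}  {name}")
```

A search index built on tuples like these can answer "where is `search` defined?" exactly, which is a different problem from ranking issue text by embedding similarity.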
| nchmy wrote:
| This seems pretty similar to something that the ManticoreSearch
| team released a year ago
|
| https://manticoresearch.com/blog/manticoresearch-github-issu...
|
| You can index any GH repo and then search it with vector,
| keyword, hybrid and more. There's faceting and anything else you
| could ever want. And it is astoundingly fast - even vector
| search.
|
| Here's the direct link to the demo
| https://github.manticoresearch.com/
___________________________________________________________________
(page generated 2025-03-08 23:00 UTC)