[HN Gopher] Long Read: Lessons from Building Semantic Search for...
       ___________________________________________________________________
        
       Long Read: Lessons from Building Semantic Search for GitHub and Why
       I Failed
        
       Author : zxt_tzx
       Score  : 90 points
       Date   : 2025-03-08 12:23 UTC (10 hours ago)
        
 (HTM) web link (tzx.notion.site)
 (TXT) w3m dump (tzx.notion.site)
        
       | zxt_tzx wrote:
       | Author here. Over the last few months, I have built and launched
       | a free semantic search tool for GitHub called SemHub
       | (https://semhub.dev/). In this blog post, I share what I've
       | learned and why I've failed, so that other builders can learn
       | from my experience. This blog post runs long and I have
       | signposted each section. I have marked the sections that I
       | consider particularly insightful with an asterisk (*).
       | 
       | I have also summarized my key lessons here:
       | 
       | 1. Default to pgvector, avoid premature optimization.
       | 
       | 2. You probably can get away with shorter embeddings if you're
       | using Matryoshka embedding models.
       | 
       | 3. Filtering with vector search may be harder than you expect.
       | 
       | 4. If you love full stack TypeScript and use AWS, you'll love
       | SST. One day, I hope I can recommend Cloudflare in equally
       | strong terms too.
       | 
       | 5. Building is only half the battle. You have to solve a big
       | enough problem and meet your users where they're at.
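
Lesson 2 above can be illustrated with a minimal sketch. It assumes the embedding model was trained with a Matryoshka objective, so that the leading dimensions of each vector carry most of the signal; for other models, truncating like this may hurt quality badly. The function name `truncate_embedding` and the toy vector are hypothetical and are not taken from the blog post.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize to unit length.

    Assumption: `vec` comes from a Matryoshka-style model, where a prefix
    of the vector is itself a usable (lower-fidelity) embedding.
    """
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


# Toy 3-dimensional "embedding", shortened to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Re-normalizing after truncation matters when the index uses cosine or inner-product distance, since the shortened vector no longer has unit length on its own.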
        
         | cynicalsecurity wrote:
         | By 5, do you mean promoting the app? It is by far the biggest
         | problem, yes. In many cases even bigger than building the app
         | itself.
        
         | vaidhy wrote:
         | Having built a failed semantic search engine for life sciences
         | (bioask when it existed), I think the last point should be the
         | first. Not finding product-market fit quickly enough killed
         | mine.
        
         | fulafel wrote:
         | SST: https://github.com/sst/sst - vaguely similar to CDK but
         | can also manage some non-AWS resources and seems TypeScript-
         | only
        
         | niel wrote:
         | Thanks for writing this up!
         | 
         | > Filtering with vector search may be harder than you expect.
         | 
         | I've only ever used it for a small proof of concept, but Qdrant
         | is great at _categorical_ filtering with HNSW.
         | 
         | https://qdrant.tech/articles/filtrable-hnsw/
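
One reason filtering can be harder than expected is sketched below, as a brute-force stand-in rather than real HNSW, Qdrant, or pgvector code. If the filter is applied only after taking the global top-k nearest neighbours, the result can contain fewer than k items, or none. The item fields (`state`, `vec`) and both function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def post_filter_search(query, items, k, predicate):
    """Take the global top-k by similarity first, then drop non-matches."""
    ranked = sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)
    return [it for it in ranked[:k] if predicate(it)]


def pre_filter_search(query, items, k, predicate):
    """Restrict to matching items first, then take the top-k among them."""
    candidates = [it for it in items if predicate(it)]
    ranked = sorted(candidates, key=lambda it: cosine(query, it["vec"]), reverse=True)
    return ranked[:k]


# Toy data: the two items closest to the query are both "closed".
ITEMS = [
    {"id": 1, "state": "closed", "vec": [1.0, 0.0]},
    {"id": 2, "state": "closed", "vec": [0.9, 0.1]},
    {"id": 3, "state": "open", "vec": [0.5, 0.5]},
    {"id": 4, "state": "open", "vec": [0.0, 1.0]},
]


def is_open(item):
    return item["state"] == "open"


post_ids = [it["id"] for it in post_filter_search([1.0, 0.0], ITEMS, 2, is_open)]
pre_ids = [it["id"] for it in pre_filter_search([1.0, 0.0], ITEMS, 2, is_open)]
```

With this toy data, post-filtering returns nothing because both of the top-2 hits are closed, while pre-filtering returns the two open items. An approximate index cannot cheaply do the exhaustive pre-filter shown here, which is roughly the gap that filter-aware indexes try to close.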
        
         | wrs wrote:
         | Fantastic writeup -- thank you for taking the time to do this!
        
         | smarx007 wrote:
         | Hi, thanks for building a great tool and a great write-up! I
         | was trying to add a number of repos under oslc/*, oslc-op/*,
         | and eclipse-lyo/* orgs but no joy - internal server error.
         | Hopefully, you will reconsider shutting down the project (just
         | heard about it and am quite excited)!
         | 
         | I think a project like yours is going to be helpful to OSS
         | library maintainers to see which features are used in
         | downstream projects and which have issues. Especially, as in my
         | case, when the project attempts to advance an open standard and
         | just checking issues in the main repo will not give you the
         | full picture. For this use case, I deployed my own instance to
         | index all OSS repos implementing OSLC REST or using our Lyo SDK
         | - https://oslc-sourcebot.berezovskyi.me/ . I think your tool is
         | great in complementing the code search.
        
       | johnfn wrote:
       | That was a great write up.
       | 
       | If you don't mind me giving you some unsolicited product
       | feedback: I think SemHub didn't do well because it's unclear what
       | problem it's actually solving. Who actually wants your product?
       | What's the use case? I use GitHub issues all the time, and I
       | can't think of a reason I'd want semhub. If I need to find a
       | particular issue on, say, TypeScript, I'll just google "github
       | typescript issue [description]" and pull up the correct thing 9
       | times out of 10. And that's already a pretty rare percentage of
       | the time I spend on GitHub.
        
       | kevmo314 wrote:
       | It's somewhat ironic that the author advocates for keeping it
       | simple and using pgvector but then buries a ton of complexity
       | with an API server, auth server, Cloudflare workers, and durable
       | objects. Especially given
       | 
       | > Supabase easily the most expensive part of my stack (at
       | $200/month, if we ran it in XL, i.e. the lowest tier with 4-core
       | CPU)
       | 
       | That could get you a pretty decent VPS and allow you to
       | colocate everything with less complexity. This is exemplified
       | in some of the gotchas, like
       | 
       | > Cloudflare Workers demand an entirely different pattern, even
       | compared to other serverless runtimes like Lambda
       | 
       | If I'm hacking something together, learning an entirely different
       | pattern for some third-party service is the last thing I want to
       | do.
       | 
       | All that being said though, maybe all it would've done is prolong
       | the inevitable death due to the product gap the author concludes
       | with.
        
       | franky47 wrote:
       | I started a quick weekend project to do just that today: index my
       | OSS project's [1] issues & discussions, so I can RAG-ask it to
       | find references when I feel like I'm repeating myself (in "see
       | issue/PR/discussion #123", finding the 123 is the hardest part).
       | 
       | This article might be super helpful, thanks! I don't intend to
       | make a product out of it though, so I can cut a lot of corners,
       | like using a PAT for auth and running everything locally.
       | 
       | [1] https://github.com/47ng/nuqs
        
       | nosefrog wrote:
       | > When using Cloudflare Workers as an API server, I have
       | experienced requests that would "fail silently" and leave a
       | "hanging connection", with no error thrown, no log emitted, and a
       | frontend that is just loading. Honestly, no idea what's up with
       | this.
       | 
       | Yikes, these sorts of errors are so hard to debug. Especially if
       | you don't have a real server to log into to get pcaps.
        
         | viraptor wrote:
         | Cloudflare workers are not amazing in terms of communicating
         | problems. The errors you get can also be out of sync with the
         | docs and the support doesn't have access to poke at your issues
         | directly. Together with the custom runtime and outdated TS
         | types... it can be a very frustrating DX.
        
           | sebmellen wrote:
           | We've tried, but it's hard to imagine any real production
           | system using Cloudflare Workers.
        
       | serjester wrote:
       | Great write up, especially agree on pgvector with small (ideally
       | fine tuned) embeddings. There's so much complexity that comes
       | with keeping your vector db in sync with your main db (especially
       | once you start filtering with metadata). 90% of gen AI apps don't
       | need it.
        
       | brian-armstrong wrote:
       | Am I misunderstanding what is meant by semantic code search? I
       | thought the idea was that you run something like a parser on the
       | repo to extract function/class/variable names and then allow
       | searching on a more rich set of data, rather than tokenizing it
       | like English.
       | 
       | I know GitHub kind of added this but their version still falls
       | apart even in common languages like C++. It's not unusual for it
       | to just completely miss cross references, even in smaller repos.
       | A proper compiler's eye view of symbolic data would be super
       | useful, and GitHub's halfway attempt can be frustratingly daft
       | about it.
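
The parser-based indexing described in this comment can be sketched as follows. This is a minimal illustration that uses Python's standard `ast` module on Python source as a stand-in; the comment is about C++, which would need a real compiler front end. The function name `extract_symbols` and the sample source are hypothetical.

```python
import ast


def extract_symbols(source):
    """Return (kind, name, line) tuples for definitions found in `source`.

    Classes, functions, and assigned variable names are collected by
    walking the syntax tree instead of tokenizing the text like English.
    """
    tree = ast.parse(source)
    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(("class", node.name, node.lineno))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            symbols.append(("variable", node.id, node.lineno))
    # Sort by line number so the output follows the source order.
    return sorted(symbols, key=lambda s: (s[2], s[0], s[1]))


SAMPLE = '''class Index:
    def search(self, query):
        hits = []
        return hits
'''

symbols = extract_symbols(SAMPLE)
```

A symbol table like this only covers definitions; resolving cross references, which is where the comment says GitHub's version falls short, would additionally require scope and type information.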
        
       | nchmy wrote:
       | This seems pretty similar to something that the ManticoreSearch
       | team released a year ago
       | 
       | https://manticoresearch.com/blog/manticoresearch-github-issu...
       | 
       | You can index any GH repo and then search it with vector,
       | keyword, hybrid and more. There's faceting and anything else you
       | could ever want. And it is astoundingly fast - even vector
       | search.
       | 
       | Here's the direct link to the demo
       | https://github.manticoresearch.com/
        
       ___________________________________________________________________
       (page generated 2025-03-08 23:00 UTC)