[HN Gopher] Show HN: Feep! search, an independent search engine ...
___________________________________________________________________
Show HN: Feep! search, an independent search engine for programmers
Hi HN! This started late last year as an afternoon project to play
around with ElasticSearch, and then I kept thinking of new features
I wanted to add. I still have a lot of things I want to build, but
now seemed like a good time to put it out there: even if the
results aren't nearly the quality I'd like, I've still found it
useful and I want to show it off! I've been working on it since
September 2021, but only in fits and starts. The entire thing runs
on a computer in my living room (there's a picture on the About
page); I haven't done any load testing so we'll see how it holds
up.
Author : wolfgang42
Score : 91 points
Date : 2022-11-06 16:25 UTC (6 hours ago)
(HTM) web link (search.feep.dev)
| brokenkebab2 wrote:
| Frankly it doesn't look like it's ready to be useful. As an
| example, I tried "Braze notifications" and the first result was
| about Brave, then two mildly relevant ones, and then a long stretch
| of "Who's hiring?" threads from HN which seem to mention only
| notifications.
| wolfgang42 wrote:
| The title mentioning "Brave" seems to be a red herring: there's
| someone in the comments talking about Braze, though it looks
| like a typo. Similarly the "Who's hiring" posts do actually
| have job listings for Braze, but you have to click through the
| More link at the bottom to find them. (Because I load HN
| directly from a data dump, the search doesn't know about the
| pagination.)
|
| I think the main problem here is that my index is relatively
| small: it has only (!) 30 million pages, and it looks like
| Braze just isn't popular enough for me to have run into it with
| the right keywords yet.
| pcl wrote:
| This looks great!
|
| MDN docs are pretty strong. Perhaps devdocs is a superset, but if
| not, I'd recommend indexing them as well.
|
| Also, feature request: it'd be nice if the query help unfolded
| the instructions in-line with the current page instead of
| navigating to another page. That way, I would be able to see them
| while mucking with my query.
| wolfgang42 wrote:
| Glad you like it! Devdocs includes Mozilla's docs, for JS, CSS,
| DOM, SVG, and others. (But my ranking algorithm doesn't
| understand that "mdn" is a synonym for "developer.mozilla.org"
| so it's hard to surface them explicitly.)
|
| Thanks for the feature request--I don't have any frontend JS
| set up yet that I could easily add this to, but I can see how
| this could be useful and I'll put it on my list.
| pcl wrote:
| Personally, I'd rather click a link to MDN docs than most
| other sources, so if you had some way to expose pass-through
| attribution from devdocs, that'd be useful at least for me.
|
| I wonder how many engineers think about search results link
| origin before clicking through.
| wolfgang42 wrote:
| Although I'm sourcing the crawl data from devdocs, the
| ingestion process uses the upstream URL, so the search
| results link to developer.mozilla.org with the appropriate
| favicon.
|
| I've heard enough complaints about W3Schools and other SEO-
| heavy but accuracy-light sources that I suspect a fair
| proportion of technically-minded users probably consider
| the domain before clicking on a result link.
| pcl wrote:
| Yeah w3schools is the very reason I avoid Google search
| for code questions.
| culi wrote:
| Other sources you may wanna crawl
|
| - https://www.thecodingforums.com/ (and other programming-related
| online communities like Lobste.rs, certain subreddits, and
| certain Lemmy instances/communities)
|
| - https://pldb.com/ (might be a good way to automatically get all
| the docs of each programming language as well as
| books/videos/publications that mention a certain language)
| mdaniel wrote:
| Congratulations, I always enjoy new search engines
|
| W.r.t. "and updated intermittently," I wanted to draw your
| attention to the HN realtime API:
| https://github.com/HackerNews/API#live-data and also that S.O.
| offers Atom Feeds: https://stackoverflow.com/feeds/ _(I'd guess
| the rest do, too, but I didn't verify)_
|
| I am a huge proponent of taking advantage of any update features
| that a site offers, because otherwise the "how about now?" of re-
| crawling is wasteful to both parties.
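|
| For illustration, a minimal Python sketch of polling for new items
| instead of re-crawling. It uses the maxitem and item endpoints
| described in the linked HN API docs; the names fetch_json and
| new_items are made up here, and this is not code from the site.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

# Base URL of the public Hacker News API (see the linked GitHub docs).
HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url: str):
    """Fetch one URL and decode its JSON body (needs network access)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def new_items(last_seen_id: int,
              fetch: Callable[[str], object] = fetch_json) -> Iterator[dict]:
    """Yield every item created after last_seen_id, oldest first.

    `fetch` is injectable so the loop can be exercised without the network.
    """
    max_id = fetch(f"{HN_API}/maxitem.json")  # largest item id so far
    for item_id in range(last_seen_id + 1, max_id + 1):
        item = fetch(f"{HN_API}/item/{item_id}.json")
        if item is not None:  # a null body means there is nothing to index
            yield item
```

| A caller would persist the highest id it has seen and pass it back
| in on the next run, so each poll only requests items it has not
| ingested yet.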
| wolfgang42 wrote:
| My current architecture is built around batch ingestion[1], and
| doesn't (yet) have a way to do incremental updates. This is
| great for getting coverage--there are a lot of long-tail
| results in my search engine but not in Google!--but it does
| mean there's more lag and the results aren't instantly up-to-
| date.
|
| [1]: e.g. for StackOverflow, I download an XML dump of the
| entire site once a quarter:
| https://search.feep.dev/blog/post/2021-09-04-stackexchange
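|
| As a rough sketch of what batch ingestion from such a dump can look
| like (in Python rather than the site's own Node.js, and not the
| author's code): the Stack Exchange dumps ship a Posts.xml file of
| <row> elements, where PostTypeId="1" marks a question. The function
| name and the output dict shape below are assumptions.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_questions(posts_xml: BinaryIO) -> Iterator[dict]:
    """Stream question rows out of a Posts.xml file, one at a time."""
    for _event, elem in ET.iterparse(posts_xml, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            yield {
                "id": int(elem.get("Id")),
                "title": elem.get("Title", ""),
                "body": elem.get("Body", ""),  # HTML, already unescaped
            }
        elem.clear()  # discard parsed data that is no longer needed


# Tiny in-memory stand-in for a real dump file.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Title="How?" Body="&lt;p&gt;x&lt;/p&gt;"/>'
    b'<row Id="2" PostTypeId="2" Body="an answer"/>'
    b'</posts>'
)
questions = list(iter_questions(sample))
```

| Streaming with iterparse matters here because a full-site dump is
| far too large to parse into memory in one go.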
| aliqot wrote:
| Hey this is pretty cool. I noticed a lot of the results are HN
| posts and StackExchange. Is there a definitive list somewhere of
| the sources, and/or maybe a way to contribute to those?
|
| Thanks for showing this to us, I like where your head's at!
|
| Edit: found it, it was explained on the front page.
| https://search.feep.dev/about/datasources
| wolfgang42 wrote:
| I can pretty easily add any StackExchange sites I left out, or
| anything that comes as a Zim file (e.g. from
| https://farm.openzim.org/recipes or the like). If it'd be
| appropriate for https://devdocs.io (official docs with suitable
| licensing), you can contribute a crawler to them and it'll flow
| downstream to me.
|
| I also have plans to do proper web crawling, though it'll take
| me a while to get there:
| https://search.feep.dev/blog/post/2022-08-10-crawling-roadma...
| bragr wrote:
| Is there a particular reason this site does not index the
| official documentation for languages and frameworks? Tried a
| couple different searches for the things I work on and mostly got
| HN and stack overflow posts that aren't really responsive to my
| query.
|
| https://search.feep.dev/about/datasources
|
| Edit: it does appear that devdocs.io has the docs I'm
| interested in, but they don't appear to be surfaced in the first
| several pages of results at least. A good example for this is
| searching "python datetime" which does not actually return links
| to datetime docs, just a lot of HN and SO posts referencing
| datetime.
| wolfgang42 wrote:
| "python library datetime" gets the results you're looking for--
| but mostly by virtue of this not being a way anyone ordinarily
| thinks of describing it, so it knocks the irrelevant results
| down in the rankings. I think there are a couple of things
| going on here:
|
| - The ranking algorithm I'm using isn't great at distinguishing
| pages _about_ a topic from pages which merely _mention_ a topic
| in passing.
|
| - Because the Python docs are versioned, the PageRank they
| deserve is spread out over several URLs and they appear less
| relevant than they really are.
|
| I have plans to fix both of these problems, but they're pretty
| involved and I haven't had the time to dig into the matter yet.
| For the moment, it's definitely a gamble whether the results
| will be any good: sometimes they're great, and other times
| they're completely useless. (There's a reason I put links to
| other search engines at the bottom of the results page!)
| Kwpolska wrote:
| For a programming search engine, the official docs of
| languages should get special treatment. Google often surfaces
| outdated versions of the documentation, but they're usually
| at the top. If you want to improve on this, you should (a)
| rank official docs the highest, (b) give extra weight to
| docs.python.org if the query contains "python", (c) merge the
| same page for different versions and add a version picker.
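|
| Suggestions (a) and (b) could be sketched roughly as below. This is
| only an illustration: the keyword-to-domain table, the boost factor
| of 2.0, and the function names are all assumptions, and a real
| ranker would need a much larger table and tuned weights.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

# Hypothetical table: query keyword -> host of the official documentation.
OFFICIAL_DOCS = {
    "python": "docs.python.org",
    "django": "docs.djangoproject.com",
    "rust": "doc.rust-lang.org",
}


def boosted_score(query: str, url: str, base_score: float,
                  boost: float = 2.0) -> float:
    """Multiply the score when the URL is the official docs for a query word."""
    host = urlsplit(url).hostname or ""
    for word in query.lower().split():
        if OFFICIAL_DOCS.get(word) == host:
            return base_score * boost
    return base_score


def rerank(query: str, results: List[Tuple[str, float]]) -> List[str]:
    """Sort (url, score) pairs by boosted score, best first."""
    ranked = sorted(results,
                    key=lambda r: boosted_score(query, r[0], r[1]),
                    reverse=True)
    return [url for url, _score in ranked]
```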
| wolfgang42 wrote:
| All 3 of these are on my list (plus understanding page
| sections, which would improve the results for things like
| the Python docs where there's a bunch of topics on a single
| page); I just haven't gotten round to writing the code yet.
| PaulHoule wrote:
| I have long wanted a programming search engine where I
| could pick something like "Python" and "version 3.9" and
| always get the right thing. Similarly there is
| documentation for software packages that are similarly
| versioned (e.g. "react-router 4" vs "react-router 5")
| FeepingCreature wrote:
| I don't know how I feel about this.
| djbusby wrote:
| How do you find ES for this kind of thing? Have you looked at
| others, e.g. Solr or even SQLite FTS?
| wolfgang42 wrote:
| I spent fifteen minutes on a search for "best full text search"
| and Elastic looked like the best combination of popular+easy.
| Since I was expecting this to be the diversion of an afternoon
| there wasn't any point in investigating more than that.
|
| In hindsight, ES wasn't the best choice for what this turned
| into: the problem is that it wants to be a managed cluster that
| does log analysis/analytics/observability/machine learning/I
| don't know what all and full text search is almost an
| afterthought; whereas I want a single node that does full text
| search and nothing else. All that extra complexity makes it
| hard for me to figure out how to get it to do what I want, and
| I don't have the time to invest in really understanding how it
| works under the hood. So I probably will switch to something
| simpler when I get a chance, so I can have a better chance of
| being able to figure out how to adjust it to make the results
| look the way I want.
| karmakaze wrote:
| This is great. Would be nice if it considered symbols better.
|
| Looked for "reject!" and it returned "reject" and "rejection"
| when "reject!" matches exist. Ironic given its name.
| topher515 wrote:
| Agreed! It seems like exactly searching for special characters
| could be the "killer app" that programmers need which would get
| them to leave Google.
| karmakaze wrote:
| Even Github search is so bad. Has it improved any lately?
| wolfgang42 wrote:
| Yeah, yet another reason for me to switch from ElasticSearch--I
| need a stemmer that understands symbols (and also can
| distinguish English from function names and not try to inflect
| the latter).
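|
| One way to keep symbols searchable, sketched in Python under some
| loud assumptions: tokens are identifier-like words that may end in
| "!" or "?", and both the exact form and the bare word are indexed
| so that "reject!" can match exactly while "reject" still matches
| too. The regex and the name index_terms are illustrative only, and
| no stemming is attempted.

```python
import re
from typing import List

# Identifier-like word, optionally ending in ! or ? (as in Ruby method
# names), or a plain run of digits.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*[!?]?|\d+")


def index_terms(text: str) -> List[str]:
    """Return the terms to index for a piece of text.

    A token with a trailing symbol is emitted twice: once verbatim and
    once without the symbol, so exact and loose queries both match.
    """
    terms = []
    for match in TOKEN.finditer(text):
        token = match.group(0).lower()
        terms.append(token)
        if token[-1] in "!?":
            terms.append(token[:-1])
    return terms
```

| A known weakness of this sketch: ordinary sentence punctuation
| ("Really?") is treated the same way as a method name, so an exact
| match on the symbol form should only be used as a ranking boost.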
| b34r wrote:
| Searched for "React" and the official docs weren't even in the
| top 10.
| Dig1t wrote:
| This is a cool idea! I would love to use this. I don't know how
| it works or if I'm using it right, but I tried this example:
| "swift ios upload picture multipart/form-data". Something I was
| just yesterday searching in Google.
|
| The results are not great, first 2 are links for crystal lang,
| something about Salesforce, general REST PUT, and the rest are
| other things not related to Swift or iOS. I would have expected
| results specifically related to iOS or Swift since those were the
| technologies I specified.
|
| How should I rephrase this query to end up landing at pages like
| this: https://stackoverflow.com/questions/29623187/upload-image-
| wi...
|
| Which is the page that Google took me to, and the one that solved
| my problem.
| wolfgang42 wrote:
| I have a page with some advice on writing searches
| (https://search.feep.dev/about/query), but I don't think you
| did anything wrong here: sometimes my search results are just
| inexplicably terrible. This definitely falls into that category
| and is going on my list of test cases that need improvement.
| There's a reason I link to Google at the bottom of the results
| page!
|
| I'm currently using ElasticSearch for ranking, and made a brief
| effort at tuning it. The problem is that it's very big and
| complicated, which makes it hard for me to understand what's
| going on under the hood. If I were doing this professionally
| I'd dive into ES internals and figure it out, but when I can
| only squeeze in a few hours a week it's hard to really sink my
| teeth in. I'd like to switch to something simpler to wrap my
| head around (possibly Lucene, or Bleve); once I've done that I
| should be able to get a better handle on how the ranking works
| and how to make it more reliable.
| pjot wrote:
| Elasticsearch is distributed Lucene, no?
| wolfgang42 wrote:
| Yes (well, plus a lot of other features); and it's the
| "distributed" part that gives me headaches. I don't need
| any of that stuff, since I'm running on a single node, and
| it means there's a bunch of abstractions between me and
| Lucene (which Elastic mostly tries to hide away as an
| implementation detail).
| O__________O wrote:
| Might be wrong, but the page they provided as an example
| correct result is not even in your index. Is that correct,
| and if so, why? If it is in your index, what is a query that
| would return it as a top ten result?
| wolfgang42 wrote:
| I can _see_ it in Kibana when I request it by ID, but I
| can't seem to get it via text search no matter what
| keywords I use, which is bizarre. ("NSMutableURLRequest
| image" should be pulling it up, but isn't.) I have no idea
| what's going on here, but thanks for bringing my attention
| to it!
|
| This sort of thing is part of the reason I want to move off
| of ES: it's a big black box and when something goes wrong I
| have no idea how to diagnose it. (I'm currently researching
| "unassigned shards" in case that's the problem, but for all
| I know that could be a red herring.) Something a lot
| simpler would be easier for me to hold in my head and
| easier to figure out when it goes wrong.
| ColonelPhantom wrote:
| Seems neat, but trying to find Django documentation is not ideal:
| searching for "django prefetch" has GitHub repos as the first two
| results, and the third result is official Django documentation
| but for an ancient version, something that annoys me about other
| search engines too.
|
| Out of curiosity, what kind of hardware are you running this on?
| I can imagine that you'd need a lot of storage to store the
| index, but the size of plain text can often be surprisingly
| small.
| wolfgang42 wrote:
| The problem is that newer versions have fewer links to them, so
| they seem less authoritative. I have a plan for some heuristics
| that will detect version numbers in URLs and collapse them into
| a single result with a version picker in case you want an older
| version.
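|
| A minimal sketch of such a heuristic, in Python. It is not the
| planned implementation: the rule "a path segment made only of short
| dotted numbers is a version" and the names split_version and
| collapse are assumptions, and the rule will misfire on sites that
| use small numeric IDs in their paths.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit

# A whole path segment such as "3", "3.9", or "v4.1".
VERSION_SEGMENT = re.compile(r"^v?(\d{1,3}(?:\.\d{1,3})*)$")


def split_version(url: str) -> Tuple[str, Optional[Tuple[int, ...]]]:
    """Return (grouping key, version) with the first version segment masked."""
    parts = urlsplit(url)
    version = None
    segments = []
    for segment in parts.path.split("/"):
        match = VERSION_SEGMENT.match(segment)
        if match and version is None:
            version = tuple(int(n) for n in match.group(1).split("."))
            segments.append("{version}")
        else:
            segments.append(segment)
    return parts.netloc + "/".join(segments), version


def collapse(urls: List[str]) -> Dict[str, List[str]]:
    """Group URLs that differ only by version, newest version first."""
    groups = defaultdict(list)
    for url in urls:
        key, version = split_version(url)
        groups[key].append((version or (), url))
    return {key: [url for _version, url in sorted(members, reverse=True)]
            for key, members in groups.items()}
```

| The first URL in each group would be shown as the result and the
| rest offered in the version picker; link counts for the whole group
| could also be pooled before ranking.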
|
| The server is an HP Microserver Gen8 (purchased on eBay), with
| an "Intel(R) Pentium(R) CPU G2020T @ 2.50GHz" and 16GB of RAM.
| The production index is 70GB, and I also have a 1TB spinning
| rust disk that I use for scratch space and raw data.
| throwawayacc4 wrote:
| For a loonnggg time I thought of developing something like this.
| I have an entire bookmark section of developer documentation, all
| of them with their special search and organization. If only there
| was one search (a good search engine) for all of them! Great
| work!
| justsomehnguy wrote:
| https://duckduckgo.com/?q=powershell+snipeit
|
| vs
|
| https://search.feep.dev/search?q=powershell+snipeit
| wolfgang42 wrote:
| Well, with a mere 30 million pages in my index it was
| inevitable I'd be missing something. I'd expect this to show up
| eventually as I add more data sources.
| IceWreck wrote:
| Great idea, but search is pretty bad right now.
|
| Searching for "django signals" got unofficial search results on
| the first page and all the links on the second page (1) are
| broken.
|
| Searching for "go cobra" gets no official docs at all.
|
| (1) https://search.feep.dev/search?q=django%20signal&p=2
|
| Some suggestions:
|
| - Prioritize github, gitlab, readthedocs, go.dev, docs.rust links
|
| - In github, only parse readme and wiki links. Avoid parsing
| links that are related to a specific commit hash.
|
| - Python, Rust docs have versions in the url. Can you link them
| to the latest version instead ?
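|
| The commit-hash suggestion could be approximated with a URL filter
| like the Python sketch below. It assumes GitHub's usual path layout
| (/owner/repo/kind/ref/...) and treats any 7-40 character lowercase
| hex ref as a commit hash, which is a guess: a branch whose name is
| all hex digits would be wrongly skipped. The function name is made
| up for illustration.

```python
import re
from urllib.parse import urlsplit

# Abbreviated or full commit hash: 7 to 40 lowercase hex digits.
HEX_SHA = re.compile(r"^[0-9a-f]{7,40}$")

# Path kinds whose next segment is a ref (branch, tag, or commit).
REF_KINDS = {"commit", "blob", "tree", "raw"}


def is_commit_specific(url: str) -> bool:
    """True if a github.com URL appears pinned to one commit hash."""
    parts = urlsplit(url)
    if parts.hostname != "github.com":
        return False
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 4:  # need owner, repo, kind, ref
        return False
    kind, ref = segments[2], segments[3]
    return kind in REF_KINDS and bool(HEX_SHA.match(ref))
```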
| PaulHoule wrote:
| One thing you get from reading TREC conference proceedings is
| that most of the things that you think will improve search
| relevance won't.
|
| People have almost forgotten how bad search indexes were before
| Google.
| wolfgang42 wrote:
| Thanks for checking it out!
|
| All those broken links in your "django signals" seem to have
| come from a page full of mangled URLs that got picked up on;
| unfortunately they've pushed the actual results all the way
| down to page 6! I definitely need to give a boost to official
| documentation.
|
| "golang cobra" gets what appears to be the official repo as the
| first result; but it's clearly not really getting what you're
| going for here. This is a good example of the sort of
| challenges a search engine faces: both "go" and "cobra" have
| multiple meanings, and it needs to understand the context to
| figure out whether a given link is relevant for this particular
| search. I think something like a vector search would be useful
| here but I haven't looked into setting something like that up
| yet.
|
| GitHub is on my list, but it's very big and is going to require
| careful optimization. (Even if I only load top-level READMEs
| it's still a ton of data.)
|
| ReadTheDocs would be great, but they don't seem to have any
| dump/download support, or even a list of all the documentation
| sites they host, so in lieu of that they're going to have to
| wait until I get a general web crawler.
|
| I have some heuristics to collapse multiple versions into a
| single result with a version picker, but they require some
| adjustments to the rest of my data processing pipeline which I
| haven't gotten round to yet.
| phpisatrash wrote:
| This is awesome. I really enjoyed the UI and the no javascript.
|
| Could I ask you a question? What is your tech stack (programming
| language, background worker, database)? How often does the index
| update?
|
| Are you planning to make it open source?
| wolfgang42 wrote:
| Always happy to answer questions! The code is mostly Node.js,
| with a lot of shell scripts to glue things together. The
| "background worker" is mostly me running things in tmux, though
| I do (ab)use GitLab CI for some scheduled tasks. The main full-
| text index is currently ElasticSearch (as I mention elsewhere
| in this thread, I'm not a fan of it); various other data in the
| ingestion process is stored in a combination of JSON-Lines files,
| SQLite, and bespoke binary formats as needed. Because I'm
| squeezing this into the hardware I have, the details are
| generally dictated by performance constraints for the
| particular problem at hand.
|
| Update frequency depends on the data source, details here:
| https://search.feep.dev/about/datasources
|
| No plans to open-source it at the moment; that implies a level
| of stewardship that I don't have the energy for at the moment,
| and also some of the code is kind of tied to my specific server
| right now.
| Y_Y wrote:
| Do you think you could eventually reproduce the glory of
| mid-2000s Google? At least for some large predefined subset of
| the internet?
| culi wrote:
| I think the biggest problem with this is not necessarily
| Google's algorithm changing, but the internet changing. Sites
| evolved to produce SEO spam for higher rankings and Google's
| search, as bad as it is now, would probably be even worse if it
| stayed stagnant and didn't evolve in response
|
| The "predefined subset of the internet" part can definitely be
| a solution but the preceding "large" is probably where the
| challenge remains. However, projects like Looria[0] give me
| hope for a more curated search experience (i.e. without the
| "large" adjective)
|
| [0] https://www.looria.com/
| wolfgang42 wrote:
| I don't really remember Google of that era; I got on the
| Internet pretty late. But I do have high hopes for the recent
| rise we've seen in smaller, targeted search engines; a lot of
| the Google-scale problems of "making a search engine" go away
| when you focus on a small corner of the Web:
|
| - the tech has reached a point where it's actually pretty
| reasonable for someone to index a fairly large chunk of it
| themselves: https://search.feep.dev/blog/post/2022-07-23-write-
| your-own
|
| - benefits of diversification: if one search engine isn't
| helpful, you can try another instead of just being out of luck;
| and spammers now have to game a bunch of different algorithms
| rather than being able to target just one.
|
| - having just one person, or a small group, focuses the
| results, and can hopefully produce a higher level of polish in
| a targeted area.
| crosser wrote:
| I would be very happy if such service worked (or if I could run
| it myself). It's my long term goal to break out of dependency on
| the Borg.
|
| But the results are not even promising, let alone useful, which
| is very sad.
|
| (I tried "haskell gloss terminate animation normally". That was
| my real search a couple of days ago.)
| wolfgang42 wrote:
| Result quality is something of a gamble right now: sometimes
| the results are really excellent, but as you've found they can
| also be pretty useless. I'm planning to use all the searches
| I'm getting today to construct a benchmark I can use to improve
| things.
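|
| One common way to turn logged searches into a benchmark is mean
| reciprocal rank (MRR): for each test query, score 1/rank of the
| first result known to be good, then average. The Python sketch
| below assumes each test case is a (query, set of acceptable URLs)
| pair and that `search` is any function returning a ranked URL list;
| both are assumptions, not a description of the actual plan.

```python
from typing import Callable, Iterable, List, Set, Tuple


def reciprocal_rank(results: Iterable[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none appears."""
    for rank, url in enumerate(results, start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: List[Tuple[str, Set[str]]],
                         search: Callable[[str], List[str]]) -> float:
    """Average reciprocal rank over all (query, relevant URLs) cases."""
    if not cases:
        return 0.0
    total = sum(reciprocal_rank(search(query), relevant)
                for query, relevant in cases)
    return total / len(cases)
```

| Re-running the same cases after each ranking change gives a single
| number to compare, which makes regressions easier to spot than
| eyeballing individual result pages.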
|
| On that note: what were you hoping to get out of that search? I
| see that Gloss is a package for doing animations, but (without
| knowing anything about Haskell) it seems like Google/DuckDuckGo
| don't really have anything useful to offer either. (In fact the
| only thing I found was what I assume is your post on the Gloss
| mailing list: https://groups.google.com/g/haskell-
| gloss/c/FGNxutKmm-w)
| marginalia_nu wrote:
| I think it looks untuned rather than somehow broken.
|
| Fine tuning result relevance is a pretty long and tedious
| process, and small problems with this can make results look
| very bad.
| culi wrote:
| Love to see more independent indexes! Sometimes it seems like
| there are plenty of search engines, but when you group them by the
| indexes they rely on there are actually very few major ones:
|
| - Google, StartPage
|
| - Bing, DuckDuckGo, Ecosia, AOL, Yahoo
|
| - Yandex (mainly Russian)
|
| - Brave (recently started its own index but often falls back on
| Google's)
|
| Love to see projects like Marginalia and now this. These projects
| also make meta search engines like Searx[0] that much more
| powerful.
|
| Anyways since I'm in the business of listing out relevant
| projects, other code-centered search engines you might wanna
| check out are searchcode.com[1], codesearch.ai[2],
| symbolhound[3], and publicwww.com[4] (some of these are often
| down, but might still be good to learn from)
|
| [0] https://searx.tuxcloud.net/
|
| [1] https://searchcode.com/
|
| [2] https://codesearch.ai/
|
| [3] http://symbolhound.com/
|
| [4] https://publicwww.com/
| throwup wrote:
| To that first list you could add Kagi, who also runs their own
| index
|
| EDIT: Tough crowd, did Kagi get cancelled or something while I
| wasn't looking?
| marginalia_nu wrote:
| Kagi is to the best of my awareness mostly doing magic with
| Google results.
| teddyh wrote:
| More:
|
| * http://codesearch.debian.net/
|
| * https://codesearch.isocpp.org/
|
| * https://www.programcreek.com/python/
|
| * https://livegrep.com/search/linux
|
| * https://grep.app/
___________________________________________________________________
(page generated 2022-11-06 23:00 UTC)