[HN Gopher] Show HN: Feep! search, an independent search engine ...
       ___________________________________________________________________
        
       Show HN: Feep! search, an independent search engine for programmers
        
       Hi HN! This started late last year as an afternoon project to play
       around with ElasticSearch, and then I kept thinking of new features
       I wanted to add. I still have a lot of things I want to build, but
       now seemed like a good time to put it out there: even if the
       results aren't nearly the quality I'd like, I've still found it
       useful and I want to show it off!  I've been working on it since
       September 2021, but only in fits and starts. The entire thing runs
       on a computer in my living room (there's a picture on the About
       page); I haven't done any load testing so we'll see how it holds
       up.
        
       Author : wolfgang42
       Score  : 91 points
       Date   : 2022-11-06 16:25 UTC (6 hours ago)
        
 (HTM) web link (search.feep.dev)
 (TXT) w3m dump (search.feep.dev)
        
       | brokenkebab2 wrote:
        | Frankly, it doesn't look like it's ready to be useful. As an
        | example, I tried "Braze notifications" and the first result was
        | about Brave, then two mildly relevant ones, and then a long
        | stretch of "Who's hiring?" threads from HN which seem to mention
        | only notifications.
        
         | wolfgang42 wrote:
         | The title mentioning "Brave" seems to be a red herring: there's
         | someone in the comments talking about Braze, though it looks
         | like a typo. Similarly the "Who's hiring" posts do actually
         | have job listings for Braze, but you have to click through the
         | More link at the bottom to find them. (Because I load HN
         | directly from a data dump, the search doesn't know about the
         | pagination.)
         | 
         | I think the main problem here is that my index is relatively
         | small: it has only (!) 30 million pages, and it looks like
         | Braze just isn't popular enough for me to have run into it with
         | the right keywords yet.
        
       | pcl wrote:
       | This looks great!
       | 
       | MDN docs are pretty strong. Perhaps devdocs is a superset, but if
       | not, I'd recommend indexing them as well.
       | 
       | Also, feature request: it'd be nice if the query help unfolded
       | the instructions in-line with the current page instead of
       | navigating to another page. That way, I would be able to see them
       | while mucking with my query.
        
         | wolfgang42 wrote:
         | Glad you like it! Devdocs includes Mozilla's docs, for JS, CSS,
         | DOM, SVG, and others. (But my ranking algorithm doesn't
         | understand that "mdn" is a synonym for "developers.mozilla.org"
         | so it's hard to surface them explicitly.)
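          | 
          | For the synonym problem, the fix I have in mind is a small
          | alias table applied at query time; very roughly something
          | like this (an untested sketch, not the real code, and the
          | names are made up):
          | 
          |   // Map well-known nicknames to the domain they refer to,
          |   // so the ranking can prefer results from that site.
          |   const SITE_ALIASES: Record<string, string> = {
          |     mdn: 'developer.mozilla.org',
          |     so: 'stackoverflow.com',
          |   };
          |   
          |   function preferredDomain(query: string): string | undefined {
          |     for (const word of query.toLowerCase().split(/\s+/)) {
          |       const domain = SITE_ALIASES[word];
          |       if (domain) return domain;
          |     }
          |     return undefined;
          |   }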
         | 
         | Thanks for the feature request--I don't have any frontend JS
         | set up yet that I could easily add this to, but I can see how
         | this could be useful and I'll put it on my list.
        
           | pcl wrote:
           | Personally, I'd rather click a link to MDN docs than most
           | other sources, so if you had some way to expose pass-through
           | attribution from devdocs, that'd be useful at least for me.
           | 
            | I wonder how many engineers think about where a search
            | result links to before clicking through.
        
             | wolfgang42 wrote:
             | Although I'm sourcing the crawl data from devdocs, the
             | ingestion process uses the upstream URL, so the search
             | results link to developer.mozilla.org with the appropriate
             | favicon.
             | 
             | I've heard enough complaints about W3Schools and other SEO-
             | heavy but accuracy-light sources that I suspect a fair
             | proportion of technically-minded users probably consider
             | the domain before clicking on a result link.
        
               | pcl wrote:
               | Yeah w3schools is the very reason I avoid Google search
               | for code questions.
        
       | culi wrote:
       | Other sources you may wanna crawl
       | 
       | - https://www.thecodingforums.com/ (and other programming-related
       | online communities like Lobste.rs, certain subreddits, and
       | certain Lemmy instances/communities)
       | 
       | - https://pldb.com/ (might be a good way to automatically get all
       | the docs of each programming language as well as
       | books/videos/publications that mention a certain language)
        
       | mdaniel wrote:
       | Congratulations, I always enjoy new search engines
       | 
       | W.r.t. "and updated intermittently," I wanted to draw your
       | attention to the HN realtime API:
       | https://github.com/HackerNews/API#live-data and also that S.O.
        | offers Atom Feeds: https://stackoverflow.com/feeds/ _(I'd guess
       | the rest do, too, but I didn't verify)_
       | 
       | I am a huge proponent of taking advantage of any update features
       | that a site offers, because otherwise the "how about now?" of re-
       | crawling is wasteful to both parties.
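        | 
        | To sketch what I mean (untested, and assuming Node's built-in
        | fetch; not anyone's real code):
        | 
        |   const HN = 'https://hacker-news.firebaseio.com/v0';
        |   
        |   // updates.json lists recently-changed item IDs, so you can
        |   // re-fetch just those instead of asking "how about now?"
        |   // for the whole site.
        |   async function changedItemIds(): Promise<number[]> {
        |     const res = await fetch(`${HN}/updates.json`);
        |     const body = (await res.json()) as { items: number[] };
        |     return body.items;
        |   }
        |   
        |   async function fetchItem(id: number) {
        |     const res = await fetch(`${HN}/item/${id}.json`);
        |     return res.json(); // title, text, kids, score, ...
        |   }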
        
         | wolfgang42 wrote:
         | My current architecture is built around batch ingestion[1], and
         | doesn't (yet) have a way to do incremental updates. This is
         | great for getting coverage--there are a lot of long-tail
         | results in my search engine but not in Google!--but it does
         | mean there's more lag and the results aren't instantly up-to-
         | date.
         | 
         | [1]: e.g. for StackOverflow, I download an XML dump of the
         | entire site once a quarter:
         | https://search.feep.dev/blog/post/2021-09-04-stackexchange
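          | 
          | To give a flavor of that batch step (a simplified sketch, not
          | the actual ingestion code): the dump's Posts.xml puts one <row
          | .../> element on each line, so it can be streamed without a
          | full XML parser.
          | 
          |   import { createReadStream } from 'node:fs';
          |   import { createInterface } from 'node:readline';
          |   
          |   // Yield the attributes of each <row/> in Posts.xml.
          |   // (Values are still XML-escaped at this point.)
          |   async function* rows(path: string) {
          |     const rl = createInterface({
          |       input: createReadStream(path),
          |     });
          |     for await (const line of rl) {
          |       const m = line.match(/^\s*<row\s+(.*)\/>\s*$/);
          |       if (!m) continue;
          |       const attrs: Record<string, string> = {};
          |       for (const [, k, v] of m[1].matchAll(/(\w+)="([^"]*)"/g)) {
          |         attrs[k] = v;
          |       }
          |       yield attrs; // Id, PostTypeId, Title, Body, Tags, ...
          |     }
          |   }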
        
       | aliqot wrote:
       | Hey this is pretty cool. I noticed a lot of the results are HN
       | posts and StackExchange. Is there a definitive list somewhere of
       | the sources, and/or maybe a way to contribute to those?
       | 
       | Thanks for showing this to us, I like where your head's at!
       | 
       | Edit: found it, it was explained on the front page.
       | https://search.feep.dev/about/datasources
        
         | wolfgang42 wrote:
         | I can pretty easily add any StackExchange sites I left out, or
         | anything that comes as a Zim file (e.g. from
         | https://farm.openzim.org/recipes or the like). If it'd be
         | appropriate for https://devdocs.io (official docs with suitable
         | licensing), you can contribute a crawler to them and it'll flow
         | downstream to me.
         | 
         | I also have plans to do proper web crawling, though it'll take
         | me a while to get there:
         | https://search.feep.dev/blog/post/2022-08-10-crawling-roadma...
        
       | bragr wrote:
       | Is there a particular reason this site does not index the
       | official documentation for languages and frameworks? Tried a
       | couple different searches for the things I work on and mostly got
       | HN and stack overflow posts that aren't really responsive to my
       | query.
       | 
       | https://search.feep.dev/about/datasources
       | 
        | Edit: it does appear that devdocs.io has the docs I'm
       | interested in, but they don't appear to be surfaced in the first
       | several pages of results at least. A good example for this is
       | searching "python datetime" which does not actually return links
       | to datetime docs, just a lot of HN and SO posts referencing
       | datetime.
        
         | wolfgang42 wrote:
         | "python library datetime" gets the results you're looking for--
         | but mostly by virtue of this not being a way anyone ordinarily
         | thinks of describing it, so it knocks the irrelevant results
         | down in the rankings. I think there are a couple of things
         | going on here:
         | 
         | - The ranking algorithm I'm using isn't great at distinguishing
         | pages _about_ a topic from pages which merely _mention_ a topic
         | in passing.
         | 
         | - Because the Python docs are versioned, the PageRank they
         | deserve is spread out over several URLs and they appear less
         | relevant than they really are.
         | 
         | I have plans to fix both of these problems, but they're pretty
         | involved and I haven't had the time to dig into the matter yet.
         | For the moment, it's definitely a gamble whether the results
         | will be any good: sometimes they're great, and other times
         | they're completely useless. (There's a reason I put links to
         | other search engines at the bottom of the results page!)
        
           | Kwpolska wrote:
           | For a programming search engine, the official docs of
           | languages should get special treatment. Google often surfaces
           | outdated versions of the documentation, but they're usually
           | at the top. If you want to improve on this, you should (a)
           | rank official docs the highest, (b) give extra weight to
           | docs.python.org if the query contains "python", (c) merge the
           | same page for different versions and add a version picker.
        
             | wolfgang42 wrote:
             | All 3 of these are on my list (plus understanding page
             | sections, which would improve the results for things like
             | the Python docs where there's a bunch of topics on a single
             | page); I just haven't gotten round to writing the code yet.
        
             | PaulHoule wrote:
             | I have long wanted a programming search engine where I
             | could pick something like "Python" and "version 3.9" and
              | always get the right thing. Similarly, there is
              | documentation for software packages that is versioned the
              | same way (e.g. "react-router 4" vs "react-router 5").
        
       | FeepingCreature wrote:
       | I don't know how I feel about this.
        
       | djbusby wrote:
        | How do you find ES for this kind of thing? Have you looked at
        | others, e.g. Solr or even SQLite FTS?
        
         | wolfgang42 wrote:
         | I spent fifteen minutes on a search for "best full text search"
         | and Elastic looked like the best combination of popular+easy.
         | Since I was expecting this to be the diversion of an afternoon
         | there wasn't any point in investigating more than that.
         | 
         | In hindsight, ES wasn't the best choice for what this turned
         | into: the problem is that it wants to be a managed cluster that
         | does log analysis/analytics/observability/machine learning/I
         | don't know what all and full text search is almost an
         | afterthought; whereas I want a single node that does full text
         | search and nothing else. All that extra complexity makes it
         | hard for me to figure out how to get it to do what I want, and
         | I don't have the time to invest in really understanding how it
          | works under the hood. So I'll probably switch to something
          | simpler when I get a chance, which should make it easier to
          | figure out how to adjust the results to look the way I want.
        
       | karmakaze wrote:
       | This is great. Would be nice if it considered symbols better.
       | 
       | Looked for "reject!" and it returned "reject" and "rejection"
       | when "reject!" matches exist. Ironic given its name.
        
         | topher515 wrote:
          | Agreed! It seems like exact search for special characters
          | could be the "killer app" that programmers need to get them
          | to leave Google.
        
           | karmakaze wrote:
           | Even Github search is so bad. Has it improved any lately?
        
         | wolfgang42 wrote:
         | Yeah, yet another reason for me to switch from ElasticSearch--I
         | need a stemmer that understands symbols (and also can
         | distinguish English from function names and not try to inflect
         | the latter).
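          | 
          | Roughly the behavior I want (sketch only, nothing like the
          | real implementation): treat anything with identifier-ish
          | punctuation as an exact token, and only stem plain words.
          | 
          |   // "reject!" and "snake_case" are kept verbatim; ordinary
          |   // English words go through whatever stemmer is in use.
          |   const IDENTIFIER = /^[A-Za-z_][\w.]*[!?=]?$/;
          |   
          |   function tokenize(
          |     text: string,
          |     stem: (word: string) => string,
          |   ): string[] {
          |     return text
          |       .split(/\s+/)
          |       .filter(Boolean)
          |       .map((t) =>
          |         IDENTIFIER.test(t) && /[!?_=.]/.test(t)
          |           ? t
          |           : stem(t.toLowerCase()),
          |       );
          |   }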
        
       | b34r wrote:
       | Searched for "React" and the official docs weren't even in the
       | top 10.
        
       | Dig1t wrote:
       | This is a cool idea! I would love to use this. I don't know how
       | it works or if I'm using it right, but I tried this example:
       | "swift ios upload picture multipart/form-data". Something I was
       | just yesterday searching in Google.
       | 
        | The results are not great: the first two are links about the
        | Crystal language, then something about Salesforce, a general
        | REST PUT question, and the rest are other things not related to
        | Swift or iOS. I would have expected results specifically related
        | to iOS or Swift, since those were the technologies I specified.
       | 
       | How should I rephrase this query to end up landing at pages like
       | this: https://stackoverflow.com/questions/29623187/upload-image-
       | wi...
       | 
       | Which is the page that Google took me to, and the one that solved
       | my problem.
        
         | wolfgang42 wrote:
         | I have a page with some advice on writing searches
         | (https://search.feep.dev/about/query), but I don't think you
         | did anything wrong here: sometimes my search results are just
         | inexplicably terrible. This definitely falls into that category
         | and is going on my list of test cases that need improvement.
         | There's a reason I link to Google at the bottom of the results
         | page!
         | 
         | I'm currently using ElasticSearch for ranking, and made a brief
         | effort at tuning it. The problem is that it's very big and
         | complicated, which makes it hard for me to understand what's
         | going on under the hood. If I were doing this professionally
         | I'd dive into ES internals and figure it out, but when I can
         | only squeeze in a few hours a week it's hard to really sink my
         | teeth in. I'd like to switch to something simpler to wrap my
         | head around (possibly Lucene, or Bleve); once I've done that I
         | should be able to get a better handle on how the ranking works
         | and how to make it more reliable.
        
           | pjot wrote:
           | Elasticsearch is distributed Lucene, no?
        
             | wolfgang42 wrote:
             | Yes (well, plus a lot of other features); and it's the
             | "distributed" part that gives me headaches. I don't need
             | any of that stuff, since I'm running on a single node, and
             | it means there's a bunch of abstractions between me and
             | Lucene (which Elastic mostly tries to hide away as an
             | implementation detail).
        
           | O__________O wrote:
            | Might be wrong, but the page they provided as an example of
            | a correct result is not even in your index. Is that correct,
           | and if so, why? If it is in your index, what is a query that
           | would return it as a top ten result?
        
             | wolfgang42 wrote:
             | I can _see_ it in Kibana when I request it by ID, but I
             | can't seem to get it via text search no matter what
             | keywords I use, which is bizarre. ("NSMutableURLRequest
             | image" should be pulling it up, but isn't.) I have no idea
             | what's going on here, but thanks for bringing my attention
             | to it!
             | 
             | This sort of thing is part of the reason I want to move off
             | of ES: it's a big black box and when something goes wrong I
             | have no idea how to diagnose it. (I'm currently researching
             | "unassigned shards" in case that's the problem, but for all
             | I know that could be a red herring.) Something a lot
             | simpler would be easier for me to hold in my head and
             | easier to figure out when it goes wrong.
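              | 
              | (If it helps anyone else poking at a similar setup,
              | this is roughly how I'm checking; a sketch using the
              | JS client:)
              | 
              |   import { Client } from '@elastic/elasticsearch';
              |   
              |   const es = new Client({
              |     node: 'http://localhost:9200',
              |   });
              |   // Overall cluster status (red/yellow/green) and
              |   // per-shard state, to spot unassigned shards.
              |   console.log(await es.cluster.health());
              |   console.log(await es.cat.shards({ format: 'json' }));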
        
       | ColonelPhantom wrote:
       | Seems neat, but trying to find Django documentation is not ideal:
       | searching for "django prefetch" has GitHub repos as the first two
       | results, and the third result is official Django documentation
       | but for an ancient version, something that annoys me about other
       | search engines too.
       | 
        | Out of curiosity, what kind of hardware are you running this on?
       | I can imagine that you'd need a lot of storage to store the
       | index, but the size of plain text can often be surprisingly
       | small.
        
         | wolfgang42 wrote:
         | The problem is that newer versions have fewer links to them, so
         | they seem less authoritative. I have a plan for some heuristics
         | that will detect version numbers in URLs and collapse them into
         | a single result with a version picker in case you want an older
         | version.
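          | 
          | As a rough sketch of that heuristic (made up for illustration,
          | not what will actually ship):
          | 
          |   // Collapse /3.9/, /v4/, /1.2.3/ style path segments so
          |   // every version of a page shares one canonical key.
          |   const VERSION_SEGMENT = /\/v?\d+(\.\d+)*(?=\/)/g;
          |   
          |   function canonicalKey(url: string): string {
          |     return url.replace(VERSION_SEGMENT, '/*');
          |   }
          |   
          |   // canonicalKey('/3.9/library/datetime.html')
          |   //   -> '/*/library/datetime.html'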
         | 
         | The server is an HP Microserver Gen8 (purchased on eBay), with
         | an "Intel(R) Pentium(R) CPU G2020T @ 2.50GHz" and 16GB of RAM.
         | The production index is 70GB, and I also have a 1TB spinning
         | rust disk that I use for scratch space and raw data.
        
       | throwawayacc4 wrote:
       | For a loonnggg time I thought of developing something like this.
        | I have an entire bookmark section of developer documentation,
        | each site with its own search and organization. If only there
        | were one search (a good search engine) for all of them! Great
       | work!
        
       | justsomehnguy wrote:
       | https://duckduckgo.com/?q=powershell+snipeit
       | 
       | vs
       | 
       | https://search.feep.dev/search?q=powershell+snipeit
        
         | wolfgang42 wrote:
         | Well, with a mere 30 million pages in my index it was
         | inevitable I'd be missing something. I'd expect this to show up
         | eventually as I add more data sources.
        
       | IceWreck wrote:
       | Great idea, but search is pretty bad right now.
       | 
       | Searching for "django signals" got unofficial search results on
       | the first page and all the links on the second page (1) are
       | broken.
       | 
       | Searching for "go cobra" gets no official docs at all.
       | 
       | (1) https://search.feep.dev/search?q=django%20signal&p=2
       | 
       | Some suggestions:
       | 
       | - Prioritize github, gitlab, readthedocs, go.dev, docs.rust links
       | 
       | - In github, only parse readme and wiki links. Avoid parsing
       | links that are related to a specific commit hash.
       | 
        | - Python, Rust docs have versions in the URL. Can you link them
        | to the latest version instead?
        
         | PaulHoule wrote:
         | One thing you get from reading TREC conference proceedings is
         | that most of the things that you think will improve search
         | relevance won't.
         | 
         | People have almost forgotten how bad search indexes were before
         | Google.
        
         | wolfgang42 wrote:
         | Thanks for checking it out!
         | 
          | All those broken links in your "django signals" results seem to
          | have come from a page full of mangled URLs that got picked up;
         | unfortunately they've pushed the actual results all the way
         | down to page 6! I definitely need to give a boost to official
         | documentation.
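          | 
          | (The shape of that boost is probably something like the
          | following; sketch only, and the list and weights here are
          | invented:)
          | 
          |   // Multiply the base relevance score for domains known to
          |   // host official documentation.
          |   const OFFICIAL_DOCS = new Map<string, number>([
          |     ['docs.python.org', 2.0],
          |     ['docs.djangoproject.com', 2.0],
          |     ['developer.mozilla.org', 1.8],
          |   ]);
          |   
          |   function boosted(score: number, url: URL): number {
          |     return score * (OFFICIAL_DOCS.get(url.hostname) ?? 1);
          |   }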
         | 
         | "golang cobra" gets what appears to be the official repo as the
         | first result; but it's clearly not really getting what you're
         | going for here. This is a good example of the sort of
         | challenges a search engine faces: both "go" and "cobra" have
         | multiple meanings, and it needs to understand the context to
         | figure out whether a given link is relevant for this particular
         | search. I think something like a vector search would be useful
         | here but I haven't looked into setting something like that up
         | yet.
         | 
         | GitHub is on my list, but it's very big and is going to require
         | careful optimization. (Even if I only load top-level READMEs
         | it's still a ton of data.)
         | 
         | ReadTheDocs would be great, but they don't seem to have any
         | dump/download support, or even a list of all the documentation
          | sites they host, so they're going to have to wait until I get a
          | general web crawler.
         | 
          | I have some heuristics to collapse multiple versions into a
          | single result with a version picker, but they require some
         | adjustments to the rest of my data processing pipeline which I
         | haven't gotten round to yet.
        
       | [deleted]
        
       | phpisatrash wrote:
        | This is awesome. I really enjoyed the UI and the lack of
        | JavaScript.
        | 
        | Could I ask you a question? What is your tech stack (programming
        | language, background worker, database)? How often does the index
        | update?
       | 
       | Are you planning to make it open source?
        
         | wolfgang42 wrote:
         | Always happy to answer questions! The code is mostly Node.js,
         | with a lot of shell scripts to glue things together. The
         | "background worker" is mostly me running things in tmux, though
         | I do (ab)use GitLab CI for some scheduled tasks. The main full-
         | text index is currently ElasticSearch (as I mention elsewhere
         | in this thread, I'm not a fan of it); various other data in the
          | ingestion process is stored in a combination of JSON-Lines
          | files, SQLite, and bespoke binary formats as needed. Because I'm
         | squeezing this into the hardware I have, the details are
         | generally dictated by performance constraints for the
         | particular problem at hand.
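          | 
          | For a taste of what that glue looks like, here's the general
          | shape (illustrative only, not a real piece of the pipeline,
          | and better-sqlite3 is just a stand-in here):
          | 
          |   import { createReadStream } from 'node:fs';
          |   import { createInterface } from 'node:readline';
          |   import Database from 'better-sqlite3';
          |   
          |   // Load a JSON-Lines extract into a SQLite staging table.
          |   const db = new Database('staging.db');
          |   db.exec(
          |     'CREATE TABLE IF NOT EXISTS docs (url TEXT, title TEXT)',
          |   );
          |   const insert = db.prepare('INSERT INTO docs VALUES (?, ?)');
          |   
          |   const rl = createInterface({
          |     input: createReadStream('docs.jsonl'),
          |   });
          |   for await (const line of rl) {
          |     const doc = JSON.parse(line);
          |     insert.run(doc.url, doc.title);
          |   }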
         | 
         | Update frequency depends on the data source, details here:
         | https://search.feep.dev/about/datasources
         | 
          | No plans to open-source it at the moment; that implies a level
          | of stewardship that I don't currently have the energy for,
         | and also some of the code is kind of tied to my specific server
         | right now.
        
       | Y_Y wrote:
       | Do you think you could eventually reproduce the glory of
       | mid-2000s Google? At least for some large predefined subset of
       | the internet?
        
         | culi wrote:
         | I think the biggest problem with this is not necessarily
         | Google's algorithm changing, but the internet changing. Sites
         | evolved to produce SEO spam for higher rankings and Google's
         | search, as bad as it is now, would probably be even worse if it
          | stayed stagnant and didn't evolve in response.
         | 
         | The "predefined subset of the internet" part can definitely be
         | a solution but the preceding "large" is probably where the
         | challenge remains. However, projects like Looria[0] give me
         | hope for a more curated search experience (i.e. without the
         | "large" adjective)
         | 
         | [0] https://www.looria.com/
        
         | wolfgang42 wrote:
         | I don't really remember Google of that era; I got on the
         | Internet pretty late. But I do have high hopes for the recent
         | rise we've seen in smaller, targeted search engines; a lot of
         | the Google-scale problems of "making a search engine" go away
         | when you focus on a small corner of the Web:
         | 
         | - the tech has reached a point where it's actually pretty
         | reasonable for someone to index a fairly large chunk of it
         | themselves: https://search.feep.dev/blog/post/2022-07-23-write-
         | your-own
         | 
         | - benefits of diversification: if one search engine isn't
         | helpful, you can try another instead of just being out of luck;
         | and spammers now have to game a bunch of different algorithms
         | rather than being able to target just one.
         | 
         | - having just one person, or a small group, focuses the
         | results, and can hopefully produce a higher level of polish in
         | a targeted area.
        
       | crosser wrote:
        | I would be very happy if such a service worked (or if I could
        | run it myself). It's my long-term goal to break out of
        | dependency on the Borg.
       | 
       | But the results are not even promising, let alone useful, which
       | is very sad.
       | 
       | (I tried "haskell gloss terminate animation normally". That was
       | my real search a couple of days ago.)
        
         | wolfgang42 wrote:
         | Result quality is something of a gamble right now: sometimes
         | the results are really excellent, but as you've found they can
         | also be pretty useless. I'm planning to use all the searches
         | I'm getting today to construct a benchmark I can use to improve
         | things.
         | 
         | On that note: what were you hoping to get out of that search? I
         | see that Gloss is a package for doing animations, but (without
         | knowing anything about Haskell) it seems like Google/DuckDuckGo
         | don't really have anything useful to offer either. (In fact the
         | only thing I found was what I assume is your post on the Gloss
         | mailing list: https://groups.google.com/g/haskell-
         | gloss/c/FGNxutKmm-w)
        
           | marginalia_nu wrote:
           | I think it looks untuned rather than somehow broken.
           | 
           | Fine tuning result relevance is a pretty long and tedious
           | process, and small problems with this can make results look
           | very bad.
        
       | culi wrote:
        | Love to see more independent indexes! Sometimes it seems like
        | there are plenty of search engines, but when you group them by
        | the indexes they rely on, there are actually very few major ones:
       | 
       | - Google, StartPage
       | 
       | - Bing, DuckDuckGo, Ecosia, AOL, Yahoo
       | 
       | - Yandex (mainly Russian)
       | 
       | - Brave (recently started its own index but often falls back on
       | Google's)
       | 
       | Love to see projects like Marginalia and now this. These projects
       | also make meta search engines like Searx[0] that much more
       | powerful.
       | 
       | Anyways since I'm in the business of listing out relevant
       | projects, other code-centered search engines you might wanna
       | check out are searchcode.com[1], codesearch.ai[2],
       | symbolhound[3], and publicwww.com[4] (some of these are often
        | down, but might still be good to learn from).
       | 
       | [0] https://searx.tuxcloud.net/
       | 
       | [1] https://searchcode.com/
       | 
       | [2] https://codesearch.ai/
       | 
       | [3] http://symbolhound.com/
       | 
       | [4] https://publicwww.com/
        
         | throwup wrote:
         | To that first list you could add Kagi, who also runs their own
         | index
         | 
         | EDIT: Tough crowd, did Kagi get cancelled or something while I
         | wasn't looking?
        
           | marginalia_nu wrote:
           | Kagi is to the best of my awareness mostly doing magic with
           | Google results.
        
         | teddyh wrote:
         | More:
         | 
         | * http://codesearch.debian.net/
         | 
         | * https://codesearch.isocpp.org/
         | 
         | * https://www.programcreek.com/python/
         | 
         | * https://livegrep.com/search/linux
         | 
         | * https://grep.app/
        
       ___________________________________________________________________
       (page generated 2022-11-06 23:00 UTC)