[HN Gopher] SearchHut
       ___________________________________________________________________
        
       SearchHut
        
       Author : tsujp
       Score  : 362 points
       Date   : 2022-07-15 04:27 UTC (18 hours ago)
        
 (HTM) web link (searchhut.org)
 (TXT) w3m dump (searchhut.org)
        
       | la64710 wrote:
        | Many links return 404 Not Found.
        
         | xigoi wrote:
          | It wasn't supposed to be released yet; OP accidentally shared
          | it here because of a misunderstanding.
        
       | tsujp wrote:
        | Hey everyone, I accidentally shared this too early.
       | 
       | I misinterpreted a don't-share-yet announcement to mean the
       | announcement post only and not the entire software and
       | announcement. I don't mean this as an excuse; that's just the
       | context.
       | 
        | So this is out before Drew et al. intended it to be, hence some
        | 404s and so forth, as commented by Drew in this very thread:
       | https://news.ycombinator.com/item?id=32105407
       | 
       | I let my excitement get the better of me this time and I hope
       | people revisit SearchHut in a week or so after these quirks are
       | resolved.
        
         | skrebbel wrote:
          | With a bit of collaboration I bet we can flag it off the
          | front page in no time.
         | 
         | You (or Drew or someone else) can resubmit it later with a
          | bogus query string to skip HN's dupe checker.
        
           | ddevault wrote:
           | Cat's out of the bag now.
        
             | skrebbel wrote:
             | Ok, unflagged!
        
         | tgv wrote:
          | It's well communicated, so no harm done. And it gives an idea
          | of what kind of "curiosity hit" you can expect when announcing
          | it for real.
        
       | yellow_lead wrote:
       | https://searchhut.org/search?q=toilet+cleaner
       | 
       | > Secaucus, New Jersey
       | 
       | Lol
        
       | newbieuser wrote:
       | What does it promise as an alternative to other search engines?
        
       | mosfets wrote:
       | good
        
       | wrycoder wrote:
        | DeVault is doing some beautiful things over at his Sourcehut.
       | 
       | I watched him crank out a prototype of this in Go in about three
       | hours.
       | 
       | It really feels like the old days at sr.ht. It's fun again.
        
         | kungfufrog wrote:
         | Watched as in streamed? Or refreshed the repo page?
         | 
         | How can I watch?
        
       | saucepines wrote:
       | Pretty much just a worse version of Wikipedia's search at this
       | point.
        
       | synu wrote:
       | The about page has some good general info and links:
       | https://searchhut.org/about
        
         | flobosg wrote:
         | https://searchhut.org/about/domains returns 404, by the way.
        
           | eesmith wrote:
           | Lots of broken links from https://searchhut.org/about .
        
             | ISL wrote:
             | "Notice! This product is experimental and incomplete. User
             | beware!"
             | 
             | :).
        
       | the_biot wrote:
       | I think the idea of federation of domain-specific search engines,
       | possibly tied together by one or more front-ends, is a brilliant
       | idea.
       | 
       | I think it's similar to how Google's search works internally,
       | though I doubt the separation is based on a list of domain (as in
       | DNS) names. IIRC they have a set of search modules, and what they
       | return (and how fast they return it) all gets mixed in to the
       | search results according to some weighting. Right below the ads.
       | 
        | If you look at a search system that way, it's easy enough to
        | add modules that do things like search only Wikipedia and
        | display those results in a separate box (like DDG), or parse
        | out currency conversion requests and display those up top based
        | on some API (like Google), etc.
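        | 
        | A sketch of the shape that fan-out might take, in Go --
        | hypothetical types, not from any real engine's code:
        | 
        |       package search
        | 
        |       import "sort"
        | 
        |       // Result is one hit from one module's corpus.
        |       type Result struct {
        |               URL   string
        |               Score float64
        |       }
        | 
        |       // Module is a domain-specific backend (Wikipedia-only,
        |       // currency conversion, man pages, ...).
        |       type Module interface {
        |               Search(q string) []Result
        |       }
        | 
        |       // Merge fans the query out and mixes the results back
        |       // in according to a per-module weight.
        |       func Merge(q string, weights map[Module]float64) []Result {
        |               var all []Result
        |               for m, w := range weights {
        |                       for _, r := range m.Search(q) {
        |                               r.Score *= w
        |                               all = append(all, r)
        |                       }
        |               }
        |               sort.Slice(all, func(i, j int) bool {
        |                       return all[i].Score > all[j].Score
        |               })
        |               return all
        |       }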
        
         | remram wrote:
         | How would you do ranking though?
         | 
         | It is possible for a site's results to be of different quality:
         | maybe one article about MySQL is not so informative, and an
         | article about Python on the same site is a reference.
         | 
          | The search engine operated by the author is unlikely to
          | account for that.
        
       | chrismsimpson wrote:
       | Another day on hacker news, another roll-your-own search engine,
       | yet I still waste my time pitching to VCs that 'can't see it'.
        
       | fosefx wrote:
       | https://searchhut.org/search?q=js+service+worker
       | 
       | > List of accidents and incidents involving commercial aircraft
        
       | Beltalowda wrote:
       | Tried a few things:
       | 
        | - Beltalowda - no results (for reference: it's a term for
        | "people from the [asteroid] belt" used in The Expanse books
        | and TV series).
       | 
       | - The Expanse - bunch of results, but none are what I'm looking
       | for (the TV series or books). It looks like it may drop the "the"
       | in there?
       | 
       | - Star Trek - a bunch of results, but ordered very curiously; the
       | first is the Wikipedia page for "Star Trek Star Fleet Technical
       | Manual", and lots of pages like "Weapons in Star Trek" and such.
       | 
       | - NGC 3623 - lists "Messier object" and "Messier 65", in that
       | order, which is somewhat wrong as NGC 3623 refers to Messier 65
       | specifically.
       | 
       | - NGC3623 (same as previous, but without a space) - no results.
       | 
       | - vim map key - pretty useless results, most of which have no
       | bearing on Vim at all, much less mapping keys in Vim.
       | 
       | - python print list - the same; The Go type parameters proposal
       | is the first result; automake the second, etc.
       | 
       | Conclusion: "this product is experimental and incomplete" is an
       | understatement.
        
         | memorable wrote:
          | You could say it's in the Garbage stage (though Garbage is a
          | bit harsh for a product that was built in a week).
        
           | Beltalowda wrote:
           | I didn't call the product garbage, just some of the results,
           | which I think is fairly accurate. But I edited it to
           | "useless" now, as that comes off as a bit less harsh.
        
           | smegsicle wrote:
           | Hot Garbage
        
       | metadaemon wrote:
       | Looks cool! Most of the results right now are all Wikipedia
       | though.
        
       | charcircuit wrote:
       | >What's the most popular web server
       | 
        | SearchHut: The first result is Django, which is not the most
        | popular web server.
       | 
       | Google: Shows an answer box with the market share of various web
       | servers.
        
         | lsbehe wrote:
          | This one is actually hilarious, because Google cites the site
          | wrong for me:
          | 
          | > Apache HTTP Server: It is one of the most popular web
          | servers around the world. As of May 2022, Apache holds 31.5%
          | of the market according to W3Techs and 22.99% according to
          | Netcraft.
          | 
          | It's quoting that from
          | https://www.stackscale.com/blog/top-web-servers/ which clearly
          | states Nginx as the top one:
          | 
          | > As of May 2022, Nginx holds 33.5% of the market according
          | to W3Techs and 30.71% according to Netcraft.
        
         | alpaca128 wrote:
          | Considering Google's answer box randomly picked multiple
          | photos of unrelated people as pictures of murderers and rape
          | victims (with Google being very uncooperative about resolving
          | the issue), I'd say the lack of an answer box might not be
          | that bad.
        
           | jfoster wrote:
            | An answer box is the right thing for that query (the web
            | servers one).
            | 
            | The part that Google seems to have unfortunately skimmed
            | over is that the answers need to be relevant, exact &
            | correct.
        
             | alpaca128 wrote:
             | It can certainly be a helpful feature, but I wonder whether
             | it's really better than good, relevant search results
             | presented in a readable way. For example I'd argue the
             | manually curated infoboxes on Wikipedia are likely more
             | reliable than the algorithmic versions Google shows in
             | their results, especially as it's difficult to fix mistakes
             | in Google's version. Google thinks their own solution is
             | the best one because Google made it and so they circumvent
             | the whole page ranking process. Some queries of course need
             | more than just plain search results (see Semantic Web and
             | related things) but for those most engines don't offer
             | enough control and transparency.
             | 
             | But I'm glad people are trying to build alternatives. I'd
             | love a search engine that ignores sites with antipatterns
             | like required registration for any kind of usage, and this
             | is the first step.
        
           | charcircuit wrote:
           | Even if you skip the answer box the first result is a page
           | which breaks down the market share of the most popular web
           | servers.
        
       | heywire wrote:
       | Has anyone experimented with creating a search engine that only
       | indexes the landing page of domains? I'm less interested in
       | another Google, and more interested in a way to find new and
       | interesting sites/blogs/etc. Stumbleupon was great for this back
       | in the day.
       | 
       | Seems like it would be an interesting experiment to see what the
       | results would be, indexing only the content / meta tags of
       | "index.html".
        
         | kordlessagain wrote:
         | I built a solution at https://mitta.us/ that lets you submit
         | the sites you want crawled, and puts them in a self-managed
         | index (which isn't shared globally). I don't do link
          | extraction, but instead let GPT-3 generate URLs based on
          | keywords.
         | 
         | !url <keyterms> |synthesize
         | 
         | I also wrote a screenshot extension for Chrome that lets you
         | save a page when you find it interesting. The site is
         | definitely not "done" but it's usable if you want to try it.
         | Some info in help and in commands is inaccurate/broken, so it
         | is what it is for now.
         | 
         | It does the !google <search term> and !ddg <search term> thing
         | to find pages to save to the index. There are a bunch of other
         | commands I added, and there's an ability for others to write
         | commands and submit them to a Github repo:
         | https://github.com/kordless/mitta-community
         | 
         | !xkcd was fun to write. It shows comics. The rest of the
         | commands can be viewed from !help or just !<tab>
         | 
         | I've been working on pivoting the site to do prompt management
         | for GPT-3 developers and have been kicking around Open Sourcing
         | the other version for use as a personal search engine for
         | bookmarked pages.
        
       | bastardoperator wrote:
        | I'm not a fan of Google, but you can do exactly what this
        | search engine does by curating your own list of domains to
        | search against.
       | 
       | https://programmablesearchengine.google.com
        
         | danskeren wrote:
         | Some downsides with this approach:
         | 
          | - search queries are performed directly from the client's
          | computer, so you can't protect their privacy (the Custom
          | Search JSON API has a daily limit of 10k queries)
          | 
          | - you're forced to use JavaScript, and the way it's
          | implemented makes it difficult if not impossible to do even
          | basic things like the loading animation cards
          | 
          | - ads are loaded from an iframe so you can't do any styling
          | (except for the extremely limited options they make available
          | in their settings, and no matter what, it will be very ugly
          | if you want to have a light/dark theme)
          | 
          | But there are of course many benefits as well, such as it
          | being 'free' (Bing is ridiculously expensive IMO, and it
          | feels impossible to join their ad network to offset the
          | costs... which might explain why you see countless Bing
          | proxies shut down after a few months), and the search results
          | are no doubt better than the ones you'd get from Bing.
        
         | zxwrt wrote:
          | It requires a Google account, has tracking, and isn't open
          | source. I'd say it's a no-go.
        
           | the_biot wrote:
           | Add to that the likelihood that Google will just randomly
           | cancel the product one day. Why invest time in this Google
           | product?
        
             | bastardoperator wrote:
             | It's been around forever, but your concern is real. Who's
             | to say an OSS project won't get archived, or removed from
             | the internet? Why invest time into anything when it will
             | all be replaced eventually?
             | 
             | edit: Looks like this OSS project was launched and
             | cancelled in a single day.
        
               | the_biot wrote:
               | > Looks like this OSS project was launched and cancelled
               | in a single day.
               | 
                | Touché :-)
                | 
                | However, the idea of federated search offers some
                | measure of protection against that. If it ever happens.
        
         | jfoster wrote:
          | Interesting. It looks like Custom Search Engines evolved into
          | this?
          | 
          | I can't tell whether this is a neglected Google product that
          | they were going to refresh but lost interest in, or something
          | that is getting a breath of fresh air.
         | 
         | As you say, I was able to add a list of domains and get some
         | pretty decent results from it. The UI makes me feel like Google
         | are not interested in making it a truly successful product,
         | though.
        
         | a5huynh wrote:
         | For those looking for an alternative to that, I've been
         | building a self-hosted search engine that crawls what you want
         | based on a basic set of rules. It can be a list of domains, a
         | very specific list of URLs, and/or even some basic regexes.
         | 
         | https://github.com/a5huynh/spyglass
        
           | imiric wrote:
           | Great project! Given a local archive of Wikipedia and other
           | sources, this can be very powerful.
           | 
           | Which raises the question: does archive.org offer their
           | Wayback Machine index for download anywhere? Technically, why
           | should anyone go through the trouble of crawling the web if
           | archive.org has been doing it for years, and likely has one
           | of the best indexes around? I've seen some 3rd-party
           | downloaders for specific sites, but I'd like the full thing.
           | Yes, I realize it's probably petabytes of data, but maybe it
           | could be trimmed down to just the most recent crawls.
           | 
           | If there was a way of having that index locally, it would
           | make a very powerful search engine with a tool like yours.
        
       | Sin2x wrote:
       | I hope Drew opens his own Google some day
        
       | [deleted]
        
       | whateveracct wrote:
       | This is super cool - especially the self-hosting angle.
        
       | butz wrote:
       | In addition to curated domains list, some searches would benefit
       | of limiting display of old results, as usually you might find an
       | answer, but solved in jQuery or older version of framework you
       | are using.
        
       | nyanpasu64 wrote:
       | Does Sourcehut offer textual search within a repo's files? GitHub
       | and GitLab offer it, but Codeberg doesn't seem to (and I couldn't
       | find any information about its presence or absence on Sourcehut).
        
         | kevincox wrote:
         | It doesn't appear to be a code search engine. Just a regular
         | search engine focused on code.
         | 
            | Does Sourcegraph index Sourcehut projects? It is a proper code
         | search engine and very good.
        
           | jdorfman wrote:
           | Looks like SourceHut is down so I'm not sure what projects
           | need indexing.
           | 
            | In other news, we now index 87k Rust packages on crates.io
           | 
            | https://sourcegraph.com/search?q=context:global+repo:%5Ecrat...
        
       | mountainriver wrote:
       | SearchHut is a cool name
        
       | tsujp wrote:
       | SearchHut was built to this point in about a week by Drew and
       | contributors which I think is amazing.
       | 
        | It is also meant to be very simple to run in case you want to
        | index your own category of sites. For instance, cooking content
        | is specifically not indexed, but if _you_ wanted to, you could
        | spin up an instance and index cooking sites yourself.
        
         | exitb wrote:
          | This seems like a great idea, honestly. There are niche topics
         | that are very hard to navigate in Google, because it's so
         | skewed towards mainstream topics. I think it would make sense
         | for these communities to maintain their own search engine.
        
       | BlackLotus89 wrote:
        | I wonder how the page ranking will work in the end. A quick look
        | at the source doesn't show (me!) any planning for intelligent
        | ranking. The database has a last_index_date and an authoritative
        | field, which could be used for basic relevance sorting, but
        | nothing exhaustive.
       | 
       | Postgres as backend is maybe not the best choice and there are
       | already many sites that index specific pages and take
       | suggestions. The hard part is getting relevant results when
       | having a large index.
       | 
        | Still, thank you for a new web search engine.
        
         | marginalia_nu wrote:
          | As I understand it, the idea is to have only manually curated,
          | high-quality domains. In that regard, ranking is entirely
          | secondary to BM25. Might work, but it leaves out a lot of
          | long-tail sites that (in my experience at least) often have
          | very good results. It's really the middle segment where most
          | of the shit is.
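          | 
          | For reference, a minimal sketch of textbook BM25 in Go
          | (standard constants; nothing SearchHut-specific):
          | 
          |       package rank
          | 
          |       import "math"
          | 
          |       // bm25 scores one term against one document: tf is the
          |       // term's count in the document, df the number of the n
          |       // documents containing it, dl/avgdl the document's
          |       // length and the corpus average.
          |       func bm25(tf, df, n, dl, avgdl float64) float64 {
          |               const k1, b = 1.2, 0.75
          |               idf := math.Log(1 + (n-df+0.5)/(df+0.5))
          |               return idf * tf * (k1 + 1) /
          |                       (tf + k1*(1-b+b*dl/avgdl))
          |       }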
        
       | verall wrote:
       | I guess cppreference.com isn't even a part of the list?
       | 
       | I tried a couple test queries:
       | 
       | > lambda decay to function pointer c++
       | 
        | I get some FSF pages and the Wikipedia page for Helium?
       | 
       | > std function
       | 
       | I get... tons of Rust docs?
       | 
       | > std function c++
       | 
       | All rust docs? The wikipedia page for C++??
       | 
       | Interesting idea, but this seems like it would be the primary
       | failure mode for an idea like this: as soon as you are
       | researching outside of the curator's specializations, it doesn't
       | have what you're looking for. Yet these results would both be
       | fixed simply by adding cppreference.com to the index. Let's try
       | and give it a real challenge:
       | 
       | > How to define systemverilog interface
       | 
        | And as I might expect, I get Wikipedia pages: for "Verilog",
        | for "System on a Chip", and for "Mixin".
       | 
       | 1st google result:
       | 
       | > An Interface is a way to encapsulate signals into a block...
       | 
       | Working as expected
        
         | ddevault wrote:
         | I added cppreference.com now and kicked off a crawl. It'll be a
         | while. The list of domains is pretty small right now -- it was
         | intended to be bigger before the announcement was made. Will
         | also add RFCs and man pages soon.
         | 
          | There will (soon) be a form to request that new domains be
          | added to the index, so if there are any sites you want indexed
          | which are outside of my personal expertise, you'll be able to
          | request them.
        
           | ZWoz wrote:
            | You've probably already thought about it, but just in case,
            | a feature idea: add moderation support for collaboration,
            | with somewhat-trusted persons vetting niche subjects.
        
         | reidrac wrote:
         | I guess missing content can be explained by...
         | 
         | > Notice! This product is experimental and incomplete. User
         | beware!
         | 
          | But in reality, if you know where the answer you are looking
          | for is, why would you use _that_ search engine?
          | 
          | I use DDG, and if I want to search the Scala docs, I use
          | "!scala whatever I'm searching" instead of just searching
          | with DDG.
        
         | meibo wrote:
          | GitHub doesn't seem to be indexed either. I get that it's a
          | competitor, but not being able to search GitHub is probably a
          | deal breaker for most devs who aren't Drew.
        
           | ddevault wrote:
            | I'm not opposed to indexing GitHub, but the signal-to-noise
            | ratio on GitHub is poor. Nearly all GitHub repositories are
            | useless, so we'd have to filter most of them out. I think
            | instead I'll have to set it up so people can request that
            | specific interesting repositories be added to the index,
            | and maybe crawl /explore to fill in a decent base set.
        
             | interactivecode wrote:
              | Perhaps all repos that have a published package would be
              | a good heuristic. Then you'd at least get the repos of
              | npm, Python, and other packages.
        
               | BlackLotus89 wrote:
               | Some interesting repos have no published packages. A
                | combination of the number of commits, stars, and forks
                | would probably be more relevant.
        
               | yellowapple wrote:
               | And likewise, some uninteresting repos do have published
               | packages.
        
             | marginalia_nu wrote:
             | GitHub is hella tricky to crawl too due to its sheer size
             | and single entry point (meaning slow crawl speed). I've
             | been looking at the problem as well, and so far just
             | ignored it as un-crawlable, but I might do something like
             | crawl only the about pages for repos that are linked to
             | externally some time in the future.
        
               | mdaniel wrote:
                | There's an asterisk to that: they serve the underlying
                | content through two different APIs, so one can side-step
                | the HTML wrapper around the bytes. The discovery phase
                | has a formal API (both REST and GraphQL) for finding
                | repos, and then the in-repo content can be git-cloned,
                | so one can locally index every branch, commit, and blob
                | without issuing hundreds of thousands of HTTP requests
                | to GH. One would still need to hit GH for the issues, if
                | that's in scope, but it'd be far fewer HTTP requests
                | unless your repo is named kubernetes or terraform.
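                | 
                | A sketch of the discovery half against the public REST
                | endpoint (unauthenticated and unpaginated here, so the
                | strict rate limits apply; the query is an arbitrary
                | example):
                | 
                |       package main
                | 
                |       import (
                |               "encoding/json"
                |               "fmt"
                |               "net/http"
                |       )
                | 
                |       func main() {
                |               // %3E is an escaped ">"
                |               u := "https://api.github.com/search/repositories" +
                |                       "?q=stars:%3E1000&per_page=100"
                |               resp, err := http.Get(u)
                |               if err != nil {
                |                       panic(err)
                |               }
                |               defer resp.Body.Close()
                | 
                |               var page struct {
                |                       Items []struct {
                |                               FullName string `json:"full_name"`
                |                               CloneURL string `json:"clone_url"`
                |                       } `json:"items"`
                |               }
                |               err = json.NewDecoder(resp.Body).Decode(&page)
                |               if err != nil {
                |                       panic(err)
                |               }
                |               for _, it := range page.Items {
                |                       // each clone_url can then be fetched
                |                       // with git and indexed locally
                |                       fmt.Println(it.FullName, it.CloneURL)
                |               }
                |       }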
        
               | marginalia_nu wrote:
                | We're still talking about git-cloning a hundred
                | thousand GitHub repos. Git repos get big very fast.
                | That's a lot of data when you're realistically only
                | interested in a few markdown files per repo.
        
       | [deleted]
        
       | egberts1 wrote:
       | Never
       | 
       | Give
       | 
       | Up
       | 
        | This requires paying less attention to negative emotion and
        | more "water off my back".
       | 
        | Tweak it. Tweak it some more.
       | 
       | Focus on the goal, notably one tiny sub-goal at a time.
       | 
       | Good luck, entrepreneurial spirit is a tough beast to attain.
       | 
       | Whatever you stake on,
       | 
       | NEVER
       | 
       | EVER
       | 
       | GIVE
       | 
       | UP
        
       | the_duke wrote:
       | >> https://searchhut.org/about/domains
       | 
       | => 404
        
         | ddevault wrote:
         | Here's the current list:
         | 
         | https://paste.sr.ht/~sircmpwn/0cab5e3137c2c2077b5aabf9e2fc8d...
         | 
          | It was intended to be larger prior to launch. Here are some
          | other domains I want to index:
         | 
         | https://paste.sr.ht/~sircmpwn/84d052f14a9a282698b5e5f7a9d9d9...
        
           | oefrha wrote:
           | sqlite.org is not on the latter list yet. Should be added.
        
           | baisq wrote:
           | I congratulate you for the novel approach, but this is
           | impossible to scale in a way that would make the engine
           | useful.
        
           | mordae wrote:
           | > erowid
           | 
           | Cool.
        
         | wraptile wrote:
         | Same for https://searchhut.org/request
        
       | ocdtrekkie wrote:
       | DDoS Drew for a year and he starts writing code for your next
       | competitor.
        
       | AlphaWeaver wrote:
       | I love that I can self-host this! Are there plans for federation?
       | 
       | Rather than maintaining a whole separate index for myself, I'd
       | love to self-host an instance of this, only indexing sites that
       | aren't in the main index, and then falling back to the main index
       | / merging it with my index to answer queries. I wonder how easy
       | that would be with the current architecture.
        
       | f0xJtpvHYTVQ88B wrote:
        | All 4 search results for "searX" (a self-hostable meta-search
        | engine):
       | 
       | Wikipedia: List of Search engines
       | 
       | Drew's blog: We can do better than DuckDuckGo (perhaps the
       | impetus for this project)
       | 
       | Wikipedia: List of free and open source projects
       | 
       | Wikipedia: Internet Privacy
        
         | dijit wrote:
          | The point of the project is that it's a curated list of sites
          | to crawl; it doesn't make sense to crawl other search engines.
        
       | nonrandomstring wrote:
       | Search results are presently poor. Mostly Wikipedia pages.
       | 
       | But it passes tests that are very important for me:
       | 
       | 1) It's fully accessible by Tor. No CAPTCHAs or "We don't serve
       | your kind in here" messages.
       | 
       | 2) It works in a text browser without JavaScript and renders in a
       | sensible way without style requirements.
       | 
       | 10/10 for accessibility. Something Google and other search
       | engines could learn from.
        
       | ddevault wrote:
       | Good morning, HN. Please note that SearchHut is not done or in a
       | presentable state right now, and those who were in the know were
       | asked not to share it. Alas. I had planned to announce this next
       | week, after we had more time to build up a bigger index, add more
       | features, fix up the 404's and stub pages, do more testing, and
       | so on, so if you notice any rough edges, this is why.
       | 
       | I went ahead and polished up the announcement for an early
       | release:
       | 
       | https://sourcehut.org/blog/2022-07-15-searchhut/
       | 
       | Let me know if you have any questions!
        
         | Chris2048 wrote:
          | Do you intend any of this to merge/cooperate with other
          | similar initiatives?
          | 
          | e.g. opencrawl, internet-archive, archiveteam
          | 
          | It strikes me that the resources to crawl, update, and
          | manage/index data are a common problem.
        
           | ddevault wrote:
           | I intend to at least support other search engines by adding
           | !bangs for them and recommending them in the UI if you didn't
           | find the results you're looking for. I don't think that
           | crawling is something that is easily distributed across
           | independent orgs, though.
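            | 
            | For the curious, bang dispatch is only a few lines; a
            | sketch with a hypothetical table (DDG-style syntax):
            | 
            |       package bang
            | 
            |       import (
            |               "fmt"
            |               "net/url"
            |               "strings"
            |       )
            | 
            |       // engines maps "!name rest-of-query" to another
            |       // engine's results page.
            |       var engines = map[string]string{
            |               "ddg":  "https://duckduckgo.com/?q=%s",
            |               "wiki": "https://en.wikipedia.org/w/index.php?search=%s",
            |       }
            | 
            |       // Redirect reports where to forward a bang query.
            |       func Redirect(q string) (string, bool) {
            |               if !strings.HasPrefix(q, "!") {
            |                       return "", false
            |               }
            |               name, rest, _ := strings.Cut(q[1:], " ")
            |               tmpl, ok := engines[name]
            |               if !ok {
            |                       return "", false
            |               }
            |               return fmt.Sprintf(tmpl, url.QueryEscape(rest)), true
            |       }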
        
         | marginalia_nu wrote:
         | Just a warning from a fellow search engine developer.
         | 
         | If you happen to be cloud hosting this, and if you do not have
         | a global rate limit, implement one ASAP!
         | 
          | Several independent search engines have been hit _hard_ by a
          | botnet soon after they got attention, both mine and wiby.me,
          | and I think a few others. I've had 10-12 QPS of sustained
          | load, week after week, from a rotating set of mostly eastern
          | European IPs.
         | 
         | It's fine if this is on your own infrastructure, but on the
         | cloud, you'll be racking up bills like crazy from something
         | like that :-/
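          | 
          | For a Go service, a global (not per-IP) limit is a few lines
          | with golang.org/x/time/rate; the numbers here are arbitrary:
          | 
          |       package web
          | 
          |       import (
          |               "net/http"
          | 
          |               "golang.org/x/time/rate"
          |       )
          | 
          |       // One process-wide bucket: ~10 req/s, burst of 20.
          |       // Per-IP limits don't help against rotating botnets.
          |       var global = rate.NewLimiter(rate.Limit(10), 20)
          | 
          |       func limited(next http.Handler) http.Handler {
          |               return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          |                       if !global.Allow() {
          |                               http.Error(w, "too many requests",
          |                                       http.StatusTooManyRequests)
          |                               return
          |                       }
          |                       next.ServeHTTP(w, r)
          |               })
          |       }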
        
         | ancientsofmumu wrote:
          | I think it would be great if we had a Code Forge index to
          | search specifically. In this index would be only the myriad
          | code-hosting sites around the internet - shared hosting like
          | GitLab, GitHub, SourceHut, SourceForge, Codeberg, and all the
          | project instances like kernel.org, GNU Savannah, GNOME, KDE,
          | BSD, etc. There are probably hundreds of them out there. And
          | allow people to submit their own self-hosted
          | Gitea/GitLab/sr.ht/etc. instances to be crawled - maybe even
          | suggest a robots.txt entry your crawler could key in on as
          | "yes please index me, hutbot".
        
           | ksherlock wrote:
            | Long ago -- 2006 to 2011 -- Google had a functional source
            | code search engine:
           | https://en.wikipedia.org/wiki/Google_Code_Search
           | 
           | I don't recall if it supported SourceForge and GitHub (2008)
           | but it certainly included gzipped tarballs which were popular
           | and prevalent at the time.
        
         | wraptile wrote:
          | Could you clarify the domain submission rules?
          | 
          | e.g. "Any websites engaging in SEO spam are rejected from the
          | index" - how is it determined whether something is SEO spam
          | or not? More clarification of what's allowed/not allowed
          | would be nice!
        
           | ddevault wrote:
           | The criteria are documented here:
           | 
           | https://searchhut.org/docs/docs/webadmins/requirements/
           | 
           | And there's some advice for web masters on ways to improve
           | your site's ranking without running afoul of this rule:
           | 
           | https://searchhut.org/docs/docs/webadmins/recommendations/
           | 
           | But ultimately, it's subjective, and a judgement call will be
           | made. If it's minor you might get a warning, if it's blatant
           | then you'll just get de-listed.
        
       | stanleychink wrote:
        
       | jona4s wrote:
        | May I know what the limits of the API are?
       | 
       | How many requests per minute / hour are acceptable?
        
         | ddevault wrote:
         | The API limits are not documented yet, like many other things,
         | due to the early launch. For now I'll just say "be good". Don't
         | hit it with a battering ram.
        
       | closedloop129 wrote:
       | >SearchHut indexes from a curated set of domains.
       | 
       | Are there already plans to expand it as a service? E.g.
       | subreddits could maintain their preferred lists of domains.
        
         | xcambar wrote:
         | Also, the curated list is 404.
        
       | fbn79 wrote:
        | Bad SERP... I searched 'mdn a'. Google returns '<a>: The Anchor
        | element - HTML: HyperText Markup Language | MDN'; SearchHut
        | returns a generic 'MDN Web Docs'.
        
         | SahAssar wrote:
          | It seems like it uses PostgreSQL's FTS, which generally drops
          | stop words, so "the", "a", "and", and similar words are
          | ignored. I've been meaning to figure out the best way to deal
          | with this myself, and I'm guessing looking for exact matches
          | first and then running an FTS query could work.
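          | 
          | The behavior is easy to reproduce in psql with the built-in
          | "english" configuration:
          | 
          |       SELECT to_tsvector('english', 'The Anchor element');
          |       -- 'anchor':2 'element':3      ("the" is gone)
          | 
          |       SELECT websearch_to_tsquery('english', 'mdn a');
          |       -- 'mdn'                       (and so is "a")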
        
           | sjamaan wrote:
           | You can write a custom stemming algorithm and load it as an
           | extension library into Postgres, then use that with `CREATE
           | TEXT SEARCH DICTIONARY` to create a custom dictionary. It's
           | not as difficult as it sounds - you can use the default
           | Snowball stemmer as a sample, and tweak it.
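            | 
            | For the stop-word half alone you don't even need an
            | extension: a snowball dictionary declared without a
            | StopWords file keeps every word (a sketch; the names are
            | arbitrary):
            | 
            |       CREATE TEXT SEARCH DICTIONARY english_nostop (
            |           TEMPLATE = snowball, Language = english
            |       );
            |       CREATE TEXT SEARCH CONFIGURATION english_nostop
            |           (COPY = english);
            |       ALTER TEXT SEARCH CONFIGURATION english_nostop
            |           ALTER MAPPING FOR asciiword, word
            |           WITH english_nostop;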
        
             | ignaloidas wrote:
              | It's not just a custom dictionary. Stop words are usually
              | excluded for a reason; you need to understand the nature
              | of the query, and know when to keep what look like stop
              | words from being pruned. It's not really a job for a
              | Snowball stemmer, as you do need to operate over multiple
              | tokens to gather context.
        
           | marginalia_nu wrote:
           | Most of the time, keywords like this come from external
           | anchors as well, which is something that you're gonna be able
           | to leverage with this design (as I understand it).
        
       | jayzalowitz wrote:
        | The curated set of domains page is down.
        | 
        | Also, selfish plug: I think it would be cool if you added
        | Hackernoon to that list.
        
         | yessirwhatever wrote:
         | I think the point is to avoid including low quality websites...
        
       | p-e-w wrote:
       | I like the idea of searching a curated list of domains, but I'm
       | not sure that doing the curation yourself is the best approach
       | considering the huge number of useful but niche websites in
       | existence.
       | 
       | I wonder if simply parsing all of Wikipedia (dumps are available,
       | and so are parsers capable of handling them) and building a list
       | of all domains used in external links would do the trick.
       | Wikipedia already has substantial quality control mechanisms, and
       | the resulting list should be essentially free of blog spam and
       | other low-quality content. Wikipedia also maintains "official
       | website" links for many topics, which can be obtained by parsing
       | the infobox from the article.
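        | 
        | A first-pass sketch of the domain tally in Go -- a crude regex
        | over an uncompressed dump on stdin, standing in for the real
        | wikitext parsing and infobox handling described above:
        | 
        |       package main
        | 
        |       import (
        |               "bufio"
        |               "fmt"
        |               "os"
        |               "regexp"
        |       )
        | 
        |       func main() {
        |               urlRe := regexp.MustCompile(`https?://([^/\s\]"'<>|]+)`)
        |               counts := map[string]int{}
        |               sc := bufio.NewScanner(os.Stdin)
        |               // dump lines can be very long
        |               sc.Buffer(make([]byte, 0, 1<<20), 1<<20)
        |               for sc.Scan() {
        |                       ms := urlRe.FindAllStringSubmatch(sc.Text(), -1)
        |                       for _, m := range ms {
        |                               counts[m[1]]++ // tally the host
        |                       }
        |               }
        |               for host, n := range counts {
        |                       fmt.Println(n, host)
        |               }
        |       }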
        
         | marginalia_nu wrote:
          | I've looked into this and found Wikipedia's links not to be
          | super useful. Wikipedia prefers references that don't change:
          | primarily books, and beyond that websites that don't change,
          | i.e. academic journals where you hit paywalls, and Wayback
          | Machine archives of websites (even if they are still live).
          | You aren't getting much use out of Wikipedia.
        
         | mormegil wrote:
          | Oh, and more to the point: this is a role Wikipedia
          | explicitly renounced, isn't it? When it became so big and
          | Google's PageRank gave it high importance, the spam became
          | unbearable, so Wikipedia decided it needed to change the
          | incentives and applied rel=nofollow to all external links,
          | so that it could stop working as an unpaid manual spam filter
          | for the whole internet. Sure, your new search engine might
          | ignore the rel=nofollow, but if you ever become big enough,
          | the incentives of spammers would lead to bigger spamming
          | pressure on Wikipedia...
        
           | p-e-w wrote:
           | That's easily fixed by relying only on protected or high-
           | profile pages. Those already deal (mostly successfully) with
           | spam and NPOV violations on a daily basis, so piggybacking on
           | those protection mechanisms should yield a fairly high-
           | quality pool of curated external links.
        
             | Cthulhu_ wrote:
              | While that sounds good in theory, who is to say those who
              | can edit protected and high-profile pages aren't in SEO
              | spammers' pockets?
             | 
             | I mean the edit history is public and there's plenty of
             | people that actually pay attention to edits and the like so
             | they would be found out soon enough, but still.
             | 
             | I'm sure this is an ongoing discussion when e.g. political
             | figures' pages are protected as well - who becomes the
             | gatekeeper, and what is their political angle?
        
               | p-e-w wrote:
               | Sure, but at that point you're simply discussing
               | Wikipedia's quality control system, which may be an
               | interesting discussion, but has nothing to do with search
               | engines _per se._
               | 
               | Considering that Wikipedia has become a pillar of most
               | scientific work (imagine writing a math or computer
               | science paper without Wikipedia - utterly unthinkable),
               | it's safe to say that knowledgeable people have
               | collectively decided that its quality control mechanisms
               | are "good enough", or at least better than those of any
               | other resource of comparable depth and breadth.
               | 
               | And that puts Wikipedia's link pool leaps and bounds
               | ahead of whatever dark magic current search engines are
               | using, which mostly seems to be "funnel the entire web
               | through some (easily gamed) heuristic".
        
               | nneonneo wrote:
               | Ehh, I think you've overestimated the quality of
               | Wikipedia on highly specialist topics (like, say, the
               | kinds of things you'd write academic papers on). It's not
               | so much that it doesn't have the content, it's that the
               | coverage is super uneven; sometimes it has extremely
               | detailed articles on a niche theorem in a particular
               | field, and other times it has the barest stub of
               | information on an entire sub-sub-field of study.
        
         | groffee wrote:
          | The problem with only using curated lists is that you kill
          | the discoverability of new sites, but the approach does have
          | promise, as we've seen with Brave's 'Goggles'.
        
         | richardsocher wrote:
          | That's our strategy at you.com - we start with the most
          | popular sites, crawl them and build apps for them (e.g.
          | you.com/apps), and let users vote on search results.
         | 
         | Full disclosure: I'm the CEO.
        
           | rawgabbit wrote:
           | I just learned about you.com from this thread. It looks very
           | promising.
        
           | ngshiheng wrote:
           | > you.com/apps and let users vote
           | 
            | what is stopping companies/users from abusing/gaming this
            | system with bots?
        
           | agileAlligator wrote:
            | Your search actually performed better than Google for me on
            | a random query. I queried both engines with "What is the
            | specific heat of alcohol?". Google threw up a rich search
            | answer that linked to some random Mexican site that is
            | clearly exploiting SEO [1]; you.com linked me to answers.com
            | (which is more trustworthy than a random Mexican website).
           | 
           | [1]:
            | http://elempresario.mx/sites/default/files/scith/specific-he...
        
             | charcircuit wrote:
              | When I search Google, I get a rich search answer of
              | http://hyperphysics.phy-astr.gsu.edu/hbase/Tables/sphtt.html
        
         | mormegil wrote:
         | The "official website" links can be retrieved in a machine-
         | readable way from Wikidata. E.g. a completely random silly
         | example of official websites of embassies: https://w.wiki/5T3R
        
           | 12907835202 wrote:
           | Is there documentation for rolling your own?
           | 
           | I've been considering building my own search engine for a
           | while for my niche topic which has <50 websites and blogs on
           | the web.
           | 
           | I can't tell how useful this will be but it'd be fun to give
           | it a go.
        
             | abujazar wrote:
              | Elastic App Search would be well suited for something
              | like that. It comes with a built-in crawler.
        
             | mormegil wrote:
              | Your own what? Your own query? Sure; it's just a SPARQL
              | query over the Wikidata data model. The documentation
              | portal for the query service is at
              | https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
              | though you'd need some familiarity with Wikidata, its
              | properties, etc. E.g. the "wdt:P856" in my query is the
              | "official website" property on Wikidata:
              | https://m.wikidata.org/wiki/Property:P856
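              | 
              | A minimal query of that shape, runnable at
              | query.wikidata.org:
              | 
              |       SELECT ?item ?itemLabel ?website WHERE {
              |         ?item wdt:P856 ?website .  # official website
              |         SERVICE wikibase:label {
              |           bd:serviceParam wikibase:language "en" .
              |         }
              |       }
              |       LIMIT 100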
        
           | Karrot_Kream wrote:
           | Only problem is that Wikidata is still incomplete when
           | compared to Wikipedia itself. But yeah it's "trivial" to
           | search it.
        
       | Bromeo wrote:
       | The site has just been taken offline by Drew due to the
       | unfortunate start. I hope we can come back to this once the
       | project has been properly launched, although Drew notes that he
       | is "really unhappy with how the roll-out went" and that "my
       | motivation for this project has evaporated" [1].
       | 
       | Thanks for all the work Drew, I hope you guys manage to come to a
       | conclusion that you are satisfied with!
       | 
       | [1]:
       | https://paste.sr.ht/~sircmpwn/048293268d4ed4254659c3cd6abe67...
        
         | sergiotapia wrote:
         | Jesus he needs to take it easy. It's not that big of a deal.
        
         | Xeoncross wrote:
         | Drew has the right to cancel his projects, but I really hope
         | others don't cling to the hopes of "the perfect rollout" with
         | their projects.
         | 
          | Startups and side projects are messy, and sometimes things
          | don't go as planned. Contracts get canceled, a DoS takes down
          | your homepage when you launch (losing all those free leads),
          | people leak new features, and your sixth deployment erases
          | most of the production database.
         | 
          | There are a lot of great ideas that start out as bad as the
          | first release of thefacebook, Airbnb, Twitch, and YouTube.
          | Still, they iterate on these wonky, almost-working sites and
          | end up making something great.
         | 
         | YC pushes this idea constantly; put something in front of
         | people and iterate. Drew was following that advice and I
         | applaud him. https://www.youtube.com/c/ycombinator/videos
        
           | marginalia_nu wrote:
           | The amount of tweaking needed to make a search engine work
           | well can't be overstated either. When you start out, it's
           | inevitably going to be kinda shit. That's fine. Now you need
           | to draw the rest of the owl.
        
             | Xeoncross wrote:
             | Yeah, I agree. I was certainly underwhelmed with my first
             | small search engine. It was so bad even I didn't want to
             | use it - and I had spent months and months on it.
             | 
             | Still, most projects aren't a search engine. I see people
             | put high expectations on how things will go and often it's
             | just really hard to realize some of those hopes.
             | 
             | Sometimes you just have to take what you can get and
             | iterate. Don't give up.
        
               | marginalia_nu wrote:
                | I think, with search engines, it's best to work on them
                | for the problem domain itself - a fractal of interesting
                | programming problems touching on pretty much every area
                | of CS, programming, and programming-adjacent topics -
                | and to take whatever comes out of it as an unexpected
                | bonus.
        
               | Xeoncross wrote:
               | Well said, my next version will be focused on a niche I
               | actually need instead of general search. Still, I haven't
               | finished studying the CS books + 47 algorithms I'll need
               | to actually implement it.
        
         | marginalia_nu wrote:
         | Oh, that's sad :-(
         | 
         | For what it's worth, my search engine got prematurely
         | "announced" for me on HN as well, while likewise hilariously
         | janky. I don't think the launch is the end of the world. (I
         | guess I had the benefit of serving comically bizarre results
         | when the queries failed, so it got some love for that)
         | 
          | The bigger struggle is that, because a search box is so
          | ambiguous, people tend to have very high expectations for
          | what it can do. Many people just assume it's exactly like
          | Google. It's something a lot of indie search engine
          | developers struggle with. Even if your tool is (or
          | potentially could be) very useful, how can you make people
          | understand how to use it when it looks like something else?
          | Design-wise it's a real tough nut to crack, a capital-H Hard
          | UX problem.
        
         | remram wrote:
         | Is the source available somewhere? I'm really interested in the
         | internals, regardless of how the "product"/platform/performance
         | turns out.
         | 
         | [edit: nevermind, found it easy enough. Seems to be Go and
         | PostgreSQL with the RUM extension]
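          | 
          | For the curious, the RUM shape looks roughly like this
          | (a sketch assuming a table pages with a tsvector column tsv;
          | <=> is RUM's rank-distance operator):
          | 
          |       CREATE EXTENSION rum;
          |       CREATE INDEX ON pages USING rum (tsv rum_tsvector_ops);
          | 
          |       SELECT url FROM pages
          |       WHERE tsv @@ websearch_to_tsquery('english', 'searchhut')
          |       ORDER BY tsv <=> websearch_to_tsquery('english', 'searchhut')
          |       LIMIT 10;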
        
       | birken wrote:
       | > SearchHut indexes from a curated set of domains. The quality of
       | results is higher as a result, but the index covers a small
       | subset of the web.
       | 
       | [citation needed]
       | 
        | The quality of the results right now is not very high, and in
        | theory I don't understand why one would expect a search engine
        | with a hand-picked set of domains to outcompete a search engine
        | that can crawl the entire web and determine reputation by
        | itself. This also ignores the fact that a lot of domains have
        | a mix of high-quality content and low-quality content, for
        | example Twitter or Medium. If you are going to rely on
        | domain-level reputation, then your search engine is going to
        | be way behind the search engines that can judge content more
        | specifically, which is all of the other search engines.
       | 
        | If you were to tell me curated domains are just a bootstrapping
        | method and that the search engine will change as it evolves,
        | fine, but right now the search engine is so simplistic that the
        | theory of how it _might_ be good is really the only point. And
        | if that underlying theory is dubious, and the infrastructure is
        | simplistic and obviously won't scale, then I don't know what is
        | interesting or novel about this _right now_. It doesn't seem
        | worthy of reaching the top of HN.
        
         | p-e-w wrote:
         | > If you are going to rely on domain-level reputation then your
         | search engine is going to be way behind the search engines that
         | can judge content more specifically, which is all of the other
         | search engines.
         | 
         | Then why do Google and DuckDuckGo return 90% garbage for most
         | queries?
         | 
         | "All of the other search engines" have _completely failed_ to
         | keep pages from the results that are not only low-quality, but
         | outright spam.
        
           | mda wrote:
            | They definitely do not return "90% garbage for most
            | queries". This is an unsubstantiated claim I see often on
            | HN, and honestly it's not backed by any real data. E.g. you
            | can check your own search history and see for yourself.
        
             | p-e-w wrote:
             | I just tried searching for "python str" on Google. I
             | expected the top result to be a link to the official Python
             | docs for the `str` type, then ideally some relevant
             | StackOverflow questions highlighting common Python issues
             | with strings, bytes, Unicode etc.
             | 
             | Instead, the top result was W3Schools. Then came the Python
             | docs, then 5 pages somewhere between blogspam and poor-
             | quality tutorials. Then a ReadTheDocs page dating to 2015.
             | And that was it. No more official Python resources, no
             | StackOverflow. In the middle of the results some worthless
             | "Google Q&A" dropdowns that lead to more garbage quality
             | content.
             | 
             | So for this query, using my definition of "garbage", the
             | "garbage percentage" is somewhere between 80% and 90+%,
             | depending on how many Q&A dropdowns you waste your time
             | opening.
        
               | jwilk wrote:
               | For me, https://docs.python.org/3/library/stdtypes.html
               | is the top result.
        
               | p-e-w wrote:
               | The fact that the ranking of results for queries that
               | have nothing to do with location-based services depends
               | on where you are located (and, possibly, on whether or
               | not you are logged in) is one of the worst things about
               | Google. And the fact that you can't seem to disable that
               | behavior is even worse.
        
               | goldsteinq wrote:
                | I just tried searching for "python str" on SearchHut,
                | and the top result is the Postgres docs, then the
                | Wikipedia article on empty strings, and then Drew's
                | blog. The official Python docs aren't in the index at
                | all.
        
               | jwilk wrote:
               | For me the second hit is:
               | https://docs.python.org/3/howto/clinic.html
               | 
               | So at least some official Python docs are indexed.
        
               | mda wrote:
                | Afaik Python does not have a str type (I think you
                | meant string?).
                | 
                | You could instead search for "python string" to find
                | more information about Python strings.
                | 
                | Even then, the very first result for "python str" is
                | actually relevant for me (the Python documentation
                | about built-in types).
        
               | jwilk wrote:
               | > Afaik Python does not have a str type
               | 
                | It does have it:
                | 
                |       $ python3 -c 'print(type(""))'
                |       <class 'str'>
        
           | birken wrote:
           | > Then why do Google and DuckDuckGo return 90% garbage for
           | most queries?
           | 
           | If you can give me a list of 10 normal-ish queries where 9
           | out of the first 10 results on Google or DDG are "garbage",
           | then I'll concede your point.
           | 
           | I think you are creating an impossible standard for search
           | engines, then using it to deem the current ones as failures.
            | While at the same time ignoring that this new search engine
            | is, at present, unusable, with no realistic argument for
            | why it might eventually be better.
        
             | p-e-w wrote:
             | See my reply on the sibling comment for an illustrative
             | example.
        
         | slimsag wrote:
         | > Notice! This product is experimental and incomplete. User
         | beware!
         | 
         | Seems like your expectations are misplaced. Being at top of HN
         | is not an indicator of quality, just interest.
        
         | [deleted]
        
         | yjftsjthsd-h wrote:
         | > why one would believe a search engine with a hand picked set
         | of domains would be expected to outcompete a search engine that
         | can crawl the entire web and determines reputation by itself.
         | 
         | Because SEO manipulation is a well developed field, ensuring
         | that the search engines trying to determine reputation
         | automatically will (and does) end up with bad results.
        
           | p-e-w wrote:
           | Indeed. Whatever "smart" algorithm you use to rank results,
           | you can be certain that half the web will turn into
           | adversarial examples once your engine becomes popular enough.
        
       | reachableceo wrote:
        | Quite passive-aggressive if you ask me. Boo-hoo, someone shared
        | your project before you were "ready".
        | 
        | If you don't want something disclosed, don't disclose it.
        | 
        | The only way for three people to keep a secret is if two of
        | them are dead.
       | 
       | A thing is in the world. Let it be in the world. Harness the
       | collective power and focus it into a force multiplier.
       | 
       | Or don't.
        
         | adhall wrote:
         | Where did you read aggression?
        
           | [deleted]
        
         | sevagh wrote:
         | Whole thing sounds made up. "Haha oops one of my fans from IRC
         | totally misunderstood and got me additional publicity UwU"
        
         | 93po wrote:
         | It's not passive aggressive. It's sensitive, but he has a right
         | to be if he wants to. He wasn't petty or mean spirited in his
         | announcement to take it down. He only expressed that he was
         | taking the feedback very hard, which is understandable if you
         | had big plans to roll out and make a good first impression.
        
           | reachableceo wrote:
            | Meh. Hacker News is the place to get actual, real feedback.
            | Frying pan to fire, etc.
           | 
           | Develop a thick skin or don't read the comments lol!
           | 
           | He chose to disclose it to a few people. Word spreads. That's
           | what happens.
           | 
           | Execute NDAs and have a security program if you don't want
           | stuff getting out.
        
       | burlesona wrote:
       | Too bad this came out before Drew intended, and I hope that after
       | having a weekend to rest he'll feel his motivation recover.
       | 
        | One meta-thought: I think projects like this are surfacing
        | something interesting. The underlying technology to make a
        | pretty good search engine is no longer especially difficult,
        | either for programmers or for servers. This is potentially a
        | very good thing, as it means the end of the Google era.
       | 
       | I can imagine a future that is almost a blast from the past,
       | where there are a lot of different search engines, those engines
       | are curated differently, and while none of them index the entire
       | Internet, that's what makes them valuable and better than Google
       | (which I think cannot defeat spam).
       | 
        | I'm trying to think of a historical parallel, where some
        | service used to be very difficult to provide and therefore
        | could only effectively be done by a single natural monopoly,
        | but technology progressed and opened up the playing field,
        | breaking the monopoly. Television has some similarities.
        | Perhaps radio vs. podcasting. What others?
        
         | voidfunc wrote:
          | What Google has is marketing and momentum... it's the
          | ubiquitous search engine.
        
           | marginalia_nu wrote:
           | Google also funnels a lot of traffic to itself through
           | Chrome's search bar, and Firefox does the same. Sure you can
           | replace the search engine, but whatever you replace it with
           | needs to have the same capabilities or the entire model falls
           | apart. Meanwhile, alternative means of navigating the web
           | (such as bookmarks) are made increasingly difficult to
           | access, requiring multiple clicks.
           | 
           | I don't mean to be conspiratorial, I'm sure there are good
           | intentions behind this, the consequence however is
           | effectively locking in Google as the default gateway for the
           | Internet.
        
       ___________________________________________________________________
       (page generated 2022-07-15 23:02 UTC)