[HN Gopher] SearchHut
___________________________________________________________________
SearchHut
Author : tsujp
Score : 362 points
Date : 2022-07-15 04:27 UTC (18 hours ago)
(HTM) web link (searchhut.org)
(TXT) w3m dump (searchhut.org)
| la64710 wrote:
 | Many "404 Not Found" errors.
| xigoi wrote:
| It was not supposed to be released now, OP accidentally shared
| it here because of a misunderstanding.
| tsujp wrote:
| Hey everyone I accidentally shared this too early.
|
| I misinterpreted a don't-share-yet announcement to mean the
| announcement post only and not the entire software and
| announcement. I don't mean this as an excuse; that's just the
| context.
|
 | So this is out before Drew et al. intended it to be, hence the
 | 404s and so forth, as commented on by Drew in this very thread:
 | https://news.ycombinator.com/item?id=32105407
|
| I let my excitement get the better of me this time and I hope
| people revisit SearchHut in a week or so after these quirks are
| resolved.
| skrebbel wrote:
 | With a bit of collaboration I bet we can flag it off the front
| page in no time.
|
| You (or Drew or someone else) can resubmit it later with a
 | bogus query string to skip HN's dupe checker.
| ddevault wrote:
| Cat's out of the bag now.
| skrebbel wrote:
| Ok, unflagged!
| tgv wrote:
 | It's well communicated, so no harm done. And it gives an idea
 | of what kind of "curiosity hit" you can expect when announcing
 | it for real.
| yellow_lead wrote:
| https://searchhut.org/search?q=toilet+cleaner
|
| > Secaucus, New Jersey
|
| Lol
| newbieuser wrote:
| What does it promise as an alternative to other search engines?
| mosfets wrote:
| good
| wrycoder wrote:
| Devault is doing some beautiful things over at his Sourcehut.
|
| I watched him crank out a prototype of this in Go in about three
| hours.
|
| It really feels like the old days at sr.ht. It's fun again.
| kungfufrog wrote:
| Watched as in streamed? Or refreshed the repo page?
|
| How can I watch?
| saucepines wrote:
| Pretty much just a worse version of Wikipedia's search at this
| point.
| synu wrote:
| The about page has some good general info and links:
| https://searchhut.org/about
| flobosg wrote:
| https://searchhut.org/about/domains returns 404, by the way.
| eesmith wrote:
| Lots of broken links from https://searchhut.org/about .
| ISL wrote:
| "Notice! This product is experimental and incomplete. User
| beware!"
|
| :).
| the_biot wrote:
| I think the idea of federation of domain-specific search engines,
| possibly tied together by one or more front-ends, is a brilliant
| idea.
|
| I think it's similar to how Google's search works internally,
| though I doubt the separation is based on a list of domain (as in
| DNS) names. IIRC they have a set of search modules, and what they
 | return (and how fast they return it) all gets mixed into the
 | search results according to some weighting. Right below the ads.
|
| If you look at a search system that way, it's easy enough to add
| modules that do things like search only wikipedia, and display
| those results in a separate box (like DDG), or parse out currency
| conversion requests, and display those up top based on some API
| (like Google). etc
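 |
 | Roughly, in Go, a front-end like that could fan a query out to
 | topic modules and blend the results by weight. A minimal
 | sketch; the types, timeout, and weights here are invented for
 | illustration:
 |
 |     package search
 |
 |     import (
 |         "context"
 |         "sort"
 |         "time"
 |     )
 |
 |     type Result struct {
 |         URL   string
 |         Score float64
 |     }
 |
 |     // A Module is any topic-specific backend: a
 |     // wikipedia-only index, a currency-conversion parser, etc.
 |     type Module interface {
 |         Search(ctx context.Context, query string) []Result
 |     }
 |
 |     type weighted struct {
 |         m Module
 |         w float64 // how much this module counts in the blend
 |     }
 |
 |     // Federated fans out to every module, drops the slow
 |     // ones, and merges the rest by weight, best score first.
 |     func Federated(mods []weighted, q string) []Result {
 |         ctx, cancel := context.WithTimeout(
 |             context.Background(), 300*time.Millisecond)
 |         defer cancel()
 |         ch := make(chan []Result, len(mods))
 |         for _, wm := range mods {
 |             go func(wm weighted) {
 |                 rs := wm.m.Search(ctx, q)
 |                 for i := range rs {
 |                     rs[i].Score *= wm.w
 |                 }
 |                 ch <- rs
 |             }(wm)
 |         }
 |         var merged []Result
 |         for range mods {
 |             select {
 |             case rs := <-ch:
 |                 merged = append(merged, rs...)
 |             case <-ctx.Done(): // too slow; drop its results
 |             }
 |         }
 |         sort.Slice(merged, func(i, j int) bool {
 |             return merged[i].Score > merged[j].Score
 |         })
 |         return merged
 |     }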
| remram wrote:
| How would you do ranking though?
|
 | It is possible for a site's results to be of different quality:
 | maybe one article about MySQL is not so informative, while an
 | article about Python on the same site is the definitive
 | reference.
|
| The search engine operated by the author is unlikely to
| acknowledge that.
| chrismsimpson wrote:
| Another day on hacker news, another roll-your-own search engine,
| yet I still waste my time pitching to VCs that 'can't see it'.
| fosefx wrote:
| https://searchhut.org/search?q=js+service+worker
|
| > List of accidents and incidents involving commercial aircraft
| Beltalowda wrote:
| Tried a few things:
|
 | - Beltalowda - no results (for reference: it's a term for
 | "people from the [asteroid] belt" used in The Expanse books and
 | TV series).
|
| - The Expanse - bunch of results, but none are what I'm looking
| for (the TV series or books). It looks like it may drop the "the"
| in there?
|
| - Star Trek - a bunch of results, but ordered very curiously; the
| first is the Wikipedia page for "Star Trek Star Fleet Technical
| Manual", and lots of pages like "Weapons in Star Trek" and such.
|
| - NGC 3623 - lists "Messier object" and "Messier 65", in that
| order, which is somewhat wrong as NGC 3623 refers to Messier 65
| specifically.
|
| - NGC3623 (same as previous, but without a space) - no results.
|
| - vim map key - pretty useless results, most of which have no
| bearing on Vim at all, much less mapping keys in Vim.
|
| - python print list - the same; The Go type parameters proposal
| is the first result; automake the second, etc.
|
| Conclusion: "this product is experimental and incomplete" is an
| understatement.
| memorable wrote:
 | You could say it's in the Garbage stage (though Garbage is a
 | bit harsh for a product that was built in a week).
| Beltalowda wrote:
| I didn't call the product garbage, just some of the results,
| which I think is fairly accurate. But I edited it to
| "useless" now, as that comes off as a bit less harsh.
| smegsicle wrote:
| Hot Garbage
| metadaemon wrote:
| Looks cool! Most of the results right now are all Wikipedia
| though.
| charcircuit wrote:
| >What's the most popular web server
|
| SearchHut: The first result is django which is not the most
| popular web server.
|
| Google: Shows an answer box with the market share of various web
| servers.
| lsbehe wrote:
 | This one is actually hilarious because Google cites the site
 | wrong for me:
 |
 | > Apache HTTP Server
 |
 | > It is one of the most popular web servers around the world.
 | As of May 2022, Apache holds 31.5% of the market according to
 | W3Techs and 22.99% according to Netcraft.
 |
 | It's quoting that from
 | https://www.stackscale.com/blog/top-web-servers/ which clearly
 | states Nginx as the top one:
 |
 | > As of May 2022, Nginx holds 33.5% of the market according to
 | W3Techs and 30.71% according to Netcraft.
| alpaca128 wrote:
| Considering Google's answer box randomly picked multiple photos
| of unrelated people as pictures of murderers and rape victims
| (with Google being very uncooperative about resolving the
| issue) I'd say the lack of an answer box might not be that bad.
| jfoster wrote:
| An answer box is the right thing for that query. (the web
| servers)
|
| The part that Google seem to have unfortunately skimmed over
| is that the answers need to be relevant, exact & correct.
| alpaca128 wrote:
| It can certainly be a helpful feature, but I wonder whether
| it's really better than good, relevant search results
| presented in a readable way. For example I'd argue the
| manually curated infoboxes on Wikipedia are likely more
| reliable than the algorithmic versions Google shows in
| their results, especially as it's difficult to fix mistakes
| in Google's version. Google thinks their own solution is
| the best one because Google made it and so they circumvent
| the whole page ranking process. Some queries of course need
| more than just plain search results (see Semantic Web and
| related things) but for those most engines don't offer
| enough control and transparency.
|
| But I'm glad people are trying to build alternatives. I'd
| love a search engine that ignores sites with antipatterns
| like required registration for any kind of usage, and this
| is the first step.
| charcircuit wrote:
| Even if you skip the answer box the first result is a page
| which breaks down the market share of the most popular web
| servers.
| heywire wrote:
| Has anyone experimented with creating a search engine that only
| indexes the landing page of domains? I'm less interested in
| another Google, and more interested in a way to find new and
| interesting sites/blogs/etc. Stumbleupon was great for this back
| in the day.
|
| Seems like it would be an interesting experiment to see what the
| results would be, indexing only the content / meta tags of
| "index.html".
| kordlessagain wrote:
| I built a solution at https://mitta.us/ that lets you submit
| the sites you want crawled, and puts them in a self-managed
| index (which isn't shared globally). I don't do link
| extraction, but instead let GPT-3 generate URLs based off
| keywords.
|
| !url <keyterms> |synthesize
|
| I also wrote a screenshot extension for Chrome that lets you
| save a page when you find it interesting. The site is
| definitely not "done" but it's usable if you want to try it.
| Some info in help and in commands is inaccurate/broken, so it
| is what it is for now.
|
| It does the !google <search term> and !ddg <search term> thing
| to find pages to save to the index. There are a bunch of other
| commands I added, and there's an ability for others to write
| commands and submit them to a Github repo:
| https://github.com/kordless/mitta-community
|
| !xkcd was fun to write. It shows comics. The rest of the
| commands can be viewed from !help or just !<tab>
|
| I've been working on pivoting the site to do prompt management
| for GPT-3 developers and have been kicking around Open Sourcing
| the other version for use as a personal search engine for
| bookmarked pages.
| bastardoperator wrote:
| I'm not a fan of google but you can do exactly what this search
| engine does by curating your own list of domains to search
| against.
|
| https://programmablesearchengine.google.com
| danskeren wrote:
| Some downsides with this approach:
|
 | - search queries are performed directly from the client's
 | computer, so you can't protect users' privacy (since the Custom
 | Search JSON API has a daily limit of 10k queries)
|
| - forced to use javascript, and the way it's implemented makes
| it difficult if not impossible to do even basic things like the
| loading animation cards
|
| - ads are loaded from an iframe so you can't do any styling
| (except extremely limited options that they make available in
| their settings, but no matter what then it will be very ugly if
| you want to have a light/dark theme)
|
| But there are of course many benefits as well, such as it being
 | 'free' (Bing is ridiculously expensive IMO, and it feels
 | impossible to join their ad network to offset the costs... which
 | might explain why you see countless Bing proxies shut down
 | after a few months), and the search results are no doubt better
 | than the ones you'd get from Bing.
| zxwrt wrote:
 | It requires a Google account, has tracking, and isn't open
 | source. I'd say it's a no-go.
| the_biot wrote:
| Add to that the likelihood that Google will just randomly
| cancel the product one day. Why invest time in this Google
| product?
| bastardoperator wrote:
| It's been around forever, but your concern is real. Who's
| to say an OSS project won't get archived, or removed from
| the internet? Why invest time into anything when it will
| all be replaced eventually?
|
| edit: Looks like this OSS project was launched and
| cancelled in a single day.
| the_biot wrote:
| > Looks like this OSS project was launched and cancelled
| in a single day.
|
| Touche :-)
|
| However the idea of a federated search is some measure of
| protection against that. If it ever happens.
| jfoster wrote:
 | Interesting. It looks like Custom Search Engines evolved into
 | this?
 |
 | I can't tell whether this is a neglected Google product that
 | they were going to refresh but lost interest in, or something
 | that is getting a breath of fresh air.
|
| As you say, I was able to add a list of domains and get some
| pretty decent results from it. The UI makes me feel like Google
| are not interested in making it a truly successful product,
| though.
| a5huynh wrote:
| For those looking for an alternative to that, I've been
| building a self-hosted search engine that crawls what you want
| based on a basic set of rules. It can be a list of domains, a
| very specific list of URLs, and/or even some basic regexes.
|
| https://github.com/a5huynh/spyglass
| imiric wrote:
| Great project! Given a local archive of Wikipedia and other
| sources, this can be very powerful.
|
| Which raises the question: does archive.org offer their
| Wayback Machine index for download anywhere? Technically, why
| should anyone go through the trouble of crawling the web if
| archive.org has been doing it for years, and likely has one
| of the best indexes around? I've seen some 3rd-party
| downloaders for specific sites, but I'd like the full thing.
| Yes, I realize it's probably petabytes of data, but maybe it
| could be trimmed down to just the most recent crawls.
|
| If there was a way of having that index locally, it would
| make a very powerful search engine with a tool like yours.
| Sin2x wrote:
| I hope Drew opens his own Google some day
| [deleted]
| whateveracct wrote:
| This is super cool - especially the self-hosting angle.
| butz wrote:
 | In addition to a curated domains list, some searches would
 | benefit from limiting the display of old results: often you
 | find an answer, but it's for jQuery or an older version of the
 | framework you are using.
| nyanpasu64 wrote:
| Does Sourcehut offer textual search within a repo's files? GitHub
| and GitLab offer it, but Codeberg doesn't seem to (and I couldn't
| find any information about its presence or absence on Sourcehut).
| kevincox wrote:
| It doesn't appear to be a code search engine. Just a regular
| search engine focused on code.
|
 | Does Sourcegraph index SourceHut projects? It is a proper code
 | search engine and very good.
| jdorfman wrote:
| Looks like SourceHut is down so I'm not sure what projects
| need indexing.
|
| In other news we now index 87k Rust packages on crates.io
|
 | https://sourcegraph.com/search?q=context:global+repo:%5Ecrat...
| mountainriver wrote:
| SearchHut is a cool name
| tsujp wrote:
| SearchHut was built to this point in about a week by Drew and
| contributors which I think is amazing.
|
 | It is also meant to be very simple to run in case you want to
 | index your own category of sites. For instance, cooking content
 | is specifically not indexed, but if _you_ wanted to, you could
 | spin up an instance and index cooking sites yourself.
| exitb wrote:
 | This seems like a great idea, honestly. There are niche topics
| that are very hard to navigate in Google, because it's so
| skewed towards mainstream topics. I think it would make sense
| for these communities to maintain their own search engine.
| BlackLotus89 wrote:
| I wonder how the page ranking will work in the end. A quick look
| at the source doesn't show (me!) any planning for intelligent
| ranking. The database has a last_index_date and an authoritative
 | field. That could be used for basic relevance sorting, but
 | nothing exhaustive.
|
| Postgres as backend is maybe not the best choice and there are
| already many sites that index specific pages and take
 | suggestions. The hard part is getting relevant results once you
 | have a large index.
|
 | Still, thank you for a new web search engine.
| marginalia_nu wrote:
 | As I understand it, the idea is to only have manually curated,
 | high-quality domains. In that regard, ranking beyond BM25 is
 | entirely secondary. Might work, but it leaves out a lot of
 | long-tail sites that (in my experience at least) often have
 | very good results. It's really the middle segment where most of
 | the shit is.
| verall wrote:
| I guess cppreference.com isn't even a part of the list?
|
| I tried a couple test queries:
|
| > lambda decay to function pointer c++
|
| I get some FSF pages and the wikipedia for Helium?
|
| > std function
|
| I get... tons of Rust docs?
|
| > std function c++
|
| All rust docs? The wikipedia page for C++??
|
| Interesting idea, but this seems like it would be the primary
| failure mode for an idea like this: as soon as you are
| researching outside of the curator's specializations, it doesn't
| have what you're looking for. Yet these results would both be
| fixed simply by adding cppreference.com to the index. Let's try
| and give it a real challenge:
|
| > How to define systemverilog interface
|
| And as I might expect, I get wikipedia pages. For "Verilog", for
| "System on a Chip" and for "Mixin".
|
| 1st google result:
|
| > An Interface is a way to encapsulate signals into a block...
|
| Working as expected
| ddevault wrote:
| I added cppreference.com now and kicked off a crawl. It'll be a
| while. The list of domains is pretty small right now -- it was
| intended to be bigger before the announcement was made. Will
| also add RFCs and man pages soon.
|
 | There will (soon) be a form to request that new domains be
 | added to the index, so if there are any sites you want indexed
 | which are outside of my personal expertise, you'll be able to
 | request them.
| ZWoz wrote:
 | You've probably already thought about it, but just in case,
 | a feature idea: add moderation support for collaboration,
 | with somewhat-trusted persons vetting niche subjects.
| reidrac wrote:
| I guess missing content can be explained by...
|
| > Notice! This product is experimental and incomplete. User
| beware!
|
 | But in reality, if you know where the answer you are looking
 | for is, why would you use _that_ search engine?
|
| I use DDG and if I want to search on the Scala docs, I use
| "!scala whatever I'm searching" instead of just search with
| DDG.
| meibo wrote:
 | GitHub doesn't seem to be indexed either. I get that it's a
| but not being able to search GitHub is probably a deal breaker
| for most devs that aren't Drew.
| ddevault wrote:
| I'm not opposed to indexing GitHub, but the signal to noise
| ratio on GitHub is poor. Nearly all GitHub repositories are
| useless, so we'd have to filter most of it out. I think
| instead I'll have to set it up where people can request that
| specific interesting repositories are added to the index, and
| maybe crawl /explore to fill in a decent base set.
| interactivecode wrote:
 | Perhaps all repos that have a published package would be a
 | good heuristic. Then you'd at least get the repos of npm,
 | Python, and other packages.
| BlackLotus89 wrote:
 | Some interesting repos have no published packages. A
 | combination of number of commits, stars, and forks would
 | probably be more relevant.
| yellowapple wrote:
| And likewise, some uninteresting repos do have published
| packages.
| marginalia_nu wrote:
| GitHub is hella tricky to crawl too due to its sheer size
| and single entry point (meaning slow crawl speed). I've
| been looking at the problem as well, and so far just
| ignored it as un-crawlable, but I might do something like
| crawl only the about pages for repos that are linked to
| externally some time in the future.
| mdaniel wrote:
 | There's an asterisk to that: they serve the underlying
 | content through two different APIs, so one can side-step
 | the HTML wrapper around the bytes. The discovery phase
 | has a formal API (both REST and GraphQL) for finding
 | repos, and then the in-repo content can be git-cloned so
 | one can locally index every branch, commit, and blob
 | without issuing hundreds of thousands of HTTP requests to
 | GH. One would still need to hit GH for the issues, if
 | that's in scope, but it'd be far fewer HTTP requests
 | unless your repo is named kubernetes or terraform.
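 |
 | Concretely, the clone-and-index half could look like this
 | in Go (a sketch; indexFile stands in for whatever indexer
 | you use, and discovery via the REST/GraphQL API is left
 | out):
 |
 |     package crawl
 |
 |     import (
 |         "io/fs"
 |         "os/exec"
 |         "path/filepath"
 |         "strings"
 |     )
 |
 |     // CloneAndIndex fetches one repo found via the
 |     // discovery API with a single git operation, then
 |     // indexes files from the local copy instead of
 |     // crawling the HTML UI.
 |     func CloneAndIndex(cloneURL, dir string,
 |         indexFile func(path string)) error {
 |         // --depth 1 keeps the transfer cheap; drop it if
 |         // you really want every branch, commit, and blob.
 |         err := exec.Command("git", "clone", "--depth", "1",
 |             cloneURL, dir).Run()
 |         if err != nil {
 |             return err
 |         }
 |         return filepath.WalkDir(dir,
 |             func(p string, d fs.DirEntry, err error) error {
 |                 if err != nil {
 |                     return err
 |                 }
 |                 if !d.IsDir() &&
 |                     strings.HasSuffix(p, ".md") {
 |                     indexFile(p) // e.g. READMEs and docs
 |                 }
 |                 return nil
 |             })
 |     }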
| marginalia_nu wrote:
 | We're still talking about git-cloning a hundred
 | thousand GitHub repos. Git repos get big very fast.
 | That's a lot of data when you're realistically only
 | interested in a few markdown files per repo.
| [deleted]
| egberts1 wrote:
| Never
|
| Give
|
| Up
|
| This requires less paying attention to negative emotion and more
| "water off my back".
|
| Tweak it. tweak it some more.
|
| Focus on the goal, notably one tiny sub-goal at a time.
|
| Good luck, entrepreneurial spirit is a tough beast to attain.
|
| Whatever you stake on,
|
| NEVER
|
| EVER
|
| GIVE
|
| UP
| the_duke wrote:
| >> https://searchhut.org/about/domains
|
| => 404
| ddevault wrote:
| Here's the current list:
|
| https://paste.sr.ht/~sircmpwn/0cab5e3137c2c2077b5aabf9e2fc8d...
|
 | It was intended to be larger prior to launch. Here are some
 | other domains I want to index:
|
| https://paste.sr.ht/~sircmpwn/84d052f14a9a282698b5e5f7a9d9d9...
| oefrha wrote:
| sqlite.org is not on the latter list yet. Should be added.
| baisq wrote:
 | I congratulate you on the novel approach, but this is
| impossible to scale in a way that would make the engine
| useful.
| mordae wrote:
| > erowid
|
| Cool.
| wraptile wrote:
| Same for https://searchhut.org/request
| ocdtrekkie wrote:
| DDoS Drew for a year and he starts writing code for your next
| competitor.
| AlphaWeaver wrote:
| I love that I can self-host this! Are there plans for federation?
|
| Rather than maintaining a whole separate index for myself, I'd
| love to self-host an instance of this, only indexing sites that
| aren't in the main index, and then falling back to the main index
| / merging it with my index to answer queries. I wonder how easy
| that would be with the current architecture.
| f0xJtpvHYTVQ88B wrote:
| All 4 search results for "searX" (a self hostable meta-search
| engine):
|
| Wikipedia: List of Search engines
|
| Drew's blog: We can do better than DuckDuckGo (perhaps the
| impetus for this project)
|
| Wikipedia: List of free and open source projects
|
| Wikipedia: Internet Privacy
| dijit wrote:
 | The point of the project is that it's a curated list of sites
 | to crawl; it doesn't make sense to crawl other search engines.
| nonrandomstring wrote:
| Search results are presently poor. Mostly Wikipedia pages.
|
| But it passes tests that are very important for me:
|
| 1) It's fully accessible by Tor. No CAPTCHAs or "We don't serve
| your kind in here" messages.
|
| 2) It works in a text browser without JavaScript and renders in a
| sensible way without style requirements.
|
| 10/10 for accessibility. Something Google and other search
| engines could learn from.
| ddevault wrote:
| Good morning, HN. Please note that SearchHut is not done or in a
| presentable state right now, and those who were in the know were
| asked not to share it. Alas. I had planned to announce this next
| week, after we had more time to build up a bigger index, add more
| features, fix up the 404's and stub pages, do more testing, and
| so on, so if you notice any rough edges, this is why.
|
| I went ahead and polished up the announcement for an early
| release:
|
| https://sourcehut.org/blog/2022-07-15-searchhut/
|
| Let me know if you have any questions!
| Chris2048 wrote:
 | Do you intend any of this to merge/cooperate with other similar
 | initiatives?
 |
 | e.g. opencrawl, internet-archive, archiveteam
 |
 | It strikes me that the resources needed to crawl, update, and
 | manage/index data are a common problem.
| ddevault wrote:
| I intend to at least support other search engines by adding
| !bangs for them and recommending them in the UI if you didn't
| find the results you're looking for. I don't think that
| crawling is something that is easily distributed across
| independent orgs, though.
| marginalia_nu wrote:
| Just a warning from a fellow search engine developer.
|
| If you happen to be cloud hosting this, and if you do not have
| a global rate limit, implement one ASAP!
|
| Several independent search engines have been hit _hard_ by a
| botnet soon after they got attention, both mine and wiby.me,
 | and I think a few others. I've had 10-12 QPS of sustained load
 | for weeks on end from a rotating set of mostly eastern
 | European IPs.
|
| It's fine if this is on your own infrastructure, but on the
| cloud, you'll be racking up bills like crazy from something
| like that :-/
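 |
 | For the Go crowd, a global limit is a few lines with
 | golang.org/x/time/rate. A minimal sketch; the 10 QPS
 | figure just mirrors the load described above:
 |
 |     package main
 |
 |     import (
 |         "net/http"
 |
 |         "golang.org/x/time/rate"
 |     )
 |
 |     // One limiter for the whole service, not per-IP: a
 |     // rotating botnet walks right through per-IP limits,
 |     // but a global cap at least bounds compute (and the
 |     // cloud bill).
 |     var global = rate.NewLimiter(rate.Limit(10), 30)
 |
 |     func limited(next http.Handler) http.Handler {
 |         return http.HandlerFunc(
 |             func(w http.ResponseWriter, r *http.Request) {
 |                 if !global.Allow() {
 |                     http.Error(w, "slow down",
 |                         http.StatusTooManyRequests)
 |                     return
 |                 }
 |                 next.ServeHTTP(w, r)
 |             })
 |     }
 |
 |     func main() {
 |         h := http.HandlerFunc(
 |             func(w http.ResponseWriter, r *http.Request) {
 |                 w.Write([]byte("results...\n"))
 |             })
 |         http.Handle("/search", limited(h))
 |         http.ListenAndServe(":8080", nil)
 |     }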
| ancientsofmumu wrote:
 | I think it would be great if we had a dedicated code-forge
 | index to search. It would contain only the myriad code hosting
 | sites around the internet: shared hosts like GitLab, GitHub,
 | SourceHut, SourceForge, Codeberg, and all the project instances
 | like kernel.org, GNU Savannah, GNOME, KDE, BSD, etc. There are
 | probably hundreds of them out there. It could also allow people
 | to submit their own self-hosted Gitea/GitLab/sr.ht/etc.
 | instances to be crawled - maybe even suggest a robots.txt entry
 | your crawler could key in on as "yes please index me, hutbot".
| ksherlock wrote:
| Long ago -- 2006 to 2011 -- google had a functional source
| code search engine:
| https://en.wikipedia.org/wiki/Google_Code_Search
|
| I don't recall if it supported SourceForge and GitHub (2008)
| but it certainly included gzipped tarballs which were popular
| and prevalent at the time.
| wraptile wrote:
 | Could you clarify the domain submission rules?
 |
 | e.g. "Any websites engaging in SEO spam are rejected from the
 | index" - how is it determined whether something is SEO spam or
 | not? More clarification of what's allowed/not allowed would be
 | nice!
| ddevault wrote:
| The criteria are documented here:
|
| https://searchhut.org/docs/docs/webadmins/requirements/
|
 | And there's some advice for webmasters on ways to improve
 | your site's ranking without running afoul of this rule:
|
| https://searchhut.org/docs/docs/webadmins/recommendations/
|
| But ultimately, it's subjective, and a judgement call will be
| made. If it's minor you might get a warning, if it's blatant
| then you'll just get de-listed.
| stanleychink wrote:
| jona4s wrote:
 | May I ask what the limits of the API are?
|
| How many requests per minute / hour are acceptable?
| ddevault wrote:
| The API limits are not documented yet, like many other things,
| due to the early launch. For now I'll just say "be good". Don't
| hit it with a battering ram.
| closedloop129 wrote:
| >SearchHut indexes from a curated set of domains.
|
| Are there already plans to expand it as a service? E.g.
| subreddits could maintain their preferred lists of domains.
| xcambar wrote:
| Also, the curated list is 404.
| fbn79 wrote:
 | Bad SERP... I searched 'mdn a'. Google returns '<a>: The Anchor
 | element - HTML: HyperText Markup Language | MDN'; SearchHut
 | returns a generic 'MDN Web Docs'.
| SahAssar wrote:
 | It seems like it uses PostgreSQL's FTS, which will generally
 | drop stop-words, so "the", "a", "and", and similar words are
 | ignored. I've been meaning to figure out the best way to deal
 | with this myself, and I'm guessing looking for exact matches
 | first and then running a FTS query could work.
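 |
 | Something like this, maybe (Go + database/sql; the table
 | and column names are invented, and websearch_to_tsquery
 | needs Postgres 11+):
 |
 |     import "database/sql" // plus a driver, e.g. lib/pq
 |
 |     // Try exact-ish title matches first; only fall back
 |     // to FTS (which strips stop-words) if nothing matched.
 |     func search(db *sql.DB, q string) (*sql.Rows, error) {
 |         var n int
 |         err := db.QueryRow(
 |             `SELECT count(*) FROM page WHERE title ILIKE $1`,
 |             "%"+q+"%").Scan(&n)
 |         if err == nil && n > 0 {
 |             return db.Query(
 |                 `SELECT url, title FROM page
 |                  WHERE title ILIKE $1`, "%"+q+"%")
 |         }
 |         return db.Query(
 |             `SELECT url, title FROM page
 |              WHERE tsv @@ websearch_to_tsquery('english', $1)
 |              ORDER BY ts_rank(tsv,
 |                  websearch_to_tsquery('english', $1)) DESC
 |              LIMIT 20`, q)
 |     }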
| sjamaan wrote:
| You can write a custom stemming algorithm and load it as an
| extension library into Postgres, then use that with `CREATE
| TEXT SEARCH DICTIONARY` to create a custom dictionary. It's
| not as difficult as it sounds - you can use the default
| Snowball stemmer as a sample, and tweak it.
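 |
 | The wiring side of that is just DDL run once at setup
 | time, e.g. via db.Exec. A sketch; the "myhut" names are
 | invented, and the stock snowball template stands in for
 | your patched stemmer:
 |
 |     // With no StopWords option on the dictionary, words
 |     // like "a" and "the" survive indexing.
 |     const setupFTS = `
 |     CREATE TEXT SEARCH DICTIONARY myhut_stem
 |         (TEMPLATE = snowball, Language = english);
 |     CREATE TEXT SEARCH CONFIGURATION myhut (COPY = english);
 |     ALTER TEXT SEARCH CONFIGURATION myhut
 |         ALTER MAPPING FOR asciiword WITH myhut_stem;`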
| ignaloidas wrote:
| It's not just a custom dictionary. Stop words are usually
| excluded for a reason, you need to understand the nature of
| the query, and when to exclude what looks like stop words
| from being pruned. It's not really a job for a snowball
| stemmer, as you do need to operate over multiple tokens to
| gather context.
| marginalia_nu wrote:
| Most of the time, keywords like this come from external
| anchors as well, which is something that you're gonna be able
| to leverage with this design (as I understand it).
| jayzalowitz wrote:
 | The curated set of domains page is down.
 |
 | Also, selfish plug: I think it would be cool if you added
 | Hackernoon to that list.
| yessirwhatever wrote:
| I think the point is to avoid including low quality websites...
| p-e-w wrote:
| I like the idea of searching a curated list of domains, but I'm
| not sure that doing the curation yourself is the best approach
| considering the huge number of useful but niche websites in
| existence.
|
| I wonder if simply parsing all of Wikipedia (dumps are available,
| and so are parsers capable of handling them) and building a list
| of all domains used in external links would do the trick.
| Wikipedia already has substantial quality control mechanisms, and
| the resulting list should be essentially free of blog spam and
| other low-quality content. Wikipedia also maintains "official
| website" links for many topics, which can be obtained by parsing
| the infobox from the article.
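 |
 | A crude version of that harvest fits in a page of Go (a
 | sketch: stream a decompressed dump on stdin; the regex and
 | the keep-threshold are arbitrary):
 |
 |     package main
 |
 |     import (
 |         "bufio"
 |         "fmt"
 |         "net/url"
 |         "os"
 |         "regexp"
 |     )
 |
 |     // Tally every external-link host in a Wikipedia XML
 |     // dump. Usage:
 |     //   bunzip2 -c enwiki-*-pages-articles.xml.bz2 | harvest
 |     var linkRe = regexp.MustCompile(`https?://[^\s\]|<>"]+`)
 |
 |     func main() {
 |         counts := map[string]int{}
 |         sc := bufio.NewScanner(os.Stdin)
 |         sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // long lines
 |         for sc.Scan() {
 |             ms := linkRe.FindAllString(sc.Text(), -1)
 |             for _, m := range ms {
 |                 if u, err := url.Parse(m); err == nil {
 |                     counts[u.Hostname()]++
 |                 }
 |             }
 |         }
 |         for host, n := range counts {
 |             if n >= 10 { // keep hosts Wikipedia cites often
 |                 fmt.Println(host, n)
 |             }
 |         }
 |     }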
| marginalia_nu wrote:
 | I've looked into this, and found Wikipedia's links not to be
 | super useful. Wikipedia prefers references that don't change,
 | so books primarily, and beyond that websites that don't change:
 | academic journals where you're getting paywalled, and Wayback
 | Machine archives of websites (even if they are still live). You
 | aren't getting much use out of Wikipedia.
| mormegil wrote:
 | Oh, and more to the point: this is a role Wikipedia explicitly
 | renounced, isn't it? When it became so big that Google's
 | PageRank gave it high importance, the spam became unbearable,
 | so Wikipedia decided it needed to change the incentives and
 | applied rel=nofollow to all external links, so that it could
 | stop working as an unpaid manual spam filter for the whole
 | internet. Sure, your new search engine might ignore the
 | rel=nofollow, but if you ever become big enough, the incentives
 | of spammers would lead to bigger spamming pressure on
 | Wikipedia...
| p-e-w wrote:
| That's easily fixed by relying only on protected or high-
| profile pages. Those already deal (mostly successfully) with
| spam and NPOV violations on a daily basis, so piggybacking on
| those protection mechanisms should yield a fairly high-
| quality pool of curated external links.
| Cthulhu_ wrote:
| While that sounds good in theory, who is to say those that
| can edit protected and high profile pages aren't in SEO
| spammers' pockets?
|
| I mean the edit history is public and there's plenty of
| people that actually pay attention to edits and the like so
| they would be found out soon enough, but still.
|
| I'm sure this is an ongoing discussion when e.g. political
| figures' pages are protected as well - who becomes the
| gatekeeper, and what is their political angle?
| p-e-w wrote:
| Sure, but at that point you're simply discussing
| Wikipedia's quality control system, which may be an
| interesting discussion, but has nothing to do with search
| engines _per se._
|
| Considering that Wikipedia has become a pillar of most
| scientific work (imagine writing a math or computer
| science paper without Wikipedia - utterly unthinkable),
| it's safe to say that knowledgeable people have
| collectively decided that its quality control mechanisms
| are "good enough", or at least better than those of any
| other resource of comparable depth and breadth.
|
| And that puts Wikipedia's link pool leaps and bounds
| ahead of whatever dark magic current search engines are
| using, which mostly seems to be "funnel the entire web
| through some (easily gamed) heuristic".
| nneonneo wrote:
| Ehh, I think you've overestimated the quality of
| Wikipedia on highly specialist topics (like, say, the
| kinds of things you'd write academic papers on). It's not
| so much that it doesn't have the content, it's that the
| coverage is super uneven; sometimes it has extremely
| detailed articles on a niche theorem in a particular
| field, and other times it has the barest stub of
| information on an entire sub-sub-field of study.
| groffee wrote:
 | The problem with only using curated lists is that you kill
 | discoverability of new sites, but the approach does have
 | promise, as we've seen with Brave's "Goggles".
| richardsocher wrote:
 | That's our strategy at you.com - we start with the most popular
 | sites, crawl them, and build apps for them (e.g. you.com/apps),
 | and let users vote on search results.
|
| Full disclosure: I'm the CEO.
| rawgabbit wrote:
| I just learned about you.com from this thread. It looks very
| promising.
| ngshiheng wrote:
| > you.com/apps and let users vote
|
 | What is stopping companies/users from abusing/gaming this
 | system with bots?
| agileAlligator wrote:
 | Your search actually performed better than Google for me on a
 | random query. I queried both engines with "What is the specific
 | heat of alcohol?". Google threw up a rich search answer that
 | linked to some random Mexican site that is clearly exploiting
 | SEO [1]; you.com linked me to answers.com (which is more
 | trustworthy than a random Mexican website).
|
| [1]:
 | http://elempresario.mx/sites/default/files/scith/specific-he...
| charcircuit wrote:
| When I search Google I get a rich search answer of
 | http://hyperphysics.phy-astr.gsu.edu/hbase/Tables/sphtt.html
| mormegil wrote:
| The "official website" links can be retrieved in a machine-
| readable way from Wikidata. E.g. a completely random silly
| example of official websites of embassies: https://w.wiki/5T3R
| 12907835202 wrote:
| Is there documentation for rolling your own?
|
| I've been considering building my own search engine for a
| while for my niche topic which has <50 websites and blogs on
| the web.
|
| I can't tell how useful this will be but it'd be fun to give
| it a go.
| abujazar wrote:
 | Elastic App Search would be well suited for something like
 | that. It comes with a built-in crawler.
| mormegil wrote:
| Your own what? Your own query? Sure; it's just a SPARQL
| query over the Wikidata data model. The documentation
 | portal for the query service is at
 | https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...
 | though you'd need
| some familiarity with Wikidata, its properties, etc. E.g.
| the "wdt:P856" in my query is the "official website"
| property on Wikidata:
| https://m.wikidata.org/wiki/Property:P856
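 |
 | Mechanically, fetching those is one HTTP call. A Go sketch
 | against the public query service (the LIMIT keeps it
 | polite; a real harvest would page through results and mind
 | their rate limits):
 |
 |     package main
 |
 |     import (
 |         "fmt"
 |         "io"
 |         "net/http"
 |         "net/url"
 |     )
 |
 |     // Pull (item, official website) pairs from Wikidata
 |     // via SPARQL; wdt:P856 is "official website".
 |     const query = `SELECT ?item ?site
 |                    WHERE { ?item wdt:P856 ?site } LIMIT 50`
 |
 |     func main() {
 |         resp, err := http.Get(
 |             "https://query.wikidata.org/sparql" +
 |                 "?format=json&query=" +
 |                 url.QueryEscape(query))
 |         if err != nil {
 |             panic(err)
 |         }
 |         defer resp.Body.Close()
 |         body, _ := io.ReadAll(resp.Body)
 |         fmt.Println(string(body)) // JSON result bindings
 |     }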
| Karrot_Kream wrote:
| Only problem is that Wikidata is still incomplete when
| compared to Wikipedia itself. But yeah it's "trivial" to
| search it.
| Bromeo wrote:
| The site has just been taken offline by Drew due to the
| unfortunate start. I hope we can come back to this once the
| project has been properly launched, although Drew notes that he
| is "really unhappy with how the roll-out went" and that "my
| motivation for this project has evaporated" [1].
|
| Thanks for all the work Drew, I hope you guys manage to come to a
| conclusion that you are satisfied with!
|
| [1]:
| https://paste.sr.ht/~sircmpwn/048293268d4ed4254659c3cd6abe67...
| sergiotapia wrote:
| Jesus he needs to take it easy. It's not that big of a deal.
| Xeoncross wrote:
| Drew has the right to cancel his projects, but I really hope
| others don't cling to the hopes of "the perfect rollout" with
| their projects.
|
| Startups and side projects are messy and sometimes things don't
 | go as planned. Contracts get canceled, a DoS takes down your
 | homepage when you launch (losing all those free leads), people
 | leak new features, and your sixth deployment erases most of the
 | production database.
|
| There are a lot of great ideas that start out as bad as the
| first release of thefacebook, AirBnB, Twitch and Youtube.
| Still, they iterate on these wonky, almost-working sites and
| end up making something great.
|
| YC pushes this idea constantly; put something in front of
| people and iterate. Drew was following that advice and I
| applaud him. https://www.youtube.com/c/ycombinator/videos
| marginalia_nu wrote:
| The amount of tweaking needed to make a search engine work
| well can't be overstated either. When you start out, it's
| inevitably going to be kinda shit. That's fine. Now you need
| to draw the rest of the owl.
| Xeoncross wrote:
| Yeah, I agree. I was certainly underwhelmed with my first
| small search engine. It was so bad even I didn't want to
| use it - and I had spent months and months on it.
|
| Still, most projects aren't a search engine. I see people
| put high expectations on how things will go and often it's
| just really hard to realize some of those hopes.
|
| Sometimes you just have to take what you can get and
| iterate. Don't give up.
| marginalia_nu wrote:
 | I think, with search engines, it's best to work on them for
 | the problem domain itself. It is a fractal of interesting
 | programming problems touching upon pretty much every area
 | of CS, programming, and programming-adjacent topics; take
 | whatever comes out of it as an unexpected bonus.
| Xeoncross wrote:
| Well said, my next version will be focused on a niche I
| actually need instead of general search. Still, I haven't
| finished studying the CS books + 47 algorithms I'll need
| to actually implement it.
| marginalia_nu wrote:
| Oh, that's sad :-(
|
| For what it's worth, my search engine got prematurely
| "announced" for me on HN as well, while likewise hilariously
| janky. I don't think the launch is the end of the world. (I
| guess I had the benefit of serving comically bizarre results
| when the queries failed, so it got some love for that)
|
| The bigger struggle is, because a search box is so ambiguous,
| people tend to have very high expectations for what it can do.
| Many people just assume it's exactly like Google. It's
| something a lot of indie search engine developers struggle
| with. Even if your tool is (or potentially could be) very
| useful, how can you make people understand how to use it when
| it looks like something else? Design wise it's a real tough nut
| to crack, a capital H Hard UX problem.
| remram wrote:
| Is the source available somewhere? I'm really interested in the
| internals, regardless of how the "product"/platform/performance
| turns out.
|
| [edit: nevermind, found it easy enough. Seems to be Go and
| PostgreSQL with the RUM extension]
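 |
 | For the curious, the RUM part (github.com/postgrespro/rum)
 | boils down to an index like this (a sketch as Go string
 | constants; the table and column names are invented):
 |
 |     // RUM keeps ranking data in the index itself, so
 |     // ordered full-text queries avoid rechecking every
 |     // heap row the way a plain GIN index would.
 |     const schema = `
 |     CREATE EXTENSION IF NOT EXISTS rum;
 |     CREATE INDEX page_tsv_idx ON page
 |         USING rum (tsv rum_tsvector_ops);`
 |
 |     // Ranked query: RUM's <=> "distance" operator orders
 |     // by relevance straight off the index.
 |     const ranked = `
 |     SELECT url, title FROM page
 |     WHERE tsv @@ to_tsquery('english', $1)
 |     ORDER BY tsv <=> to_tsquery('english', $1) LIMIT 20;`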
| birken wrote:
| > SearchHut indexes from a curated set of domains. The quality of
| results is higher as a result, but the index covers a small
| subset of the web.
|
| [citation needed]
|
 | The quality of the results right now is not very high, and in
| theory I don't understand why one would believe a search engine
| with a hand picked set of domains would be expected to outcompete
| a search engine that can crawl the entire web and determines
| reputation by itself. This also ignores the fact that a lot of
| domains have a mix of high quality content and low quality
| content, for example twitter or medium. If you are going to rely
| on domain-level reputation then your search engine is going to be
| way behind the search engines that can judge content more
| specifically, which is all of the other search engines.
|
| If you were to tell me curated domains is just a bootstrapping
| method and as the search engine evolves it will change, fine, but
| right now the search engine is so simplistic that the theory of
| how it _might_ be good is really the only point. And if that
| underlying theory is dubious, and the infrastructure is
 | simplistic and obviously won't scale, then I don't know what is
 | interesting or novel about this _right now_. Doesn't seem worthy
| of reaching the top of HN.
| p-e-w wrote:
| > If you are going to rely on domain-level reputation then your
| search engine is going to be way behind the search engines that
| can judge content more specifically, which is all of the other
| search engines.
|
| Then why do Google and DuckDuckGo return 90% garbage for most
| queries?
|
| "All of the other search engines" have _completely failed_ to
| keep pages from the results that are not only low-quality, but
| outright spam.
| mda wrote:
 | They definitely do not return "90% garbage for most queries".
 | This is an unsubstantiated claim I see often on HN, and
 | honestly it is not backed by any real data. E.g. you can check
 | your own search history and see for yourself.
| p-e-w wrote:
| I just tried searching for "python str" on Google. I
| expected the top result to be a link to the official Python
| docs for the `str` type, then ideally some relevant
| StackOverflow questions highlighting common Python issues
| with strings, bytes, Unicode etc.
|
| Instead, the top result was W3Schools. Then came the Python
| docs, then 5 pages somewhere between blogspam and poor-
| quality tutorials. Then a ReadTheDocs page dating to 2015.
| And that was it. No more official Python resources, no
 | StackOverflow. In the middle of the results sat some worthless
 | "Google Q&A" dropdowns that lead to more garbage-quality
 | content.
|
| So for this query, using my definition of "garbage", the
| "garbage percentage" is somewhere between 80% and 90+%,
| depending on how many Q&A dropdowns you waste your time
| opening.
| jwilk wrote:
| For me, https://docs.python.org/3/library/stdtypes.html
| is the top result.
| p-e-w wrote:
| The fact that the ranking of results for queries that
| have nothing to do with location-based services depends
| on where you are located (and, possibly, on whether or
| not you are logged in) is one of the worst things about
| Google. And the fact that you can't seem to disable that
| behavior is even worse.
| goldsteinq wrote:
| I just tried searching for "python str" on searchhut and
| the top result is Postgres docs, then Wikipedia article
 | for empty strings, and then Drew's blog. The official Python
 | docs aren't in the index at all.
| jwilk wrote:
| For me the second hit is:
| https://docs.python.org/3/howto/clinic.html
|
| So at least some official Python docs are indexed.
| mda wrote:
 | Afaik Python does not have a str type (I think you meant
 | string?).
 |
 | You could instead search for "python string" to find more
 | information about Python strings.
 |
 | Even then, the very first result for "python str" is
 | actually relevant for me (the Python documentation about
 | built-in types).
| jwilk wrote:
| > Afaik Python does not have a str type
|
 | It does have it:
 |
 |     $ python3 -c 'print(type(""))'
 |     <class 'str'>
| birken wrote:
| > Then why do Google and DuckDuckGo return 90% garbage for
| most queries?
|
| If you can give me a list of 10 normal-ish queries where 9
| out of the first 10 results on Google or DDG are "garbage",
| then I'll concede your point.
|
| I think you are creating an impossible standard for search
| engines, then using it to deem the current ones as failures.
| While at the same time ignoring that this new search engine
| is, as present, unusable with no realistic argument for why
| it might eventually be better.
| p-e-w wrote:
| See my reply on the sibling comment for an illustrative
| example.
| slimsag wrote:
| > Notice! This product is experimental and incomplete. User
| beware!
|
| Seems like your expectations are misplaced. Being at top of HN
| is not an indicator of quality, just interest.
| [deleted]
| yjftsjthsd-h wrote:
| > why one would believe a search engine with a hand picked set
| of domains would be expected to outcompete a search engine that
| can crawl the entire web and determines reputation by itself.
|
| Because SEO manipulation is a well developed field, ensuring
| that the search engines trying to determine reputation
| automatically will (and does) end up with bad results.
| p-e-w wrote:
| Indeed. Whatever "smart" algorithm you use to rank results,
| you can be certain that half the web will turn into
| adversarial examples once your engine becomes popular enough.
| reachableceo wrote:
 | Quite passive-aggressive if you ask me. Boo-hoo, someone shared
 | your project before you were "ready".
 |
 | If you don't want something disclosed, don't disclose it.
|
| Only way for three people to keep a secret is if two of them are
| dead.
|
| A thing is in the world. Let it be in the world. Harness the
| collective power and focus it into a force multiplier.
|
| Or don't.
| adhall wrote:
| Where did you read aggression?
| [deleted]
| sevagh wrote:
| Whole thing sounds made up. "Haha oops one of my fans from IRC
| totally misunderstood and got me additional publicity UwU"
| 93po wrote:
| It's not passive aggressive. It's sensitive, but he has a right
| to be if he wants to. He wasn't petty or mean spirited in his
| announcement to take it down. He only expressed that he was
| taking the feedback very hard, which is understandable if you
| had big plans to roll out and make a good first impression.
| reachableceo wrote:
| Meh. Hacker news is the place to get actual real feedback.
| Frying pan to fire etc.
|
| Develop a thick skin or don't read the comments lol!
|
| He chose to disclose it to a few people. Word spreads. That's
| what happens.
|
| Execute NDAs and have a security program if you don't want
| stuff getting out.
| burlesona wrote:
| Too bad this came out before Drew intended, and I hope that after
| having a weekend to rest he'll feel his motivation recover.
|
 | One meta-thought: I think projects like this are surfacing
 | something interesting. The underlying technology to make a
 | pretty good search engine is no longer especially difficult,
 | for programmers or for servers. This is potentially a very good
 | thing, as it could mean the end of the Google era.
|
| I can imagine a future that is almost a blast from the past,
| where there are a lot of different search engines, those engines
| are curated differently, and while none of them index the entire
| Internet, that's what makes them valuable and better than Google
| (which I think cannot defeat spam).
|
 | I'm trying to think of a historical parallel, where some
 | service used to be very difficult to provide and therefore could
 | only effectively be done by a single natural monopoly, but
 | technology progressed and opened up the playing field, breaking
 | the monopoly. Television has some similarities. Perhaps radio vs
 | podcasting. What others?
| voidfunc wrote:
 | What Google has is marketing and momentum... it's the
 | ubiquitous search engine.
| marginalia_nu wrote:
| Google also funnels a lot of traffic to itself through
| Chrome's search bar, and Firefox does the same. Sure you can
| replace the search engine, but whatever you replace it with
| needs to have the same capabilities or the entire model falls
| apart. Meanwhile, alternative means of navigating the web
| (such as bookmarks) are made increasingly difficult to
| access, requiring multiple clicks.
|
| I don't mean to be conspiratorial, I'm sure there are good
| intentions behind this, the consequence however is
| effectively locking in Google as the default gateway for the
| Internet.
___________________________________________________________________
(page generated 2022-07-15 23:02 UTC)