[HN Gopher] Marginalia: 3 Years
___________________________________________________________________
Marginalia: 3 Years
Author : latexr
Score : 227 points
Date : 2024-02-25 14:25 UTC (8 hours ago)
(HTM) web link (www.marginalia.nu)
(TXT) w3m dump (www.marginalia.nu)
| marginalia_nu wrote:
| Throwback that gives some indication of how well, and at the same
| time how questionably, it worked a mere 6 months in:
| https://news.ycombinator.com/item?id=28550764
|
| Though I think now there's a bit too much reddit and
| stackexchange and wikipedia stuff in the default filter.
| RandomWorker wrote:
| I've got this bookmarked and use it to find hyper niche materials
| on numerical modelling. The stuff it finds on solvers, mesh
| generation, and optimization methods is so much better than
| anything I could ever find on Google. Stuff from the 80s and 90s.
| I've found sites written by professionals that I would never find
| on Google. As someone that doesn't just take the commercial
| package off the shelf, getting to that knowledge, and maybe
| finding Fortran code examples, is extremely valuable.
| perihelions wrote:
| Do you have an example of a niche expert page you found useful,
| which is easier to find on Marginalia than Google?
| RandomWorker wrote:
| Search "numerical solver" in any of the major search engines;
| what do you see?
|
| Now do that on Marginalia; you find https://www.scilab.org
| within the first 10 results, which is an open-source
| numerical solver software, which gets me to code, which gets
| me to examples to use.
|
| To be nuanced, I could change my search to "open-source
| Runge-Kutta numerical solver examples" or something better,
| but why? Give me the weird deeply technical stuff first.
|
| Maybe more a HN example; when I wanted to learn about load
| balancers, just search "load balancers."
|
| Google; lots of SEO crap (with soooo many ads), youtube
| videos?, and AWS commercials at the top. No idea where I'm
| going.
|
| Marginalia; a linux wiki, nginx (official documentation), a
| couple blogs by professionals on the topic. yeah, there are
| some things in here that ain't great.
|
| But if I compare apples to apples, the first ten results are
| just so much better. I'd say 8/10 in Marginalia for this
| example are great to good, while the first 10 on Google (things I
| could click on) are by companies that don't teach me anything or
| have articles full of ads.
| dewey wrote:
| I don't think your examples are very good. These are very
| generic search terms, and whether the results are good or bad
| very much depends on the person searching and what they are
| looking for.
| marginalia_nu wrote:
| Underspecified queries ("generic search terms") are
| actually one of the tricky problems in search. The way
| Marginalia Search deals with them caters to a particular
| type of audience and set of usecases. I don't think
| that's wrong. Seems silly to try to cater to everyone. In
| that scenario you're more likely to not cater to anyone.
| subtra3t wrote:
| Ah, this helped clarify my doubts. Thanks.
| eviks wrote:
| Your "domain" filters seem like a good solution to this?
| marginalia_nu wrote:
| Yeah. Some of them are a bit rough still, but that's the
| general idea. Instead of trying to guess what sort of
| content the user wants, it seems like it makes sense to
| just give them the option to express that.
|
| The recipe filter is approaching something I'd want to
| explore further, to be able to provide contextual
| information outside of the search query.
|
| https://search.marginalia.nu/search?query=cookie+recipe
| ignoramous wrote:
| > _particular type of audience and set of usecases_
|
| For signed in accounts (which is pretty much ~3bn
| Android and/or Chrome users), Google can predict what the
| user might prefer and yet...
| marginalia_nu wrote:
| I disagree with this feature. Besides the predictable
| privacy argument, having a search engine transparently
| serve results according to your tastes makes it really
| difficult to find things that are new and outside of your
| existing preferences. It drains the web of serendipity,
| makes every website feel the same.
| 48864w6ui wrote:
| Exactly: marginalia results are good for geeks; I search
| for me, not for J Random Consumer.
| Kamq wrote:
| They're actually perfect examples for this thread.
|
| We've already constrained "what we're looking for" to be
| "niche expert pages" further up thread. If we're seeing
| niche expert pages even for generic search results,
| that's probably a good indication that the search engine
| behaves the way RandomWorker is describing
| kstrauser wrote:
| It's result 10 on Kagi.
| eviks wrote:
| The quoted phrase finds nothing, unquoted also no scilab,
| then realized I've made a typo and it's numerical, not
| numeric, then I get it
|
| Google was ~ top 60, which for such a generic term seems
| fine, not much scrolling down
| bluish29 wrote:
| Kagi gives me scilab as result #13, probably because I raise
| arxiv and stackoverflow results, which then rank high.
| unpopularopp wrote:
| Tried my last 3 Google searches
|
| india test cricket lowest total > None of the results are good or
| giving an answer
|
| raid calculator > The results are OK but you still have random
| noise like a Pokemon save/cheat editor page because it contains
| the word raid
|
| all quiet on the western front movie book differences > 0
| results. Like straight up no hits, an empty page
| marginalia_nu wrote:
| > india test cricket lowest total > None of the results are
| good or giving an answer, straight up wrong sites.
|
| The search engine has no ambitions to provide a knowledge graph
| at this point. It's for finding documents on the internet,
| rather than answering questions. Answering questions is
| definitely something one might want, but it often comes at the
| expense of finding documents.
|
| > raid calculator > The results are OK but you still have
| random noise like a Pokemon save/cheat editor page?
|
| The pokemon result was discussing an application called
| "raidcalc". Seems like a good match, given the search engine
| does not profile you at all and has no clue about what your
| interests are.
|
| > all quiet on the western front movie book differences
|
| Hmm, I think there's an upper bound on the query length you
| hit. Could probably remove this, it's pretty old, an artifact
| from when the query execution didn't deal with long queries
| well.
|
| --edit--
|
| Hmm, I increased the limit but they're still kinda not very
| good. Although this is definitely squarely within the realm of
| what I'm working on next, which is query understanding and
| execution.
|
| Right now the search engine doesn't really know how to group the
| terms. Like a human being can see that you'd want
|
| |all quiet on the western front| in a sequence, preferably in
| the title or appearing a few times, and 'movie', 'book', and
| 'differences' should be important to the document, but not
| necessarily appear in that exact order.
|
| The search engine currently looks for either documents where
| they all appear in proximity, or all individual words have high
| tf-idf relevance markers. Not great for this query.
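To make the two criteria above concrete, here is a minimal sketch of textbook tf-idf plus a "smallest window containing all terms" proximity measure. This is not Marginalia's actual ranking code; the function names `tf_idf` and `min_window` and the toy corpus are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    return tf * math.log(len(corpus) / containing)


def min_window(terms, doc_tokens):
    """Length of the shortest token window containing every term, or None."""
    needed = set(terms)
    if not needed:
        return 0
    counts = Counter()
    covered = 0  # number of distinct needed terms inside the window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] += 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers every term.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            dropped = doc_tokens[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return best


# Hypothetical toy corpus, already tokenized.
corpus = [
    "all quiet on the western front book and movie differences".split(),
    "the movie was quiet".split(),
    "western front history".split(),
]
```

A word that appears in few documents ("book") scores higher than one that appears in many ("western") at equal term frequency, and a small `min_window` value means the query terms sit close together. Neither signal knows that "all quiet on the western front" is one unit, which is the gap described above.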
| klabb3 wrote:
| Is it possible to just quote the title of the book,
| old-school style, so it becomes a single phrase?
|
| It is arguably a better UI than handing a barrage of words
| and hoping the engine does the sense-making.
| marginalia_nu wrote:
| Not yet, the support for long quoted sentences is a bit
| sketchy. Also within the wheelhouse of what's up next
| though. Having solid support for manual grouping is pretty
| much a prerequisite for automatic grouping anyway.
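As a rough illustration of what manual grouping could look like, the sketch below splits a query into quoted phrases (which must appear as an exact token sequence) and loose terms (which may appear anywhere). The helper names `parse_query`, `contains_phrase` and `matches` are hypothetical, and this says nothing about how Marginalia's query parser is actually written.

```python
import re


def parse_query(query):
    """Split a query into quoted phrases and loose (unquoted) terms."""
    phrases = []
    loose = []
    # Each match is either a "quoted run" or a bare whitespace-delimited token.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if quoted:
            phrases.append(quoted.lower().split())
        elif bare:
            loose.append(bare.lower())
    return phrases, loose


def contains_phrase(phrase, doc_tokens):
    """True if the phrase tokens occur contiguously and in order."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))


def matches(query, text):
    """Every phrase must appear in sequence; loose terms may appear anywhere."""
    phrases, loose = parse_query(query)
    tokens = text.lower().split()
    return (all(contains_phrase(p, tokens) for p in phrases)
            and all(term in tokens for term in loose))


query = '"all quiet on the western front" movie book differences'
```

With this split, a page that mentions the title intact plus the three loose words in any order matches, while a page that merely contains all nine words scattered around does not.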
| mdaniel wrote:
| I hope this tone comes across correctly as just a suggestion:
| I get _a lot_ of mileage out of the "Send Feedback" option
| in DDG, which they claim actual humans do read. It can help
| move bug reports out of these HN threads into a more
| context-aware flow, and also makes me feel like any bad outcome
| has the possibility of improving, unlike systems that don't
| provide an "I feel bad about this experience" button
|
| If you were thus inclined,
| https://gitlab.com/glitchtip/glitchtip#glitchtip is the
| actual open source Sentry implementation which (as far as I
| know) would enable gluing
| https://docs.sentry.io/platforms/javascript/user-feedback/#u...
| to the search results page (that client-side library is still
| MIT:
| https://github.com/getsentry/sentry-javascript/blob/7.102.1/... )
| doubloon wrote:
| yup this is definitely a lot like how it used to be.
|
| @unpopularopp can't find "all quiet on the western front book movie
| differences". well you couldn't do that with AltaVista either in
| 1998.
|
| however if you just type "all quiet on the western front" you get
| a ton of niche obscure sites talking about it. literally
| someone's personal blog page.
|
| type in 'polytopes' you get a bunch of university papers and
| code sites.
|
| "rust generics" - again, it's a bunch of mailing list discussions,
| blogs, rust discussion groups, personal websites, obscure
| professional discussions.
|
| this IS how it was back in the day.
|
| my only question is how could this possibly be sustainable
| financially in the long run.
| marginalia_nu wrote:
| > my only question is how could this possibly be sustainable
| financially in the long run.
|
| For now I'm funded by grants and donations, got a few years
| runway that way.
|
| The actual operational cost is like $100/month for colocation +
| personal expenses so what money comes in lasts a surprisingly
| long time. In the future, we'll see. There do seem to be a
| lot of people that want this type of thing to exist though, so
| the hope is if I polish it even more, further funding will
| become available from likeminded people, possibly selling API
| access to other search engines.
|
| Search is notoriously hard to make money from (outside of ads),
| though not having a lot of expenses seems like a reasonable
| path to go.
| 48864w6ui wrote:
| It sounds like you only need one person (not as deep pocketed
| as Andrew Carnegie but who has read "gospel of wealth" and
| agrees with it) to have support for decades if not
| perpetuity.
|
| Universities traditionally have done this sort of thing by
| playing golf and naming buildings, but I'm sure in the 21st
| century there are other models. (Fwiw $2k/yr is below a
| typical golf membership)
| gary_0 wrote:
| I think as long as you're not setting out to start a tech
| company with thousands of employees, or branch out into a
| sector with the word "cloud" in it, you'll be fine. Only
| unreasonably big ambitions cost billions.
|
| A project is usually on the road to success when it starts
| with a disclaimer like "just a hobby, won't be big and
| professional like gnu".
|
| I think a larger concern is how you'll address the Bus Factor
| going forward.
| mdaniel wrote:
| > I think a larger concern is how you'll address the Bus
| Factor going forward
|
| I can't speak to how much energy it is to go from code to
| serving requests, but FWIW the code is AGPLv3 and seems to
| be updated regularly
| https://github.com/MarginaliaSearch/MarginaliaSearch/blob/v2...
| marginalia_nu wrote:
| I recently put some effort into making it possible to run
| and host the system fairly easily[1]. That said, serving
| basic search data and operating a search engine are two
| different things. To do more than index a couple of blogs
| you inevitably need a fairly deep understanding of the
| system, probably decent hardware, and so on.
|
| But the long term goal is that this is something that's
| relatively easy to operate and extend.
|
| [1] https://www.youtube.com/watch?v=PNwMkenQQ24 (quick
| install and demo)
| ttt3ts wrote:
| Cool engine. Going to check out source soon but "ROME2D16-2T"
| returned relevant results from esoteric sources. Useful.
| NeutralForest wrote:
| Congrats on the progress, I don't use marginalia as much as I
| should because I'm so used to relying on Google. It's a wonderful
| project though and I'll prob use it more since spammy SEO sites
| and AI generated answers seem to get more prevalent.
| marginalia_nu wrote:
| Probably some ways away from daily driver material.
| Optimistically sometime this summer when I'm done with the
| query and execution stuff it'll start approaching that
| territory.
| aqfamnzc wrote:
| I've been impressed by the results I see there. And you've chosen
| a sick name for it.
| behnamoh wrote:
| I just looked up "transformers intuition" and the results blew my
| mind. In comparison, Google's results led me to SEO'd websites
| (mostly Medium) and fancy-looking sites with inferior content.
| Awesome work Marginalia!
| 101008 wrote:
| did a search and first results were from "stack exchange sci fi",
| i was expecting something more nostalgic
| marginalia_nu wrote:
| Try the vintage or tilde filter.
|
| https://search.marginalia.nu/search?query=anime&profile=vint...
|
| https://search.marginalia.nu/search?query=anime&profile=tild...
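For anyone who wants to generate such links, here is a small sketch that builds a search URL from the `query` and `profile` parameters visible in the links in this thread. The exact profile names are truncated above, so `PROFILE_NAME` below is a placeholder and not a real filter value; the function name `search_url` is likewise made up for the example.

```python
from urllib.parse import urlencode

# Base URL taken from the links posted in this thread.
BASE_URL = "https://search.marginalia.nu/search"


def search_url(query, profile=None):
    """Build a search link; `profile` selects a filter when given."""
    params = {"query": query}
    if profile is not None:
        params["profile"] = profile
    # urlencode escapes spaces as '+', matching the links above.
    return f"{BASE_URL}?{urlencode(params)}"


# PROFILE_NAME is a placeholder, not an actual filter name.
example = search_url("anime", profile="PROFILE_NAME")
```

Calling `search_url("cookie recipe")` reproduces the recipe link posted earlier in the thread.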
| renegat0x0 wrote:
| Most important lines for me.
|
| It's proving a bit harder than anticipated, not because the
| software can't handle it, but because the signal to noise ratio
| of the web isn't very good; a huge reason why the search engine
| works relatively well is because of what it doesn't index.
| InvOfSmallC wrote:
| Do you offer an API?
| marginalia_nu wrote:
| https://api.marginalia.nu/ :-)
|
| Demo key is always under siege though.
| InvOfSmallC wrote:
| Thanks, I'm building a website focused on Metroidvanias. I
| liked the results so I was thinking I may use it to offer
| some interesting results on the various game pages.
| dreamcompiler wrote:
| Viktor- I'm curious as to whether Common Crawl [0] would be
| useful to you. It's currently around 100TB and 3.35 billion
| pages, so it's going to be a long download unless you process it
| in place on S3. I have no idea what its signal/noise ratio is.
|
| [0] https://commoncrawl.org/overview
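To put "a long download" into numbers, here is a back-of-the-envelope sketch. It assumes decimal terabytes (1 TB = 10^12 bytes) and a link that sustains its full rated speed for the whole transfer; the link speeds are hypothetical examples, not anything stated in the thread.

```python
def transfer_days(size_terabytes, link_gbit_per_s):
    """Days needed to move `size_terabytes` over a fully saturated link."""
    bits = size_terabytes * 1e12 * 8           # decimal terabytes to bits
    seconds = bits / (link_gbit_per_s * 1e9)   # Gbit/s to bit/s
    return seconds / 86_400                    # seconds per day


# Roughly 100 TB, as quoted above, over two assumed link speeds.
days_at_1_gbit = transfer_days(100, 1)
days_at_10_gbit = transfer_days(100, 10)
```

Under these assumptions the transfer takes a bit over nine days at 1 Gbit/s and just under one day at 10 Gbit/s, before any processing.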
___________________________________________________________________
(page generated 2024-02-25 23:01 UTC)