[HN Gopher] Alexandria Search
       ___________________________________________________________________
        
       Alexandria Search
        
       Author : nafnlj
       Score  : 162 points
       Date   : 2022-03-18 16:54 UTC (6 hours ago)
        
 (HTM) web link (www.alexandria.org)
 (TXT) w3m dump (www.alexandria.org)
        
       | drcongo wrote:
       | > _Found 0 results in 0.05s_
       | 
       | I have a test search string that I use to try out search engines,
       | this one didn't do very well.
        
         | reaperducer wrote:
         | Unless you were testing for speed!
        
         | doctor_eval wrote:
         | Negative tests are still tests!
        
       | blinding-streak wrote:
        | Does this support phrase match? (i.e. "a query in quotes").
        | A few tries seem to show that it doesn't. Or, if it does,
        | its corpus is tiny.
        
       | Xeoncross wrote:
       | Thank you for building and sharing this. While many people
       | rightly point out that this isn't a replacement for Google yet,
       | the value of a shared working open source code base has been
       | underestimated many times in the past.
       | 
       | I hope this is a project that grows to solve real needs in this
       | space. However, even if it never makes it past this point, there
       | is a chance someone will be inspired by this to construct their
       | own version. Maybe in a different language with a different
       | storage format or a different way of ranking results.
       | 
       | Thank you for sharing your work.
        
       | 0xbadcafebee wrote:
       | Why don't search engines have filters? Every single consumer
       | retail website's search uses filters to help shoppers find
       | something to buy. It is way more convenient than hoping the user
       | can guess the magic search phrase to find the thing they're
       | looking for (if they even know what that thing is).
        
       | kreeben wrote:
       | I love the shortcut Alexandria takes by indexing Common Crawl
       | instead of crawling the web themselves. It's how I would have
       | bootstrapped a new search engine. In a future iteration they can
       | start crawling themselves, if there is sufficient interest from
       | the public.
       | 
       | Searching is screamingly fast.
       | 
       | The index seems stale, though. Alexandria, how old is your index?
       | 
       | How long did it take you to create your current index? Is that
       | your bottleneck, perhaps, that it takes you a long time (and lots
       | of money?) to create a Common Crawl index?
        
         | joshuamorton wrote:
         | > The index seems stale, though. Alexandria, how old is your
         | index?
         | 
         | Common crawl indexes about once every 40 days, the current
         | crawl's data is through January 2022, so it's 1.5 months old at
         | best.
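          | 
          | For reference, each Common Crawl snapshot publishes a CDX
          | index that you can query by URL, which is what makes the
          | "index without crawling" shortcut practical. A minimal
          | Python sketch that just builds the lookup URL (assuming
          | the public index.commoncrawl.org endpoint and the
          | CC-MAIN-2022-05 crawl ID for the January 2022 crawl):

```python
from urllib.parse import urlencode

def cc_index_query(url, crawl="CC-MAIN-2022-05"):
    """Build a lookup URL for Common Crawl's per-crawl CDX index.

    Fetching the returned URL yields one JSON record per captured
    page (WARC file, offset, length), so an engine can locate page
    content in the crawl archives without crawling the web itself.
    """
    params = urlencode({"url": url, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"
```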
        
       | version_five wrote:
       | There have been a few search engines out recently. I'm curious
       | how people evaluate them quickly.
       | 
       | I've realized my searching is basically optimized for google and
       | the web that has grown up around it. Also, in 1998 I wasn't as
        | aware of what was out there as I am now. It's pretty rare (even
        | if it's possible) that I do a search and come across a completely
        | new site that I haven't heard of before, for anything nontrivial.
       | That was different when search began.
       | 
       | Google is now almost a convenience. If I have a coding question,
       | I search for "turn list of tensors into tensor" or whatever but
        | I'm really looking for SO or the pytorch documentation, and I'll
        | ignore the geeksforgeeks and other SEO spam that finds its way
        | in. It's almost like google is a statistical "portal" page, like
       | Yahoo or one of those old cluttered sites were, that lets me
       | quickly get through the menus by just searching. That's different
       | from a blank slate search like we might have done 25 years ago.
       | 
       | I think what's really lacking now is uncorrupted search for
       | anything that can be monetized. Like I tried to search for a
       | frying pan once on google and it was unusable. I'm not sure any
        | better search engine can fix that; that's why everyone appends
        | "reddit" to queries they want a real opinion on, again, because
        | they are optimizing for the current state of the web.
       | 
       | Anyway, all that to say I think there are a lot of problems with
       | (google dominated) search, but they are basically reflected in
       | the current web overall, so just a better search engine, outside
        | of stripping out the ads, can only do so much. Real improved
        | search efforts need to somehow change the content that's out
        | there at the same time as they improve the experience, and let
        | us know how to, in a simple way, get the most out of it. I think
        | google has a much deeper moat than most people realize.
        
         | chris123 wrote:
          | I'd love to be able to add a tag to a search to have it exclude
          | sites with any kind of monetization. I know that's not
          | realistic cuz that's where Google makes most of its money (or
         | do they make most of their money somewhere other than
         | advertising these days?). Anyways, yeah, I'm sick of SEO
         | optimized, click optimized, advertising optimized, affiliate
         | link optimized crap.
        
         | demopathos wrote:
          | I'm surprised you are not a fan of geeksforgeeks. While each of
          | their webpages has substantially less content than the pytorch
          | docs or SO result, I find that they get to the point instantly.
          | My mean time to solution from G4G is definitely shorter than
          | from SO.
        
           | Seirdy wrote:
           | I generally find that sites like SO, GFG, etc. often play the
           | role of "Reading the Docs as a Service". I prefer using them
           | only after official documentation or specifications fail me.
           | When I want an opinionated answer, I just ping some people I
           | already know or check the personal websites of the developers
           | of the language/tool I'm using. If I have further questions,
           | I check the relevant IRC channel. Sites like SO are a "last
           | resort" for me.
           | 
           | In other words, I'd rather see these at the bottom of the
           | SERP than the top, but I wouldn't want to completely
           | eliminate them.
        
           | version_five wrote:
           | I guess everyone has their go-to sites and their pet peeves.
           | Geeksforgeeks may be less spammy than some, but I still think
           | of it as that annoying site that got in the way of either the
           | SO or documentation answer that I was looking for.
           | 
           | Just to expand, if I want the api reference, say I search for
           | defaultdict (for some reason I like using them but always
           | have to look at the reference), I want the python
           | documentation. I definitely don't want a third party telling
           | me about it.
           | 
            | And if I search a "make list of tensor into tensor" type
            | question, I want SO where someone had asked the same question
            | and got "tensor.stack" as the reply, so I can understand the
            | answer and follow up by looking at the tensor.stack pytorch
            | reference if I want.
           | 
           | Anything else is wasting my time, I think most users with
           | similarly specific queries are not looking for tutorials,
           | they are looking for the names of functions they hypothesize
           | exist, or references. That's why intermediary sites that try
           | to give an explanation are annoying, at least for me.
        
           | RosanaAnaDana wrote:
           | Geeksforgeeks is toxic garbage poisoning the well of good
           | solutions for common problems.
        
           | _bohm wrote:
           | I've found that their content is often inaccurate or written
           | by people who come across as novices. I actually emailed them
           | to correct an inaccuracy in one of their articles once, which
           | they did, so kudos to them for that.
        
         | outcoldman wrote:
         | > I've realized my searching is basically optimized for google
         | 
          | Is it just me, or does Google no longer provide good results
          | for me?
          | 
          | Like every time I search something completely outside my
          | knowledge, like "How to purchase a property in Mexico", it will
          | give me 100+ results with autogenerated content like "10 best
          | places to buy property in Mexico". And the only way to fix that
          | would be to add something like `site:reddit.com`
        
           | pygar wrote:
            | > Is it just me, or does Google no longer provide good
            | results for me?
           | 
           | I am starting to suspect that there might be nothing to find.
           | 
            | I just don't think people (other than the tech oriented) are
            | creating websites and running forums - and why would they?
            | Reddit might be the only place you _can_ find that type of
            | content. What should search engines do then?
            | 
            | With a tiny number of exceptions, it might be that people
            | chat on reddit, read Wikipedia, ask questions on the
            | stackexchange network/Quora, local communities use facebook
            | groups, and businesses have a wordpress site with nothing
            | more than a bit of fluff, a phone number and an email
            | address.
        
           | daptaq wrote:
           | Might be an instance of Goodhart's law:
           | https://en.m.wikipedia.org/wiki/Goodhart's_law
           | 
           | If all websites try to optimise for SEO, they undermine the
           | assumption that the evaluation of a search engine is the pure
           | consequence of how well a site satisfies a query.
        
         | vishnugupta wrote:
         | > I'm curious how people evaluate them quickly.
         | 
         | Speaking about myself; I cold turkey migrated to DDG ~2 months
         | ago. So far I've had to resort to Google search 10 times or so.
         | 
         | One thing I miss though is Google's nice visualisation of fast
         | changing results e.g., match scores. For example:
         | https://imgur.com/a/Q5nZkjo
        
           | Seirdy wrote:
           | DDG's organic link results are from Bing, sans
           | personalization. DuckDuckGo advertises using "over 400
           | sources", which means that at least 399 sources only power
           | infoboxes ("instant answers") and non-generalist search, such
           | as the Video search.
        
         | ffhhj wrote:
         | > There have been a few search engines out recently
         | 
         | I'd like to try them out, could you mention which?
        
           | version_five wrote:
           | you.com and kagi.com off the top of my head
        
             | Seirdy wrote:
             | I've been keeping my eye on You.com, tracking a few SERPs
             | over time compared to other Bing- and Google-based engines.
             | So far, the results don't seem independent.
             | 
             | Try comparing results with a Bing-based engine (e.g.
             | DuckDuckGo) or a Google-based one (e.g. StartPage, GMX) to
             | see if they differ. (Don't use Google or Bing directly,
             | since results will be personalized based on factors like
             | location, device, your fingerprint, etc.).
        
           | Seirdy wrote:
           | I listed a bunch over at
           | https://seirdy.one/2021/03/10/search-engines-with-own-
           | indexe..., and I'm always adding more.
           | 
           | I first discovered Alexandria in early February: https://git.
           | sr.ht/~seirdy/seirdy.one/commit/935b55f10f9024ee...
           | 
           | Around the same time, I also discovered sengine.info, Artado,
           | Entfer, and Siik. By sheer coincidence they all were
           | mentioned to me or decided to crawl my site within the same
           | couple weeks. So yes, from my perspective there have been
           | more than a few new smaller engines getting active on the
           | heels of bigger names like Neeva, Kagi, Brave Search, etc.
        
           | lolinder wrote:
           | I've been using the Kagi beta for a few months now, and it's
           | awesome: https://kagi.com/
           | 
           | The biggest thing I've found is that when doing technical
           | searches it always turns up the sources I'm actually looking
           | for, actively filtering all the GitHub/StackOverflow copycat
           | sites.
           | 
           | It also seems to up-weight official docs compared to Google.
           | For example, "how to read a json file in python" turns up the
           | Python docs as the second result, where in Google they're
           | nowhere to be found.
        
         | amelius wrote:
         | > I'm curious how people evaluate them quickly.
         | 
         | Are there search benchmarks to be found somewhere?
         | 
         | There must be. If you want to write a search engine, you need a
         | way to validate the results.
        
           | marginalia_nu wrote:
           | There are benchmarks within the adjacent field of information
           | retrieval, but in general it's hard to properly validate a
           | search engine because real data is so noisy and misbehaved,
           | and sample data is so different from real data.
        
             | kreeben wrote:
              | Sure, the problem of information retrieval is not exactly
              | that of web search but they're pretty close. So, from
              | someone as knowledgeable on this topic as yourself: could
              | you remind us what some of those benchmarks are?
        
               | marginalia_nu wrote:
               | https://en.m.wikipedia.org/wiki/Precision_and_recall
               | 
               | For some standard corpus.
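                | 
                | (Both metrics are a few lines to compute once you
                | have relevance judgments for a query - a toy
                | Python sketch, not tied to any particular corpus:)

```python
def precision_recall(retrieved, relevant):
    """Precision: what fraction of the retrieved documents are
    relevant. Recall: what fraction of the relevant documents
    were retrieved. Judged against a standard corpus, these are
    the classic IR evaluation numbers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Engine returned 4 documents; the judgments mark 3 as relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```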
        
               | kreeben wrote:
               | >> Precision and recall
               | 
               | The phrase I was looking for. Thx a bunch! Gonna
               | marginalia that now.
        
               | marginalia_nu wrote:
               | Haha, ironically it lacks both precision and recall for
               | that topic.
        
           | jll29 wrote:
            | The "Web track" task at the annual US NIST TREC conference (
            | https://trec.nist.gov/ ) is an open innovation benchmark that
            | everyone can contribute to; participants get a set of queries
            | that they have to run on exactly the same corpus. Then they
            | return the top-k results to a team that evaluates them.
            | 
            | Here is an example (2014) Web track paper from the 23rd TREC:
            | https://trec.nist.gov/pubs/trec23/papers/overview-web.pdf
            | (TREC has a plenitude of different benchmark tasks and you
            | can submit your own:
            | https://trec.nist.gov/pubs/trec29/trec2020.html - recent TREC
            | 2020 papers)
        
         | zeta0134 wrote:
         | I've been wondering for a while now about building a search
         | engine for the ad free web. That is, penalize or outright
         | refuse to index any recognized advertising network, letting
         | through only those sites which don't perform invasive tracking
         | with third party services. Mostly as a curiosity: what would be
         | left? What would rise to the top when you filter all of that
         | out?
        
           | randomsilence wrote:
           | Check the 'Small or non-commercial Web' search engines on
           | this overview page: https://seirdy.one/2021/03/10/search-
           | engines-with-own-indexe...
        
           | not2b wrote:
           | Wikipedia.
        
             | not2b wrote:
              | Evidently someone disagrees, but Wikipedia has no ads or
              | trackers except on the home page, and its pages rank highly
              | on current search engines, so if you exclude trackers and
              | ads that's what you're going to get.
        
           | version_five wrote:
           | I've thought about something similar, basically "the good
           | internet" that would be a hand curated list of sites that are
           | not there just as a pretext for ads. I think a lot of
           | software project documentation qualifies, lots of stuff on
           | university sites like lectures notes for example. I assume
           | that across different niches there is other stuff like this.
           | I think the key would be something that can't be gamed, like
           | it has to be legitimate content that is online for an
           | existing purpose and not as a pretext.
        
         | danuker wrote:
         | > because they are optimizing for the current state of the web.
         | 
         | I believe people will at least start looking for alternatives.
         | For example, I have been collecting search engines, and
         | whenever I encounter a page with too many commercial-laden SEO-
         | porked results, I use a different search engine in Firefox.
         | 
         | I have enabled the Search Bar, I can do Alt+D, Tab, Tab, enter
         | my query, then click a different search engine, which searches
         | instantly, unlike the main bar, where you have to press Enter
         | once more after clicking.
         | 
         | I just added this one also. See my collection:
         | https://0bin.net/paste/ZSCRYVx1#sxD+jBIpScJismXBYwaoJPh75TH9...
        
           | blewboarwastake wrote:
            | Pro tip: Alt+E takes you directly to the search bar, then you
            | can press Tab for selecting the search engine. The best part
            | is that you never use the mouse this way. You can also use
            | ddg bangs from the address bar (Alt+D); they cover just about
            | every search engine/site if you remember the bang for it.
        
           | coverband wrote:
           | Could you paste again as text links? Thanks.
        
         | 8n4vidtmkvmk wrote:
         | maybe the solution is to make google itself reddit style. let
         | users downvote the seo spam websites and allow them to be
         | downranked. sure it opens the door for a different kind of
         | abuse... but maybe that problem is more fixable?
        
           | seltzered_ wrote:
           | I feel like this was tried a decade ago: https://developers.g
           | oogle.com/search/blog/2011/03/introducin... ("Introducing the
           | +1 button", Google, 2011)
        
         | karmab wrote:
         | I sEeU
        
         | Seirdy wrote:
         | > I'm curious how people evaluate them quickly.
         | 
         | To paint with a broad brush, I look at three criteria:
         | 
         | 1. Infoboxes ("instant answers") should focus on site previews
         | rather than trying to intelligently answer my question. Most
         | DuckDuckGo infoboxes are good examples of this; Bing and Google
         | ones are too "clever".
         | 
         | 2. Organic results should be unique; most engines are powered
         | by a commercial Bing API or use Google Custom Search. I
         | described my methodology near the bottom of my collection of
         | independent search engines:
         | https://seirdy.one/2021/03/10/search-engines-with-own-indexe...
         | 
         | 3. "other" stuff. Common features I find appealing include
         | area-specific search (Kagi has a "non-commercial lens" mostly
         | powered by its Teclis index; Brave is rolling out "goggles"),
         | displaying additional info about each result (Marginalia and
         | Kagi highlight results with heavy JS or tracking), user-driven
         | SERP personalization (Neeva and Kagi allow promoting/demoting
         | domains), etc.
         | 
         | And always check privacy policies, TOS, GDPR/CCPA compliance,
         | etc.
         | 
          | > Google is now almost a convenience. If I have a coding
          | question, I search for "turn list of tensors into tensor" or
          | whatever but I'm really looking for SO or the pytorch
          | documentation, and I'll ignore the geeksforgeeks and other seo
          | spam that finds its way in. It's almost like google is a
          | statistical "portal" page,
         | 
         | I like engines like Neeva and Kagi that allow customizing SERPs
         | by demoting irrelevant results; I demote crap like GFG,
         | w3schools, tutorialspoint, dev(.)to, etc. and promote official
         | documentation. Alternatively, you can use an adblocker to block
         | results matching a pattern: https://reddit.com/hgqi5o
        
       | Minor49er wrote:
       | This is really fast and cool. Looking for music-related pages,
       | I've already found some interesting websites, like Wall of
       | Ambient, which caters to ambient labels
       | (https://wallofambient.com/#)
       | 
       | I noticed that if a term can't be found, there will be a random
       | number of results that it says were found, but nothing is
       | actually displayed. Eg:
       | https://www.alexandria.org/?c=&r=&q=moonmusiq
       | 
       | I'll keep trying this out. It seems really promising
        
       | bghfm wrote:
       | How can we help improve the project? Usage, feedback?
        
       | rprenger wrote:
        | For my first search of "GFlowNetworks" (which the search bar
        | suggested), it said "Found 5,887 (or something) results" but
        | showed no results.
       | 
       | For my second I searched my name and got a Wikipedia article
       | about a show I've never heard of which didn't have my name
       | anywhere in it.
       | 
        | For my third I searched "GFlowNetworks" again and it said "Found
        | 2,656,844 results in 1.61s", but showed no results again.
        
         | marginalia_nu wrote:
         | I can't even find results for "GFlowNetworks" on google.
        
       | waynecochran wrote:
       | In a nutshell, what is the fundamental difference with this
       | search engine compared to others?
        
         | stazz1 wrote:
         | >About Alexandria
         | 
         | Alexandria.org is a non-profit, ad free search engine. Our goal
         | is to provide the best available information without
         | compromise.
         | 
         | The index is built on data from Common Crawl and the engine is
         | written in C++. The source code is available here.
         | 
         | We are still at an early stage of development and running the
         | search engine on a shoestring budget.
         | 
         | Please contact us at -email- if you want to get involved, want
         | to support this initiative or have any questions.
        
           | waynecochran wrote:
           | But what is different in terms of its indexing algorithm? The
           | original secret sauce for google was the pagerank algorithm
            | which was mathematically genius. Are you using a similar
            | algorithm?
        
             | josefcullhed wrote:
             | Founder here. We are using harmonic centrality instead of
             | pagerank. But of course much more work needs to be done to
             | make the search engine usable.
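              | 
              | For the curious: harmonic centrality scores a page
              | by summing the reciprocal shortest-path distances
              | from every other page, so unreachable pages simply
              | contribute zero and the measure stays well-defined
              | on a disconnected web graph. A minimal Python
              | sketch of the idea (illustrative only - not the
              | engine's actual C++ implementation):

```python
from collections import deque

def harmonic_centrality(graph):
    """Harmonic centrality over incoming paths:
    HC(v) = sum over u != v of 1 / d(u, v),
    where d(u, v) is the shortest-path length and unreachable
    pairs contribute 0 (unlike closeness centrality, this stays
    well-defined when the graph is disconnected)."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    scores = {v: 0.0 for v in nodes}
    for source in nodes:
        # BFS from `source` to get shortest distances d(source, v)
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if v != source:
                scores[v] += 1.0 / d
    return scores
```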
        
         | Minor49er wrote:
         | Their About page has you covered:
         | 
         | > Alexandria.org is a non-profit, ad free search engine. Our
         | goal is to provide the best available information without
         | compromise.
         | 
         | > The index is built on data from Common Crawl and the engine
         | is written in C++. The source code is available (at
         | https://github.com/alexandria-org#).
         | 
         | Edit: formatting
        
           | greenyoda wrote:
           | More about Common Crawl:
           | https://en.wikipedia.org/wiki/Common_Crawl
           | 
            | > _Common Crawl is a nonprofit 501(c)(3) organization that
            | crawls the web and freely provides its archives and datasets
            | to the public. Common Crawl's web archive consists of
            | petabytes of data collected since 2011. It completes crawls
            | generally every month. ..._
        
       | tandr wrote:
       | I think the fact that after a long while there are new search
       | engines (Kagi was introduced very recently on HN, now this)
       | should be a wake up call for Google - their search has lost some
       | shine for quite a while. Hopefully something will come out of
       | this - competition is good.
        
         | blinding-streak wrote:
         | Competition is definitely good. But this thing is a toy
         | compared to not just Google, but all the other major search
         | players out there. Hopefully it will continue to advance.
        
         | bmmayer1 wrote:
         | My first search on Alexandria was "UTC time". Google gives me
         | the current time in UTC, which is all I needed. Alexandria gave
         | me...a lot of links to click to find what I'm looking for.
         | 
         | Google search is a lot better than people give it credit for.
        
           | aghilmort wrote:
            | we're exploring adding instant answers in a clean way at
            | Breeze; leaning towards using an open-source library &/or
            | external API to compute vs. building in-house
            | 
            | also adding a premium tier that's alerts + ad free + feeling
            | lucky, which would take the user to the top result - here a
            | UTC page, re: https://breezethat.com/?q=UTC+time
        
             | NoahTheDuke wrote:
             | I just tried breezethat and had to scroll past 6 ads (two
             | screenfuls on my iPhone 10) to see a single result. I know
             | ads are necessary but this is punitive.
        
               | aghilmort wrote:
               | yes, mobile is awful right now; we've mostly fixed that
               | issue on laptop / desktop
               | 
                | 4 of the 6 are google's and we have to include them --
                | we're iterating on some designs internally that refactor
                | how they're presented on mobile
        
           | s0rce wrote:
           | I agree, Google's instant answers are quite good and have
           | improved but actually searching to find a site seems to be
           | getting worse and is riddled with paid sites on the top.
        
           | rambambram wrote:
           | This doesn't add up, at all. I have a clock on my computer.
           | This new search engine doesn't function like a clock for you,
           | so Google Search is better.
        
           | boomboomsubban wrote:
           | When I need a clock I'll be sure to use Google.
        
           | tokai wrote:
           | You still have the current UTC time in the top links. Googles
           | knowledge graph is a part of the problem with their results.
        
           | marginalia_nu wrote:
           | Strictly speaking, that's more in the domain of BonziBuddy or
           | Alexa than internet search engine.
           | 
           | What Google arguably struggles with is surfacing relevant
           | documents, that is... search.
        
           | the-dude wrote:
           | Kagi does this. I have switched 100% to Kagi, not affiliated.
        
         | amelius wrote:
         | At some point, AI and NLP and raw processing power will have
         | progressed so much that "search" is not a problem anymore, and
         | I think we're getting there. Google can up their game but it
         | won't matter much. The only thing they have left is brand
         | recognition.
        
           | tbihl wrote:
           | Google has necessarily arbitrary criteria by which pages are
           | ranked. Because Google is _the_ game in town, anyone with a
           | primary goal of driving traffic will pursue those metrics
           | (i.e. SEO). To the extent that those criteria deviate even
           | slightly from actual good results, large parts of the
           | internet will dilute their content to pursue them, which both
           | lowers their quality and further drives down the gems of the
           | internet.
           | 
           | The ranking would have to vary over an infinite spread of
           | purposes for webpages, and it would have to converge almost
           | perfectly to what is actually most helpful. Among all the
           | technical problems, Google will not optimize correctly
           | against ads for the same reason that websites trying to drum
           | up affiliate purchases and ad revenue won't put content
           | quality above SEO.
           | 
            | When recipes return to having the recipe and ingredients
            | first, followed by an optional life story, I'll revisit my
            | assessment.
        
           | jll29 wrote:
            | Google Research is also one of the top (NLP|IR) R&D gigs in
            | town - they developed BERT, a model that has re-defined how
            | NLP is done, and the paper describing it had already
            | collected 800 citations by the time it was formally
            | published, thanks to the pre-print spreading like wildfire.
            | 
            | This technology is now part of Google search.
        
           | orlp wrote:
           | IMO search has had its goalpost moved. It used to be about
           | scale, technical challenges, bandwidth, storage, etc. It is
           | still about that, but a significantly harder challenge to
           | solve has come up: searching in a malicious environment. SEO
           | crap nowadays completely dominates search, Google has lost
           | the war.
           | 
           | Simply put, I believe that Google sucks at search, in the
           | modern context. It is great at indexing, it has solved
           | phenomenal technical challenges, but search it has not
           | solved. Why do I have to write site:stackoverflow.com or
           | site:reddit.com to skip the crap and go to actual content?
           | Why can my brain detect blogspam garbage in 0.5 seconds of
           | looking but billion dollar company Google will happily
           | recommend it as the most relevant result above a legitimate
           | website?
           | 
           | I feel this 12 year old XKCD is still relevant:
           | https://xkcd.com/810/ .
        
             | new_guy wrote:
             | > but billion dollar company Google will happily recommend
             | it as the most relevant result above a legitimate website?
             | 
             | Because the site most likely is laden with Google Ads, it's
             | in their interest to show you that garbage and not what
             | you're actually looking for.
        
       | lukasb wrote:
       | Excuse me while I get on a hobbyhorse - I would love to use a
       | web search engine that lets me boost PageRank for certain
       | sites (which would then carry over to the sites they link to).
       | It could automatically boost PageRank for sites I subscribe
       | to, for example. Expensive in terms of computation or storage?
       | Charge me!
        
         | marginalia_nu wrote:
         | That's easy to do (it's Personalized PageRank), but VERY
         | expensive. Like just tossing them a few dollars doesn't cut it.
         | You basically need your own custom index for that, as the way
         | you achieve fast ranking is by sorting the documents in order
         | of ranking within the index itself. That way you only need to
         | consider a very small portion of the index to retrieve the
         | highest ranking results.
         | 
         | You might get away with having like a custom micro-index where
         | your search basically does a hidden site:-search for your
         | favorite domain and related domains, but that's not quite going
         | to do what you want it to do.
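[Ed.: the Personalized PageRank idea discussed above can be sketched in a few lines. This is a minimal power-iteration toy; the graph, damping factor, and iteration count are illustrative choices, not anything Alexandria or marginalia.nu actually runs.]

```python
def personalized_pagerank(links, preferred, damping=0.85, iters=50):
    """links: dict node -> list of outbound nodes.
    preferred: nodes whose neighborhood the user wants boosted; all
    teleport probability is concentrated on them instead of spread
    uniformly over all pages, which is the 'personalized' part."""
    nodes = set(links) | {m for outs in links.values() for m in outs}
    teleport = {n: (1.0 / len(preferred) if n in preferred else 0.0)
                for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            outs = links.get(n, [])
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling page: hand its mass back to the teleport set
                for m in nodes:
                    new[m] += damping * rank[n] * teleport[m]
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = personalized_pagerank(graph, preferred={"a"})
# "d" sits outside the neighborhood that links back to "a", so it ends
# up with only residual rank.
```

The expensive part marginalia_nu describes is not this iteration itself but that every user's `preferred` set yields a different ranking, so the pre-sorted index trick no longer works.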
        
           | lukasb wrote:
           | So uh ... 100 petabytes, then.
           | 
           | Ah.
        
             | marginalia_nu wrote:
              | Realistically you could probably get away with something
              | like a couple of terabytes, and then default to the
              | regular index if a result isn't found in the
              | neighborhood close to your favored sites. But that's
              | still anything but cheap, especially since this can't be
              | some slow-ass S3 storage; it should ideally be SSDs or a
              | RAID/JBOD configuration of mechanical drives. That means
              | you're also paying for a lot of I/O bandwidth and
              | overall data logistics.
             | 
             | If you try to rent that sort of compute, you're probably
             | looking at like $100-200/month.
        
       | wilg wrote:
       | The cursor moves back to the beginning of the search box after
       | you search, which makes it very hard to follow up on queries.
        
       | julienreszka wrote:
       | For the same query, the number of results varies dramatically.
       | So weird.
        
       | devmunchies wrote:
       | The initial commit was 11 months ago and written in C++. I
       | haven't done C++ since college ~7 years ago. Is it a good
       | language for greenfield projects these days, or would something
       | like Go or Rust (or Crystal, Nim, Zig) be better for
       | maintainability and acquiring contributors?
        
         | marginalia_nu wrote:
         | The main thing I'd worry about with C++ is heap
         | fragmentation over time. Something like TCMalloc or jemalloc
         | might help a bit, but it's hard to get around when doing
         | this type of thing in C++.
        
         | extheat wrote:
         | I don't think the problem is a shortage of developers, much
         | less the language. There's a shortage of people with
         | experience working with search engines and the algorithms
         | needed to make them work reliably as intended.
        
       | 51Cards wrote:
       | So, interesting thing: when I visit this site for the first
       | time (in Firefox), why is the search box showing a drop-down
       | with a bunch of my previous searches? I can't tell where they
       | are from, but it is all stuff I have searched for in the past.
       | I thought it might be the browser populating the list, but
       | that should be limited to the same domain. So where is it
       | pulling this from? Some of the search terms are months,
       | perhaps more than a year, old.
        
       | pmontra wrote:
       | I'll use it for general searches over the next few days,
       | because that's the only way to do a fair evaluation.
       | 
       | I just searched for "python3 join string" and didn't get the
       | Python docs on the first page. Both DDG and Google have them
       | at position 9, which is way too low. At least I got a
       | different set of random websites and not the usual
       | tutorialspoint, w3schools, geeksforgeeks, etc. that I usually
       | see in these cases.
        
       | zeruch wrote:
       | I was amused by the name (beyond the obvious reference to the
       | ancient library, it also gender-switches a heteronym of
       | Fernando Pessoa, Alexander Search).
       | 
       | https://www.brown.edu/Departments/Portuguese_Brazilian_Studi...
        
       | zander312 wrote:
       | getting a 502...
        
       | pmarreck wrote:
       | This actually makes me want to build my own web crawler and
       | search engine.
        
         | marginalia_nu wrote:
         | Do try, it's a very interesting problem domain.
        
         | josefcullhed wrote:
         | Founder here,
         | 
         | I suggest you start by not implementing a crawler but using
         | commoncrawl.org instead. The problem with starting a web
         | crawler is that you will need a lot of money, and almost all
         | big websites are behind Cloudflare, so you will be blocked
         | pretty quickly. Crawling is a big issue, and most of the
         | issues are non-technical.
        
           | Seirdy wrote:
           | I've heard from other people who run engines (Right Dao,
           | Gigablast) that this is a major problem; Common Crawl does
           | look helpful, but it's not continuously updated. FWIW, Right
           | Dao uses Wikipedia as a starting point for crawling. Kiwix
           | makes pre-indexed dumps of Wikipedia, StackExchange, and
           | other sites available.
           | 
           | Some sort of partnership between crawlers could go a long
           | way. Have you considered contributing content back towards
           | the Common Crawl?
        
       | hadjian wrote:
       | This is really lovely. I think the search results are useful
       | if you're looking for more static information. Very pleasant
       | to see only results.
       | 
       | I think I found a minor bug while (of course) searching for my
       | homepage:
       | 
       | https://www.alexandria.org/?c=&r=&q=www.hadjian.com
       | 
       | The status line below the search box says that it found many
       | results, but the results are empty. Also, when hitting F5 a
       | couple of times, the number jumps around.
       | 
       | Keep up the great work. I think there is a lot of potential in
       | Common Crawl and things built on top of it.
        
       | qumpis wrote:
       | Is there a website that pools results from multiple search
       | engines?
        
         | Seirdy wrote:
         | SearX and Searxng are the most common options, but instances
         | often get blocked by the engines they use. Users need to switch
         | between instances quite often.
         | 
         | eTools.ch uses commercial APIs so it doesn't get blocked, but
         | it might block you instead (very sensitive bot detection).
         | 
         | Dogpile is one of the older metasearch engines, but I think it
         | only uses Bing- and Google-powered engines.
        
         | mikkom wrote:
         | Search engines typically prohibit this kind of usage via
         | their TOS.
        
         | marginalia_nu wrote:
         | SearX?
        
       | dimitar wrote:
       | Unfortunately, it seems it doesn't support Cyrillic or
       | Bulgarian well. I searched for the mayor of the city I live
       | in and got 5 results, all irrelevant. Unfortunately, the
       | experience in 'minor' languages is consistently bad in all
       | alternative search engines.
        
         | tandr wrote:
         | We don't know the resources behind this project. But even
         | if they are substantial, they still have to start small. It
         | will come; give it time.
        
         | [deleted]
        
       | endisneigh wrote:
       | https://www.alexandria.org/?c=&r=&q=SPY+current+value
       | 
       | https://www.google.com/search?q=SPY+current+value&rlz=1C1ONG...
       | 
       | https://www.alexandria.org/?c=&r=&q=kggle
       | 
       | https://www.google.com/search?q=kggle&rlz=1C1ONGR_enUS974US9...
       | 
       | Search is hard.
        
         | [deleted]
        
         | marginalia_nu wrote:
         | I'd argue "kggle" should surface this result:
         | 
         | https://stackoverflow.com/questions/44077294/encounter-this-...
        
       | glitcher wrote:
       | Really like the minimal UI and the speed! Great work.
       | 
       | A few of my test searches came up with very useful results.
       | However, one disappointment was searching for a javascript
       | function, for example "javascript array splice", and the MDN site
       | was not in the results. Adding "MDN" or "Mozilla" to the search
       | did not help either.
        
       | andreygrehov wrote:
       | How does it work? The GitHub page is not very descriptive. I
       | tried to search "Putin" and the first link is the NYTimes
       | homepage. Does that mean NYTimes covers the war more than the
       | other publications, or is it backlink-driven?
        
         | throwra620 wrote:
        
       | jstx1 wrote:
       | Randomly picking a search that I needed for work today -
       | searching for "pandas order by list" says that it has 44 results
       | and it shows only 3:
       | 
       | - a Github issue for dask
       | 
       | - an article about panda populations
       | 
       | - some coronavirus article that happens to have an unrelated
       | snippet of pandas code
       | 
       | Google obviously picks the relevant stackoverflow thread as the
       | first response.
        
       | moonshinefe wrote:
       | Unfortunately it seems down for me right now.
        
         | yosito wrote:
         | Yep, I'm getting a 502
        
       | potatoman22 wrote:
       | It doesn't work well for programming queries :(
        
         | xerox13ster wrote:
         | I searched fs js and the nodejs.org documentation was the first
         | result.
        
       | hunter2_ wrote:
       | The privacy settings are defaulting to unchecked, but the
       | description above them suggests that they default to checked.
       | This makes me wonder how the settings are actually being
       | interpreted (i.e., what the actual initial state is).
        
       | Rich_Morin wrote:
       | I just tried out this search engine and was very favorably
       | impressed. It was quite responsive (though that could be affected
       | by demand) and gave good results. I really like the lack of goo
       | (e.g., ads) and the spare, clean presentation. I think it might
       | be a great search engine for visually disabled users who rely on
       | screen readers.
        
       | josefcullhed wrote:
       | Hello,
       | 
       | My name is Josef Cullhed. I am the programmer of
       | alexandria.org and one of its two founders. We want to build
       | an open-source, non-profit search engine; right now we are
       | developing it in our spare time and funding the servers
       | ourselves. We are indexing Common Crawl, and the search engine
       | is at a really early stage.
       | 
       | We would be super happy to find more developers who want to help
       | us.
        
         | phrozbug wrote:
         | What will be the USP that makes it the success we are all
         | waiting for? At the moment I'm switching between DDG and
         | Google.
        
           | josefcullhed wrote:
           | I just think that the timing is right. I think we are at
           | a point in time where it does not cost billions of dollars
           | to build a search engine like it did 20 years ago. The
           | relevant parts of the internet are probably shrinking, and
           | Moore's Law is making computing exponentially cheaper, so
           | there has to be an inflection point somewhere.
           | 
           | We hope we can become a useful search engine powered by open
           | source and donations instead of ads.
        
         | kreeben wrote:
         | Thanks for sharing this with the world. Did you manage to
         | include all of a common crawl in an index? How long did that
         | take you to produce such an index? Is your index in-memory or
         | on disk?
         | 
         | I'd consider contributing. Seems you have something here.
        
           | josefcullhed wrote:
           | The index we are running right now contains all URLs in
           | commoncrawl from 2021, but only URLs with direct links to
           | them. This is mostly because we would need more servers to
           | index more URLs, and that would increase the cost.
           | 
           | It takes us a couple of days to build the index, but we
           | have been coding this for about a year.
           | 
           | All the indexes are on disk.
        
             | kreeben wrote:
             | >> All the indexes are on disk.
             | 
             | Love it. Makes for a cheaper infrastructure, since SSD is
             | cheaper than RAM.
             | 
             | >> It takes us a couple of days to build the index
             | 
             | It's hard for me to see how that could be done much faster
             | unless you find a way to parallelize the process, which in
             | itself is a terrifyingly hard problem.
             | 
             | I haven't read your code yet, obviously, but could you give
             | us a hint as to what kind of data structure you use for
             | indexing? According to you, what kind of data structure
             | allows for the fastest indexing and how do you represent it
             | on disk so that you can read your on-disk index in a
             | forward-only mode or "as fast as possible"?
        
               | josefcullhed wrote:
               | Yes it would be impossible to keep the index in RAM.
               | 
               | >> It's hard for me to see how that could be done much
               | faster unless you find a way to parallelize the process
               | 
                | We actually parallelize the process. We do it by
                | splitting the URLs across three different servers and
                | indexing them separately. Then we just run the
                | searches on all three servers and merge the resulting
                | URLs.
               | 
               | >> I haven't read your code yet, obviously, but could you
               | give us a hint as to what kind of data structure you use
               | for indexing?
               | 
                | It is not very complicated; we use hashes a lot to
                | simplify things. The index is basically a really
                | large hash table mapping word_hash -> [list of URL
                | hashes]. Then if you search for "The lazy fox" we
                | just take the intersection of the three lists of URL
                | hashes to get all the URLs which have all the words
                | in them. This is the basic idea implemented right
                | now, but we will of course try to improve it.
               | 
               | details are here: https://github.com/alexandria-
               | org/alexandria/blob/main/src/i...
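[Ed.: the layout described above can be sketched as a toy in-memory version. The hashing scheme and class below are illustrative stand-ins, not Alexandria's actual on-disk format.]

```python
import hashlib
from collections import defaultdict

def h(s):
    """Stable 64-bit hash of a string (stand-in for the real scheme)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word_hash -> {url_hash, ...}
        self.forward = {}                 # url_hash -> url, for display

    def add(self, url, text):
        uh = h(url)
        self.forward[uh] = url
        for word in text.lower().split():
            self.postings[h(word)].add(uh)

    def search(self, query):
        # One postings list per query word, then intersect: only URLs
        # containing every word survive.
        lists = [self.postings.get(h(w), set())
                 for w in query.lower().split()]
        if not lists:
            return []
        hits = set.intersection(*lists)
        return [self.forward[uh] for uh in hits]

idx = TinyIndex()
idx.add("https://example.com/fox", "the quick lazy fox")
idx.add("https://example.com/dog", "the lazy dog")
print(idx.search("the lazy fox"))  # only the fox page has all three words
```

The real system keeps the postings sorted by score on disk so that only a small prefix of each list needs to be read, which this sketch ignores.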
        
               | josefcullhed wrote:
               | We are currently just doing an intersection and then we
               | make a lookup in a forward index to get the urls, titles
               | and snippets.
               | 
               | I actually don't know what roaring bitmaps are, please
               | enlighten me :)
        
               | kreeben wrote:
               | If you are solely supporting union or solely supporting
               | intersection then roaring bitmaps is probably not a
               | perfect solution to any of your problems.
               | 
               | There are some algorithms that have been optimized for
               | intersect, union, remove (OR, AND, NOT) that work
               | extremely well for sorted lists but the problem is
               | usually: how to efficiently sort the lists that you wish
               | to perform boolean operations on, so that you can then
               | apply the roaring bitmap algorithms on them.
               | 
               | https://roaringbitmap.org/
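[Ed.: the AND step that roaring bitmaps accelerate is, at its core, an intersection of sorted posting lists. A plain merge-based version, for illustration only, looks like this.]

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of doc ids in O(len(a) + len(b)).
    Roaring bitmaps speed up the same operation by storing ids in
    compressed per-range containers instead of flat lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 9, 12], [3, 4, 9, 12, 20]))  # [3, 9, 12]
```

As kreeben notes, the catch is keeping the lists sorted by the same key in the first place; the merge itself is the easy part.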
        
               | kreeben wrote:
               | I realize I'm asking for a free ride here, but could you
               | explain what happens after the index scan? In a phrase
               | search you'd need to intersect, union or remove from the
               | results. Are you using roaring bitmaps or something
               | similar?
        
         | badrabbit wrote:
         | The UI is amazing. Don't change it significantly!
        
           | [deleted]
        
         | cocoafleck wrote:
         | I was trying to learn more about the ranking algorithm that
         | Alexandria uses, and I was a bit confused by the documentation
         | on Github for it. Would I be correct in that it uses "Harmonic
         | Centrality"
         | (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf)
         | at least for part of the algorithm?
        
           | josefcullhed wrote:
           | Hi,
           | 
            | Yes, our documentation is probably pretty confusing. It
            | works like this: the base score for all URLs on a
            | specific domain is the harmonic centrality (HC). Then we
            | have two indexes, one with URLs and one with links (we
            | index the link text). We first run a search on the links,
            | then on the URLs. We then update the score of the URLs
            | based on the links with this formula: domain_score =
            | expm1(5 * link.m_score) + 0.1; url_score = expm1(10 *
            | link.m_score) + 0.1;
            | 
            | then we add the domain and URL score to url.m_score,
            | 
            | where link.m_score is the HC of the source domain.
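[Ed.: the boost formula quoted above, transcribed directly into Python. `link_scores` here is a hypothetical list of harmonic-centrality values for the domains linking to a URL; the surrounding function is illustrative, not Alexandria's code.]

```python
from math import expm1

def boosted_score(base_score, link_scores):
    """base_score: the URL's domain harmonic centrality (HC).
    link_scores: HC of each source domain linking to this URL.
    Applies the quoted formula per inbound link and accumulates."""
    score = base_score
    for hc in link_scores:
        domain_score = expm1(5 * hc) + 0.1   # expm1(x) = e**x - 1
        url_score = expm1(10 * hc) + 0.1
        score += domain_score + url_score
    return score

# A URL linked from a higher-centrality domain outranks one with the
# same base score linked from a lower-centrality domain.
print(boosted_score(0.5, [0.4]) > boosted_score(0.5, [0.1]))  # True
```

The exponential means a link from one very central domain is worth far more than several links from obscure ones.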
        
             | jll29 wrote:
             | The main scoring function seems to be
             | index_builder<data_record>::calculate_score_for_record() in
             | line 296 of https://github.com/alexandria-
             | org/alexandria/blob/main/src/i..., and it mentions support
              | for BM25 (Robertson and Sparck Jones, 1976) and TF-IDF
              | (Sparck Jones, 1972) term weighting, pointing to the
             | respective Wikipedia pages.
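[Ed.: for reference, a compact sketch of the standard BM25 formula, not Alexandria's implementation; `k1` and `b` are the commonly used default parameters, and the non-negative IDF variant is one of several in circulation.]

```python
import math

def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query, Okapi BM25 style.
    corpus is the list of all tokenized documents (needed for document
    frequencies and average document length)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                        # term frequency
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # Saturating tf component, normalized by document length:
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["lazy", "fox"], ["lazy", "dog"], ["lazy", "quick"]]
# "fox" is rarer than "lazy" in this corpus, so it contributes more
# to the first document's score.
print(bm25(["fox"], corpus[0], corpus) > bm25(["lazy"], corpus[0], corpus))  # True
```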
        
               | josefcullhed wrote:
               | This is actually not used yet. Working on implementing
               | that as a factor.
        
       | strongpigeon wrote:
       | Slightly tangential, but does anyone know if there is a way to
       | submit links to the Common Crawl (which Alexandria Search relies
       | on)? I haven't seen any traffic from CCBot and my site doesn't
       | seem to show up in Alexandria's results (compared to 2nd/3rd on
       | Google for a bunch of queries).
        
         | kreeben wrote:
         | You can verify whether or not your site exists in the CC data
         | set by searching for it here: https://index.commoncrawl.org/
        
           | strongpigeon wrote:
           | Thanks for that! It does look like it's in there and got
           | crawled in January. I probably didn't search back far enough
           | in my logs...
        
       | unmole wrote:
       | I'm getting a 502 :(
        
       | byteski wrote:
       | ive found several search engines/services besides G and ddg and
       | the only thing that i cant figure out is do these search services
       | have seo technique or its just random list of all resources? i
       | mean how does it order search results
        
       | outcoldman wrote:
       | Tried a few searches.
       | 
       | https://www.alexandria.org/?c=&r=&q=real+estates+puerto+esco... -
       | 3 results only :( If you correct it to "real estate puerto
       | Escondido" - that works better
       | https://www.alexandria.org/?c=&r=&q=real+estate+puerto+escon...
       | 
       | A lot to improve, but a good start.
        
       ___________________________________________________________________
       (page generated 2022-03-18 23:00 UTC)