[HN Gopher] Alexandria Search
___________________________________________________________________
Alexandria Search
Author : nafnlj
Score : 162 points
Date : 2022-03-18 16:54 UTC (6 hours ago)
(HTM) web link (www.alexandria.org)
(TXT) w3m dump (www.alexandria.org)
| drcongo wrote:
| > _Found 0 results in 0.05s_
|
| I have a test search string that I use to try out search engines,
| this one didn't do very well.
| reaperducer wrote:
| Unless you were testing for speed!
| doctor_eval wrote:
| Negative tests are still tests!
| blinding-streak wrote:
| Does this support phrase match? (i.e. "a query in quotes"). A few
| tries seem to show that it doesn't. Or, if it does, its corpus
| is tiny.
| Xeoncross wrote:
| Thank you for building and sharing this. While many people
| rightly point out that this isn't a replacement for Google yet,
| the value of a shared working open source code base has been
| underestimated many times in the past.
|
| I hope this is a project that grows to solve real needs in this
| space. However, even if it never makes it past this point, there
| is a chance someone will be inspired by this to construct their
| own version. Maybe in a different language with a different
| storage format or a different way of ranking results.
|
| Thank you for sharing your work.
| 0xbadcafebee wrote:
| Why don't search engines have filters? Every single consumer
| retail website's search uses filters to help shoppers find
| something to buy. It is way more convenient than hoping the user
| can guess the magic search phrase to find the thing they're
| looking for (if they even know what that thing is).
| kreeben wrote:
| I love the shortcut Alexandria takes by indexing Common Crawl
| instead of crawling the web themselves. It's how I would have
| bootstrapped a new search engine. In a future iteration they can
| start crawling themselves, if there is sufficient interest from
| the public.
|
| Searching is screamingly fast.
|
| The index seems stale, though. Alexandria, how old is your index?
|
| How long did it take you to create your current index? Is that
| your bottleneck, perhaps, that it takes you a long time (and lots
| of money?) to create a Common Crawl index?
| joshuamorton wrote:
| > The index seems stale, though. Alexandria, how old is your
| index?
|
| Common Crawl indexes about once every 40 days; the current
| crawl's data runs through January 2022, so it's 1.5 months old
| at best.
| version_five wrote:
| There have been a few search engines out recently. I'm curious
| how people evaluate them quickly.
|
| I've realized my searching is basically optimized for google and
| the web that has grown up around it. Also, in 1998 I wasn't as
| aware of what was out there as I am now. It's pretty rare (even
| if its possible) that I do a search and come across a completely
| new site that I haven't heard of before, for anything nontrivial.
| That was different when search began.
|
| Google is now almost a convenience. If I have a coding question,
| I search for "turn list of tensors into tensor" or whatever, but
| I'm really looking for SO or the pytorch documentation, and I'll
| ignore the geeksforgeeks and other SEO spam that finds its way
| in. It's almost like Google is a statistical "portal" page, like
| Yahoo or one of those old cluttered sites used to be, that lets
| me quickly get through the menus by just searching. That's
| different from a blank-slate search like we might have done 25
| years ago.
|
| I think what's really lacking now is uncorrupted search for
| anything that can be monetized. Like I tried to search for a
| frying pan once on google and it was unusable. I'm not sure any
| better search engine can fix that, that's why everyone appends
| "reddit" to queries they are looking for a real opinion on,
| again, because they are optimizing for the current state of the
| web.
|
| Anyway, all that to say I think there are a lot of problems with
| (Google-dominated) search, but they are basically reflected in
| the current web overall, so just a better search engine, outside
| of stripping out the ads, can only do so much. Real improved
| search efforts need to somehow change the content that's out
| there at the same time as they improve the experience, and let
| us know how to, in a simple way, get the most out of it. I think
| Google has a much deeper moat than most people realize.
| chris123 wrote:
| I'd love to be able to add a tag to a search to have it exclude
| sites with any kind of monetization. I know that's not
| realistic, because that's where Google makes most of its money
| (or do they make most of their money somewhere other than
| advertising these days?). Anyway, yeah, I'm sick of SEO-
| optimized, click-optimized, advertising-optimized, affiliate-
| link-optimized crap.
| demopathos wrote:
| I'm surprised you are not a fan of geeksforgeeks. While each of
| their webpages has substantially less content than the pytorch
| docs or SO result, I find that they get to the point instantly.
| My mean time to solution from G4G is definitely shorter than
| from SO.
| Seirdy wrote:
| I generally find that sites like SO, GFG, etc. often play the
| role of "Reading the Docs as a Service". I prefer using them
| only after official documentation or specifications fail me.
| When I want an opinionated answer, I just ping some people I
| already know or check the personal websites of the developers
| of the language/tool I'm using. If I have further questions,
| I check the relevant IRC channel. Sites like SO are a "last
| resort" for me.
|
| In other words, I'd rather see these at the bottom of the
| SERP than the top, but I wouldn't want to completely
| eliminate them.
| version_five wrote:
| I guess everyone has their go-to sites and their pet peeves.
| Geeksforgeeks may be less spammy than some, but I still think
| of it as that annoying site that got in the way of either the
| SO or documentation answer that I was looking for.
|
| Just to expand, if I want the api reference, say I search for
| defaultdict (for some reason I like using them but always
| have to look at the reference), I want the python
| documentation. I definitely don't want a third party telling
| me about it.
|
| And if I search a "make list of tensor into tensor" type
| question, I want SO where someone had asked the same question
| and got "tensor.stack" as the reply, so I can understand the
| answer and follow up by looking at the tensor.stack pytorch
| reference if I want.
|
| Anything else is wasting my time, I think most users with
| similarly specific queries are not looking for tutorials,
| they are looking for the names of functions they hypothesize
| exist, or references. That's why intermediary sites that try
| to give an explanation are annoying, at least for me.
| RosanaAnaDana wrote:
| Geeksforgeeks is toxic garbage poisoning the well of good
| solutions for common problems.
| _bohm wrote:
| I've found that their content is often inaccurate or written
| by people who come across as novices. I actually emailed them
| to correct an inaccuracy in one of their articles once, which
| they did, so kudos to them for that.
| outcoldman wrote:
| > I've realized my searching is basically optimized for google
|
| Is it just me, or does Google not provide good results for me
| anymore?
|
| Every time I search something completely outside of my
| knowledge, like "How to purchase a property in Mexico", it will
| give me 100+ results with autogenerated content like "10 best
| places to buy property in Mexico". And the only way to fix that
| would be to add something like
| `site:reddit.com`
| pygar wrote:
| > Is it just me, or does Google not provide good results for me
| anymore?
|
| I am starting to suspect that there might be nothing to find.
|
| I just don't think people (other than the tech-oriented) are
| creating websites and running forums - and why would they?
| Reddit might be the only place you _can_ find that type of
| content. What should search engines do then?
|
| With a tiny number of exceptions, it might be that people
| chat on reddit, read Wikipedia, ask questions on the
| stackexchange network/Quora, local communities use facebook
| groups, and businesses have a wordpress site with nothing
| more than a bit of fluff, a phone number and an email
| address.
| daptaq wrote:
| Might be an instance of Goodhart's law:
| https://en.m.wikipedia.org/wiki/Goodhart's_law
|
| If all websites optimise for SEO, they undermine the assumption
| that ranking well in a search engine is purely a consequence of
| how well a site satisfies a query.
| vishnugupta wrote:
| > I'm curious how people evaluate them quickly.
|
| Speaking for myself: I migrated cold turkey to DDG ~2 months
| ago. So far I've had to resort to Google search 10 times or so.
|
| One thing I miss though is Google's nice visualisation of fast
| changing results e.g., match scores. For example:
| https://imgur.com/a/Q5nZkjo
| Seirdy wrote:
| DDG's organic link results are from Bing, sans
| personalization. DuckDuckGo advertises using "over 400
| sources", which means that at least 399 sources only power
| infoboxes ("instant answers") and non-generalist search, such
| as the Video search.
| ffhhj wrote:
| > There have been a few search engines out recently
|
| I'd like to try them out, could you mention which?
| version_five wrote:
| you.com and kagi.com off the top of my head
| Seirdy wrote:
| I've been keeping my eye on You.com, tracking a few SERPs
| over time compared to other Bing- and Google-based engines.
| So far, the results don't seem independent.
|
| Try comparing results with a Bing-based engine (e.g.
| DuckDuckGo) or a Google-based one (e.g. StartPage, GMX) to
| see if they differ. (Don't use Google or Bing directly,
| since results will be personalized based on factors like
| location, device, your fingerprint, etc.).
| Seirdy wrote:
| I listed a bunch over at
| https://seirdy.one/2021/03/10/search-engines-with-own-
| indexe..., and I'm always adding more.
|
| I first discovered Alexandria in early February: https://git.
| sr.ht/~seirdy/seirdy.one/commit/935b55f10f9024ee...
|
| Around the same time, I also discovered sengine.info, Artado,
| Entfer, and Siik. By sheer coincidence they all were
| mentioned to me or decided to crawl my site within the same
| couple weeks. So yes, from my perspective there have been
| more than a few new smaller engines getting active on the
| heels of bigger names like Neeva, Kagi, Brave Search, etc.
| lolinder wrote:
| I've been using the Kagi beta for a few months now, and it's
| awesome: https://kagi.com/
|
| The biggest thing I've found is that when doing technical
| searches it always turns up the sources I'm actually looking
| for, actively filtering all the GitHub/StackOverflow copycat
| sites.
|
| It also seems to up-weight official docs compared to Google.
| For example, "how to read a json file in python" turns up the
| Python docs as the second result, where in Google they're
| nowhere to be found.
| amelius wrote:
| > I'm curious how people evaluate them quickly.
|
| Are there search benchmarks to be found somewhere?
|
| There must be. If you want to write a search engine, you need a
| way to validate the results.
| marginalia_nu wrote:
| There are benchmarks within the adjacent field of information
| retrieval, but in general it's hard to properly validate a
| search engine because real data is so noisy and misbehaved,
| and sample data is so different from real data.
| kreeben wrote:
| Sure, the problem of information retrieval is not exactly
| that of web search but they're pretty close. So, from such
| a knowledgeable person such as yourself, when it comes to
| this topic, could you remind us, what are some of those
| benchmarks?
| marginalia_nu wrote:
| https://en.m.wikipedia.org/wiki/Precision_and_recall
|
| For some standard corpus.
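|
| To make the metrics concrete, a toy Python sketch (the result
| list and relevance judgements below are made up, just to show
| the two formulas):
|
|     retrieved = ["doc3", "doc7", "doc1", "doc9"]  # engine's top results
|     relevant = {"doc1", "doc2", "doc3"}           # human judgements
|
|     hits = [d for d in retrieved if d in relevant]
|     precision = len(hits) / len(retrieved)  # 2/4 = 0.5
|     recall = len(hits) / len(relevant)      # 2/3 ~= 0.67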
| kreeben wrote:
| >> Precision and recall
|
| The phrase I was looking for. Thx a bunch! Gonna
| marginalia that now.
| marginalia_nu wrote:
| Haha, ironically it lacks both precision and recall for
| that topic.
| jll29 wrote:
| The "Web track" task at the annual US NIST TREC conference (
| https://trec.nist.gov/ ) is an open innovation benchmark that
| everyone can contribute; participants get a set of queries
| that they have to run on exactly the same corpus. Then they
| return the top-k results to a team that evaluates them.
|
| Here is an example (2014) Web track paper from the 23rd TREC:
| https://trec.nist.gov/pubs/trec23/papers/overview-web.pdf
| (TREC has a plethora of different benchmark tasks and you
| can submit your own:
| https://trec.nist.gov/pubs/trec29/trec2020.html - recent TREC
| 2020 papers)
| zeta0134 wrote:
| I've been wondering for a while now about building a search
| engine for the ad free web. That is, penalize or outright
| refuse to index any recognized advertising network, letting
| through only those sites which don't perform invasive tracking
| with third party services. Mostly as a curiosity: what would be
| left? What would rise to the top when you filter all of that
| out?
| randomsilence wrote:
| Check the 'Small or non-commercial Web' search engines on
| this overview page: https://seirdy.one/2021/03/10/search-
| engines-with-own-indexe...
| not2b wrote:
| Wikipedia.
| not2b wrote:
| Evidently someone disagrees, but no ads or trackers except
| on the home page and its pages rank highly on current
| search engines, so if you exclude trackers and ads that's
| what you're going to get.
| version_five wrote:
| I've thought about something similar, basically "the good
| internet" that would be a hand curated list of sites that are
| not there just as a pretext for ads. I think a lot of
| software project documentation qualifies, lots of stuff on
| university sites like lecture notes, for example. I assume
| that across different niches there is other stuff like this.
| I think the key would be something that can't be gamed, like
| it has to be legitimate content that is online for an
| existing purpose and not as a pretext.
| danuker wrote:
| > because they are optimizing for the current state of the web.
|
| I believe people will at least start looking for alternatives.
| For example, I have been collecting search engines, and
| whenever I encounter a page with too many commercial-laden SEO-
| porked results, I use a different search engine in Firefox.
|
| I have enabled the Search Bar: I can press Alt+D, Tab, Tab,
| enter my query, then click a different search engine, which
| searches instantly (unlike the main bar, where you have to
| press Enter once more after clicking).
|
| I just added this one also. See my collection:
| https://0bin.net/paste/ZSCRYVx1#sxD+jBIpScJismXBYwaoJPh75TH9...
| blewboarwastake wrote:
| Pro tip: Alt+E takes you directly to the search bar, then you
| can press Tab for selecting the search engine. The best part
| is that you never use the mouse this way. You can also use
| DDG bangs from the address bar (Alt+D); they cover pretty much
| every search engine/site, if you remember the bang for it.
| coverband wrote:
| Could you paste again as text links? Thanks.
| 8n4vidtmkvmk wrote:
| Maybe the solution is to make Google itself Reddit-style: let
| users downvote the SEO spam websites and allow them to be
| downranked. Sure, it opens the door for a different kind of
| abuse... but maybe that problem is more fixable?
| seltzered_ wrote:
| I feel like this was tried a decade ago: https://developers.g
| oogle.com/search/blog/2011/03/introducin... ("Introducing the
| +1 button", Google, 2011)
| karmab wrote:
| I sEeU
| Seirdy wrote:
| > I'm curious how people evaluate them quickly.
|
| To paint with a broad brush, I look at three criteria:
|
| 1. Infoboxes ("instant answers") should focus on site previews
| rather than trying to intelligently answer my question. Most
| DuckDuckGo infoboxes are good examples of this; Bing and Google
| ones are too "clever".
|
| 2. Organic results should be unique; most engines are powered
| by a commercial Bing API or use Google Custom Search. I
| described my methodology near the bottom of my collection of
| independent search engines:
| https://seirdy.one/2021/03/10/search-engines-with-own-indexe...
|
| 3. "other" stuff. Common features I find appealing include
| area-specific search (Kagi has a "non-commercial lens" mostly
| powered by its Teclis index; Brave is rolling out "goggles"),
| displaying additional info about each result (Marginalia and
| Kagi highlight results with heavy JS or tracking), user-driven
| SERP personalization (Neeva and Kagi allow promoting/demoting
| domains), etc.
|
| And always check privacy policies, TOS, GDPR/CCPA compliance,
| etc.
|
| > Google is now almost a convenience. If I have a coding
| question, I search for "turn list of tensors into tensor" or
| whatever but I'm really looking for SO or the pytorch
| documentation, and I'll ignore the geeksforgeeks and other SEO
| spam that finds its way in. It's almost like Google is a
| statistical "portal" page,
|
| I like engines like Neeva and Kagi that allow customizing SERPs
| by demoting irrelevant results; I demote crap like GFG,
| w3schools, tutorialspoint, dev(.)to, etc. and promote official
| documentation. Alternatively, you can use an adblocker to block
| results matching a pattern: https://reddit.com/hgqi5o
| Minor49er wrote:
| This is really fast and cool. Looking for music-related pages,
| I've already found some interesting websites, like Wall of
| Ambient, which caters to ambient labels
| (https://wallofambient.com/#)
|
| I noticed that if a term can't be found, there will be a random
| number of results that it says were found, but nothing is
| actually displayed. Eg:
| https://www.alexandria.org/?c=&r=&q=moonmusiq
|
| I'll keep trying this out. It seems really promising
| bghfm wrote:
| How can we help improve the project? Usage, feedback?
| rprenger wrote:
| For my first search of "GFlowNetworks" (which the search bar
| suggested), it said "Found 5,887 (or something) results" but
| showed no results.
|
| For my second I searched my name and got a Wikipedia article
| about a show I've never heard of which didn't have my name
| anywhere in it.
|
| For my third I searched "GFlowNetworks" again and it said Found
| 2,656,844 results in 1.61s, but showed no results again
| marginalia_nu wrote:
| I can't even find results for "GFlowNetworks" on google.
| waynecochran wrote:
| In a nutshell, what is the fundamental difference with this
| search engine compared to others?
| stazz1 wrote:
| >About Alexandria
|
| Alexandria.org is a non-profit, ad free search engine. Our goal
| is to provide the best available information without
| compromise.
|
| The index is built on data from Common Crawl and the engine is
| written in C++. The source code is available here.
|
| We are still at an early stage of development and running the
| search engine on a shoestring budget.
|
| Please contact us at -email- if you want to get involved, want
| to support this initiative or have any questions.
| waynecochran wrote:
| But what is different in terms of its indexing algorithm? The
| original secret sauce for Google was the PageRank algorithm,
| which was mathematically genius. Are you using a similar
| algorithm?
| josefcullhed wrote:
| Founder here. We are using harmonic centrality instead of
| pagerank. But of course much more work needs to be done to
| make the search engine usable.
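|
| For anyone unfamiliar: the harmonic centrality of a page is the
| sum of 1/d over all pages that can reach it, where d is the
| shortest link distance. A toy Python sketch of that definition
| (not our actual C++ code; the link graph is made up):
|
|     from collections import deque
|
|     def harmonic_centrality(graph, target):
|         # graph: dict mapping page -> list of pages it links to.
|         # BFS over the reversed graph, starting from target.
|         reverse = {n: [] for n in graph}
|         for src, dsts in graph.items():
|             for dst in dsts:
|                 reverse.setdefault(dst, []).append(src)
|         dist = {target: 0}
|         queue = deque([target])
|         while queue:
|             node = queue.popleft()
|             for prev in reverse.get(node, []):
|                 if prev not in dist:
|                     dist[prev] = dist[node] + 1
|                     queue.append(prev)
|         return sum(1.0 / d for d in dist.values() if d > 0)
|
|     links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
|     print(harmonic_centrality(links, "c"))  # 1 + 1/2 + 1 = 2.5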
| Minor49er wrote:
| Their About page has you covered:
|
| > Alexandria.org is a non-profit, ad free search engine. Our
| goal is to provide the best available information without
| compromise.
|
| > The index is built on data from Common Crawl and the engine
| is written in C++. The source code is available (at
| https://github.com/alexandria-org#).
|
| Edit: formatting
| greenyoda wrote:
| More about Common Crawl:
| https://en.wikipedia.org/wiki/Common_Crawl
|
| > _Common Crawl is a nonprofit 501(c)(3) organization that
| crawls the web and freely provides its archives and datasets
| to the public. Common Crawl 's web archive consists of
| petabytes of data collected since 2011. It completes crawls
| generally every month. ..._
| tandr wrote:
| I think the fact that after a long while there are new search
| engines (Kagi was introduced very recently on HN, now this)
| should be a wake-up call for Google - their search has been
| losing its shine for quite a while. Hopefully something will
| come out of this - competition is good.
| blinding-streak wrote:
| Competition is definitely good. But this thing is a toy
| compared to not just Google, but all the other major search
| players out there. Hopefully it will continue to advance.
| bmmayer1 wrote:
| My first search on Alexandria was "UTC time". Google gives me
| the current time in UTC, which is all I needed. Alexandria gave
| me...a lot of links to click to find what I'm looking for.
|
| Google search is a lot better than people give it credit for.
| aghilmort wrote:
| we're exploring adding instant answers in clean way at
| Breeze; leaning towards using open-source library &/or
| external API to compute vs. building in-house
|
| also adding premium tier that's alerts + ad free + feeling
| lucky that would take user to top result, which is a UTC
| page, re: https://breezethat.com/?q=UTC+time
| NoahTheDuke wrote:
| I just tried breezethat and had to scroll past 6 ads (two
| screenfuls on my iPhone 10) to see a single result. I know
| ads are necessary but this is punitive.
| aghilmort wrote:
| yes, mobile is awful right now; we've mostly fixed that
| issue on laptop / desktop
|
| 4 of the 6 are Google's and have to be included -- iterating
| some designs internally that refactor how they're presented on
| mobile
| s0rce wrote:
| I agree, Google's instant answers are quite good and have
| improved but actually searching to find a site seems to be
| getting worse and is riddled with paid sites on the top.
| rambambram wrote:
| This doesn't add up, at all. I have a clock on my computer.
| This new search engine doesn't function like a clock for you,
| so Google Search is better.
| boomboomsubban wrote:
| When I need a clock I'll be sure to use Google.
| tokai wrote:
| You still have the current UTC time in the top links. Google's
| knowledge graph is a part of the problem with their results.
| marginalia_nu wrote:
| Strictly speaking, that's more in the domain of BonziBuddy or
| Alexa than internet search engine.
|
| What Google arguably struggles with is surfacing relevant
| documents, that is... search.
| the-dude wrote:
| Kagi does this. I have switched 100% to Kagi, not affiliated.
| amelius wrote:
| At some point, AI and NLP and raw processing power will have
| progressed so much that "search" is not a problem anymore, and
| I think we're getting there. Google can up their game but it
| won't matter much. The only thing they have left is brand
| recognition.
| tbihl wrote:
| Google has necessarily arbitrary criteria by which pages are
| ranked. Because Google is _the_ game in town, anyone with a
| primary goal of driving traffic will pursue those metrics
| (i.e. SEO). To the extent that those criteria deviate even
| slightly from actual good results, large parts of the
| internet will dilute their content to pursue them, which both
| lowers their quality and further drives down the gems of the
| internet.
|
| The ranking would have to vary over an infinite spread of
| purposes for webpages, and it would have to converge almost
| perfectly to what is actually most helpful. Among all the
| technical problems, Google will not optimize correctly
| against ads for the same reason that websites trying to drum
| up affiliate purchases and ad revenue won't put content
| quality above SEO.
|
| When recipes return to having the recipe and ingredients
| first, followed by an optional life story, I'll revisit my
| assessment.
| jll29 wrote:
| Google Research is also one of the top (NLP|IR) R&D gigs in
| town - they developed BERT, a model that has redefined how
| NLP is done, and the paper describing it had already collected
| 800 citations by the time it was published, thanks to a
| pre-print spreading like wildfire.
|
| This technology is now part of Google search.
| orlp wrote:
| IMO search has had its goalpost moved. It used to be about
| scale, technical challenges, bandwidth, storage, etc. It is
| still about that, but a significantly harder challenge to
| solve has come up: searching in a malicious environment. SEO
| crap nowadays completely dominates search, Google has lost
| the war.
|
| Simply put, I believe that Google sucks at search, in the
| modern context. It is great at indexing, it has solved
| phenomenal technical challenges, but search it has not
| solved. Why do I have to write site:stackoverflow.com or
| site:reddit.com to skip the crap and go to actual content?
| Why can my brain detect blogspam garbage in 0.5 seconds of
| looking but billion dollar company Google will happily
| recommend it as the most relevant result above a legitimate
| website?
|
| I feel this 12 year old XKCD is still relevant:
| https://xkcd.com/810/ .
| new_guy wrote:
| > but billion dollar company Google will happily recommend
| it as the most relevant result above a legitimate website?
|
| Because the site most likely is laden with Google Ads, it's
| in their interest to show you that garbage and not what
| you're actually looking for.
| lukasb wrote:
| Excuse me while I get on a hobbyhorse - would love to use web
| search that lets me boost PageRank for certain sites (which then
| would carry over to sites they link to.) Could automatically
| boost PageRank for sites I subscribe to, for example. Expensive
| in terms of computation or storage? Charge me!
| marginalia_nu wrote:
| That's easy to do (it's Personalized PageRank), but VERY
| expensive. Like just tossing them a few dollars doesn't cut it.
| You basically need your own custom index for that, as the way
| you achieve fast ranking is by sorting the documents in order
| of ranking within the index itself. That way you only need to
| consider a very small portion of the index to retrieve the
| highest ranking results.
|
| You might get away with having like a custom micro-index where
| your search basically does a hidden site:-search for your
| favorite domain and related domains, but that's not quite going
| to do what you want it to do.
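|
| For illustration, a toy power-iteration sketch of what
| "personalized" means here (the graph is made up; real engines
| bake the ranking into the index rather than iterating at query
| time):
|
|     def personalized_pagerank(graph, favored, damping=0.85, iters=50):
|         # graph: page -> list of outlinks; favored: pages the user boosts.
|         # Teleportation mass goes only to the favored pages.
|         nodes = list(graph)
|         rank = {n: 1.0 / len(nodes) for n in nodes}
|         teleport = {n: 1.0 / len(favored) if n in favored else 0.0
|                     for n in nodes}
|         for _ in range(iters):
|             new = {n: (1 - damping) * teleport[n] for n in nodes}
|             for src, outlinks in graph.items():
|                 if not outlinks:  # dangling page: hand mass to favored pages
|                     for n in favored:
|                         new[n] += damping * rank[src] / len(favored)
|                     continue
|                 share = damping * rank[src] / len(outlinks)
|                 for dst in outlinks:
|                     new[dst] += share
|             rank = new
|         return rank
|
|     links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
|     print(personalized_pagerank(links, favored={"d"}))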
| lukasb wrote:
| So uh ... 100 petabytes, then.
|
| Ah.
| marginalia_nu wrote:
| Realistically you could probably get away with something
| like a couple of terabytes, and then default to the regular
| index if it isn't found in the neighborhood close to your
| favored sites, but that's still anything but cheap
| especially this can't be some slow-ass S3 storage, this
| storage should ideally be SSDs or a RAID/JBOD-configuration
| of mechanical drives. That means you're also paying for a
| lot of I/O bandwidth and overall data logistics.
|
| If you try to rent that sort of compute, you're probably
| looking at like $100-200/month.
| wilg wrote:
| The cursor moves back to the beginning of the search box after
| you search, which makes it very hard to follow up on queries.
| julienreszka wrote:
| For the same query, the number of results varies dramatically.
| So weird.
| devmunchies wrote:
| The initial commit was 11 months ago and written in C++. I
| haven't done C++ since college ~7 years ago. Is it a good
| language for greenfield projects these days, or would something
| like Go or Rust (or Crystal, Nim, Zig) be better for
| maintainability and acquiring contributors?
| marginalia_nu wrote:
| Main thing I'd worry about with C++ is heap fragmentation over
| time. Something like TCMalloc or JEMalloc might help a bit,
| but it's hard to get around when doing this type of thing in
| C++.
| extheat wrote:
| I don't think the problem is there's a shortage of developers.
| Much less the language. There's a shortage of people with the
| experience in working with search engines and the necessary
| algorithms to make them work as intended reliably.
| 51Cards wrote:
| So, interesting thing: when I visit this site for the first
| time (in Firefox), why is the search box showing a drop-down
| with a bunch of my previous searches? I can't tell where they
| are from, but it is all stuff I have searched for in the past.
| I thought it might be the browser populating a list, but that
| should be based on the same domain. So where is it pulling
| this from? Some of the search terms are months, perhaps more
| than a year, old.
| pmontra wrote:
| I'll use it for general searches over the next few days because
| it's the only way to do a fair evaluation.
|
| I just searched for python3 join string and I didn't get the
| Python docs in the first page. Both DDG and Google got them at
| position 9 which is way too low. At least I got a different set
| of random websites and not the usual tutorialpoint, w3schools,
| geeksforgeeks etc that I usually see in these cases.
| zeruch wrote:
| I was amused by the name (beyond the obvious reference to the
| ancient library, it also gender-switches on a heteronym of
| Fernando Pessoa, Alexander Search)
|
| https://www.brown.edu/Departments/Portuguese_Brazilian_Studi...
| zander312 wrote:
| getting a 502...
| pmarreck wrote:
| This actually makes me want to build my own web crawler and
| search
| marginalia_nu wrote:
| Do try, it's a very interesting problem domain.
| josefcullhed wrote:
| Founder here,
|
| I suggest you start by not implementing a crawler, but use
| commoncrawl.org instead. The problem with starting a web
| crawler is that you will need a lot of money, and almost all
| big websites are behind Cloudflare, so you will be blocked
| pretty quickly. Crawling is a big issue and most of the issues
| are non-technical.
| Seirdy wrote:
| I've heard from other people who run engines (Right Dao,
| Gigablast) that this is a major problem; Common Crawl does
| look helpful, but it's not continuously updated. FWIW, Right
| Dao uses Wikipedia as a starting point for crawling. Kiwix
| makes pre-indexed dumps of Wikipedia, StackExchange, and
| other sites available.
|
| Some sort of partnership between crawlers could go a long
| way. Have you considered contributing content back towards
| the Common Crawl?
| hadjian wrote:
| This is really lovely. I think the search results are useful if
| you're looking for more static information. Very pleasant to
| see only results.
|
| I think I found a minor bug while (of course) searching for my
| homepage:
|
| https://www.alexandria.org/?c=&r=&q=www.hadjian.com
|
| The status line below the search box says that it found many
| results, but the results are empty. Also, when hitting F5 a
| couple of times, the number jumps around.
|
| Keep up the great work. I think there is a lot of potential in
| Common Crawl and the things built on top of it.
| qumpis wrote:
| Is there a website that pools multiple search engines' results?
| Seirdy wrote:
| SearX and Searxng are the most common options, but instances
| often get blocked by the engines they use. Users need to switch
| between instances quite often.
|
| eTools.ch uses commercial APIs so it doesn't get blocked, but
| it might block you instead (very sensitive bot detection).
|
| Dogpile is one of the older metasearch engines, but I think it
| only uses Bing- and Google-powered engines.
| mikkom wrote:
| Search engines typically prohibit this kind of usage via their
| TOS.
| marginalia_nu wrote:
| SearX?
| dimitar wrote:
| Unfortunately it seems it doesn't support Cyrillic or Bulgarian
| well. I searched for the mayor of the city I live in and got 5
| results, all irrelevant. Unfortunately the experience in
| 'minor' languages is consistently bad in all alternative search
| engines.
| tandr wrote:
| We don't know the resources behind this project. But even if
| they are substantial, they still have to start small. It will
| come; give it time.
| [deleted]
| endisneigh wrote:
| https://www.alexandria.org/?c=&r=&q=SPY+current+value
|
| https://www.google.com/search?q=SPY+current+value&rlz=1C1ONG...
|
| https://www.alexandria.org/?c=&r=&q=kggle
|
| https://www.google.com/search?q=kggle&rlz=1C1ONGR_enUS974US9...
|
| Search is hard.
| [deleted]
| marginalia_nu wrote:
| I'd argue "kggle" should surface this result:
|
| https://stackoverflow.com/questions/44077294/encounter-this-...
| glitcher wrote:
| Really like the minimal UI and the speed! Great work.
|
| A few of my test searches came up with very useful results.
| However, one disappointment was searching for a javascript
| function, for example "javascript array splice", and the MDN site
| was not in the results. Adding "MDN" or "Mozilla" to the search
| did not help either.
| andreygrehov wrote:
| How does it work? The GitHub page is not very descriptive. I
| tried to search "Putin" and the first link is the NYTimes
| homepage. Does that mean NYTimes covers the war more than the
| other publications, or is it backlink-driven?
| throwra620 wrote:
| jstx1 wrote:
| Randomly picking a search that I needed for work today -
| searching for "pandas order by list" says that it has 44 results
| and it shows only 3:
|
| - a Github issue for dask
|
| - an article about panda populations
|
| - some coronavirus article that happens to have an unrelated
| snippet of pandas code
|
| Google obviously picks the relevant stackoverflow thread as the
| first response.
| moonshinefe wrote:
| Unfortunately it seems down for me right now.
| yosito wrote:
| Yep, I'm getting a 502
| potatoman22 wrote:
| It doesn't work well for programming queries :(
| xerox13ster wrote:
| I searched fs js and the nodejs.org documentation was the first
| result.
| hunter2_ wrote:
| The privacy settings are defaulting to unchecked, but the
| description above them suggests that they default to checked.
| This makes me wonder how the settings are actually being
| interpreted (i.e., what the actual initial state is).
| Rich_Morin wrote:
| I just tried out this search engine and was very favorably
| impressed. It was quite responsive (though that could be affected
| by demand) and gave good results. I really like the lack of goo
| (e.g., ads) and the spare, clean presentation. I think it might
| be a great search engine for visually disabled users who rely on
| screen readers.
| josefcullhed wrote:
| Hello,
|
| My name is Josef Cullhed. I am the programmer of alexandria.org
| and one of two founders. We want to build an open source and non
| profit search engine and right now we are developing in our spare
| time and are funding the servers ourselves. We are indexing
| commoncrawl and the search engine is in a really early stage.
|
| We would be super happy to find more developers who want to help
| us.
| phrozbug wrote:
| What will be the USP that makes it a success we are all waiting
| for? At the moment I'm switching between DDG & Google.
| josefcullhed wrote:
| I just think that the timing is right. I think we are in a
| spot in time where it does not cost billions of dollars to
| build a search engine like it did 20 years ago. The relevant
| parts of the internet are probably shrinking, and Moore's Law
| is making computing exponentially cheaper, so there has to be
| an inflection point somewhere.
|
| We hope we can become a useful search engine powered by open
| source and donations instead of ads.
| kreeben wrote:
| Thanks for sharing this with the world. Did you manage to
| include all of a common crawl in an index? How long did that
| take you to produce such an index? Is your index in-memory or
| on disk?
|
| I'd consider contributing. Seems you have something here.
| josefcullhed wrote:
| The index we are running right now covers all URLs in
| Common Crawl from 2021, but only URLs with direct links to
| them. This is mostly because we would need more servers to
| index more URLs and that would increase the cost.
|
| It takes us a couple of days to build the index but we have
| been coding this for about 1 year.
|
| All the indexes are on disk.
| kreeben wrote:
| >> All the indexes are on disk.
|
| Love it. Makes for a cheaper infrastructure, since SSD is
| cheaper than RAM.
|
| >> It takes us a couple of days to build the index
|
| It's hard for me to see how that could be done much faster
| unless you find a way to parallelize the process, which in
| itself is a terrifyingly hard problem.
|
| I haven't read your code yet, obviously, but could you give
| us a hint as to what kind of data structure you use for
| indexing? According to you, what kind of data structure
| allows for the fastest indexing and how do you represent it
| on disk so that you can read your on-disk index in a
| forward-only mode or "as fast as possible"?
| josefcullhed wrote:
| Yes it would be impossible to keep the index in RAM.
|
| >> It's hard for me to see how that could be done much
| faster unless you find a way to parallelize the process
|
| We actually parallelize the process. We do it by splitting the
| URLs across three different servers and indexing them
| separately. Then we just run the searches on all three servers
| and merge the result URLs.
|
| >> I haven't read your code yet, obviously, but could you
| give us a hint as to what kind of data structure you use
| for indexing?
|
| It is not very complicated; we use hashes a lot to
| simplify things. The index is basically a really large
| hash table mapping word_hash -> [list of url hashes].
| Then if you search for "The lazy fox" we just take the
| intersection between the three lists of url hashes to get
| all the urls which have all the words in them. This is the
| basic idea that is implemented right now, but we will of
| course try to improve it.
|
| details are here: https://github.com/alexandria-
| org/alexandria/blob/main/src/i...
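|
| To illustrate the idea, a toy in-memory Python sketch (not the
| actual C++ code; the hash choice and example pages are made
| up):
|
|     from hashlib import blake2b
|
|     def h(text):
|         # Stand-in for the hashing; the real hash function differs.
|         return blake2b(text.encode(), digest_size=8).hexdigest()
|
|     pages = {
|         "https://example.com/fox": "the quick brown fox jumps",
|         "https://example.com/dog": "the lazy dog sleeps",
|         "https://example.com/both": "the lazy fox naps",
|     }
|
|     index = {}         # word_hash -> set of url hashes
|     url_by_hash = {}   # forward index: url hash -> url
|     for url, text in pages.items():
|         url_by_hash[h(url)] = url
|         for word in text.split():
|             index.setdefault(h(word), set()).add(h(url))
|
|     def search(query):
|         postings = [index.get(h(w), set()) for w in query.lower().split()]
|         matches = set.intersection(*postings) if postings else set()
|         return [url_by_hash[u] for u in matches]
|
|     print(search("The lazy fox"))  # ['https://example.com/both']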
| josefcullhed wrote:
| We are currently just doing an intersection and then we
| make a lookup in a forward index to get the urls, titles
| and snippets.
|
| I actually don't know what roaring bitmaps are, please
| enlighten me :)
| kreeben wrote:
| If you are solely supporting union or solely supporting
| intersection then roaring bitmaps is probably not a
| perfect solution to any of your problems.
|
| There are some algorithms that have been optimized for
| intersect, union and remove (AND, OR, NOT) that work
| extremely well for sorted lists, but the problem is
| usually: how to efficiently sort the lists that you wish
| to perform boolean operations on, so that you can then
| apply the roaring bitmap algorithms to them.
|
| https://roaringbitmap.org/
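|
| The classic building block here is the two-pointer AND over
| sorted posting lists; this is the kind of operation roaring
| bitmaps (and galloping/skip variants) speed up. A toy Python
| sketch with made-up doc ids:
|
|     def intersect_sorted(a, b):
|         # Intersect two ascending lists of doc ids.
|         i = j = 0
|         out = []
|         while i < len(a) and j < len(b):
|             if a[i] == b[j]:
|                 out.append(a[i]); i += 1; j += 1
|             elif a[i] < b[j]:
|                 i += 1
|             else:
|                 j += 1
|         return out
|
|     print(intersect_sorted([2, 5, 9, 14, 20], [5, 9, 21]))  # [5, 9]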
| kreeben wrote:
| I realize I'm asking for a free ride here, but could you
| explain what happens after the index scan? In a phrase
| search you'd need to intersect, union or remove from the
| results. Are you using roaring bitmaps or something
| similar?
| badrabbit wrote:
| The UI is amazing. Don't change it significantly!
| [deleted]
| cocoafleck wrote:
| I was trying to learn more about the ranking algorithm that
| Alexandria uses, and I was a bit confused by the documentation
| on Github for it. Would I be correct in that it uses "Harmonic
| Centrality"
| (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf)
| at least for part of the algorithm?
| josefcullhed wrote:
| Hi,
|
| Yes, our documentation is probably pretty confusing. It works
| like this: the base score for all URLs on a specific domain
| is the harmonic centrality (HC). Then we have two indexes,
| one with URLs and one with links (we index the link text).
| We first make a search on the links, then on the URLs.
| We then update the score of the URLs based on the links with
| this formula: domain_score = expm1(5 * link.m_score) + 0.1;
| url_score = expm1(10 * link.m_score) + 0.1;
|
| then we add the domain and url score to url.m_score,
|
| where link.m_score is the HC of the source domain.
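|
| In Python, that update reads roughly like this (a sketch of the
| formula above, not the actual C++; the numbers are made up):
|
|     from math import expm1
|
|     def apply_link_boost(url_score, link_m_score):
|         # link_m_score is the harmonic centrality of the linking domain.
|         domain_score = expm1(5 * link_m_score) + 0.1
|         link_url_score = expm1(10 * link_m_score) + 0.1
|         return url_score + domain_score + link_url_score
|
|     # A URL whose domain has HC 0.3, boosted by a link from a
|     # domain with HC 0.2:
|     print(apply_link_boost(0.3, 0.2))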
| jll29 wrote:
| The main scoring function seems to be
| index_builder<data_record>::calculate_score_for_record() in
| line 296 of https://github.com/alexandria-
| org/alexandria/blob/main/src/i..., and it mentions support
| for BM25 (Robertson et al., 1994) and TF-IDF (Sparck Jones,
| 1972) term weighting, pointing to the respective Wikipedia
| pages.
| josefcullhed wrote:
| This is actually not used yet. Working on implementing
| that as a factor.
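|
| For reference, the textbook BM25 weight for one term in one
| document looks roughly like this (a generic sketch, not our
| code; the numbers are made up):
|
|     from math import log
|
|     def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs,
|                     k1=1.2, b=0.75):
|         # tf: term frequency in the doc; df: docs containing the term.
|         idf = log((num_docs - df + 0.5) / (df + 0.5) + 1)
|         norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
|         return idf * norm
|
|     # Term appears 3 times in a 120-word doc, in 50 of 10,000
|     # indexed docs, average doc length 200 words:
|     print(bm25_weight(tf=3, doc_len=120, avg_doc_len=200,
|                       df=50, num_docs=10_000))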
| strongpigeon wrote:
| Slightly tangential, but does anyone know if there is a way to
| submit links to the Common Crawl (which Alexandria Search relies
| on)? I haven't seen any traffic from CCBot and my site doesn't
| seem to show up in Alexandria's results (compared to 2nd/3rd on
| Google for a bunch of queries).
| kreeben wrote:
| You can verify whether or not your site exists in the CC data
| set by searching for it here: https://index.commoncrawl.org/
| strongpigeon wrote:
| Thanks for that! It does look like it's in there and got
| crawled in January. I probably didn't search back far enough
| in my logs...
| unmole wrote:
| I'm getting a 502 :(
| byteski wrote:
| I've found several search engines/services besides G and DDG,
| and the only thing that I can't figure out is: do these search
| services use SEO-style ranking, or is it just a random list of
| all resources? I mean, how do they order the search results?
| outcoldman wrote:
| Tried a few searches.
|
| https://www.alexandria.org/?c=&r=&q=real+estates+puerto+esco... -
| 3 results only :( If you correct it to "real estate puerto
| Escondido" - that works better
| https://www.alexandria.org/?c=&r=&q=real+estate+puerto+escon...
|
| A lot to improve. But a good start
___________________________________________________________________
(page generated 2022-03-18 23:00 UTC)