[HN Gopher] Internet Archive Scholar: Search Millions of Researc...
___________________________________________________________________
Internet Archive Scholar: Search Millions of Research Papers
Author : bnewbold
Score : 144 points
Date : 2021-03-09 18:06 UTC (4 hours ago)
(HTM) web link (blog.archive.org)
(TXT) w3m dump (blog.archive.org)
| nathias wrote:
| archive.org is really one of the few things still good on the
| internet. It has been invaluable for my studies; I can't imagine
| what the previous generations that could only access 5% of
| sources were even doing.
| 8bitsrule wrote:
| Oh yeah! Tried this on several specific topics I've looked at
| recently (2 years ago, 7ya, and 150ya) and the results were fast
| and on the mark. I'll certainly favor using Scholar over IA
| searches. Congratulations!
| BugsJustFindMe wrote:
| I couldn't find a list of what sources (like which journals)
| they're archiving from. Does anyone know where to find that? It
| would be nice to see what subject categories the archive covers.
| jahewson wrote:
| I took one look at that logo and concluded "this is not for me".
| throwaway8451 wrote:
| Here is an appropriate soundtrack for browsing the results:
|
| https://www.youtube.com/watch?v=x8gBfEDoEbY
| simonw wrote:
| I had the exact opposite reaction. That logo is fabulous.
| AnimalMuppet wrote:
| If you're going to judge it by the logo rather than by the
| search results, it almost certainly is not for you...
| betamaxthetape wrote:
| This is amazing. I had a play around with it whilst it was in
| beta, and was blown away by the variety of papers returned. On a
| whim I searched for a very obscure topic that I'd researched
| before (just for personal interest) using WorldCat / Google
| Scholar, and to my surprise was presented with several
| highly relevant papers I'd never come across before, that were
| _exactly_ what I was looking for.
| carbocation wrote:
| Interesting. For my field (cardiovascular genetics), the results
| weren't really what I was expecting. I think that my expectations
| probably fit pretty well with a PageRank graph of citations. So
| my guess is that the "relevancy" is semantic only?
| sundarurfriend wrote:
| (OffTopic) All this talk about the logo here made me check the
| page out, instead of moving on after reading just the comments as
| I might otherwise have done. Perhaps that's a HN strategy to use,
| to get people to actually click through - add a bikesheddy thing
| to the page that's likely to be divisive, but doesn't require
| thought. Gives us a cheap way to have an opinion, and thus an
| incentive to click!
| endisneigh wrote:
| I'm curious, how does the Internet Archive handle copyright with
| all of its services?
| marcodiego wrote:
| The Internet Archive is becoming an alternative good internet. It
| has a web archive, film archive, software archive, media
| archive... and now a research paper archive. That is the internet
| as a giant library, as we dreamed in the early '90s.
| Black101 wrote:
| Way too centralized (Centranet?), but it is very nice for now.
| It's a bit like the library of Alexandria, so it could
| change/disappear at any time.
| dbrereton wrote:
| I'm sure they'd be willing to decentralize it if there was a
| good way to do that. Maybe this can be done with something
| like IPFS [0].
|
| [0] https://ipfs.io/
| zucker42 wrote:
| The amount of data is absolutely insane.
| Black101 wrote:
| Yes, they have very good intentions right now, but what if
| the leader gets hit by a bus?
| musicale wrote:
| Presumably it would be acquired, paywalled, and
| monetized by a private equity firm (or some suitably
| hostile intellectual property rightsholder organization)
| before going bankrupt and shutting down for good.
|
| Thanks for an incredible journey.
| puddingnomeat wrote:
| Is it easy to have a local copy?
| capableweb wrote:
| Internet Archive strikes again! I love Internet Archive, not just
| for archiving websites but for archiving everything and making it
| easily accessible. This is another great service that'll help a
| lot of researchers and hobby-researchers, which is lovely to see.
|
| Don't forget to donate if you also like Internet Archive, they
| need every penny: https://archive.org/donate/?origin=hn
| bnewbold wrote:
| This service was hinted at back in September, but is now formally
| announced and live at https://scholar.archive.org
|
| Related previous post:
| https://news.ycombinator.com/item?id=24485444
|
| Much of the catalog functionality can be accessed from the
| fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a
| search index over the body content of papers, and we are still
| thinking through how to make this available through a public API
| without slowing down query latency even more.
|
| Folks here might also be interested in this CLI for interfacing
| with the catalog and making edits:
| https://gitlab.com/bnewbold/fatcat-cli
| breck wrote:
| I absolutely love everything about it (the logo <3).
|
| Super fast. All my test searches returned what I was looking
| for.
|
| What is your relationship with semantic scholar like?
|
| Any plans to integrate ranking signals like references, etc?
|
| I'm going to double my monthly donation. This is great.
| bnewbold wrote:
| Thank you for the kind words!
|
| We are friendly with Semantic Scholar, and have used their
| "open corpus" dumps as one of several URL seed lists for
| crawling in the past. Their search and discovery tech is more
| sophisticated than ours is likely to be any time soon
| (https://medium.com/ai2-blog/building-a-better-search-
| engine-...). We would love to get to the place where groups
| like AI2, which are primarily research-oriented, could build
| on an existing open catalog and corpus, and not need to
| duplicate time crawling, merging catalogs, cleaning metadata,
| etc. As of today Microsoft Academic (used by Semantic
| Scholar) might be a better option.
|
| Want to be thoughtful about ranking signals, and are deeply
| skeptical of journal impact factor, h-index, and most
| bibliometrics. "Has this been cited more than a handful of
| times" seems like a reasonable coarse boost. Hope to include
| more curated signals, like "won a paper prize", "journal in
| DOAJ and other reviewed indices", etc.
|
| Have been working on a citation graph, keep an eye out for
| something about that in coming months. One cool thing we hope
| to do with the citation graph is find "missing works" not yet
| in the catalog (e.g., don't have a DOI, especially for pre-1990
| era).
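
bnewbold's comment above says that much of the catalog functionality
can be accessed from the fatcat.wiki API, and that Scholar is live at
https://scholar.archive.org. A minimal sketch of building request URLs
for both might look like the following. The `/v0` version prefix, the
`/release/lookup` path, the `doi` parameter, and the `/search?q=` form
are assumptions not stated in the thread, so check them against
https://api.fatcat.wiki/redoc before relying on them. The helper names
are hypothetical, and the code only builds URLs; it makes no network
requests.

```python
from urllib.parse import urlencode

# Assumption: the public API is served under a /v0 prefix.
FATCAT_API = "https://api.fatcat.wiki/v0"
# From the thread: Scholar is live at this host.
SCHOLAR = "https://scholar.archive.org"


def release_lookup_url(doi: str) -> str:
    """Build a catalog lookup URL for a paper ("release") by DOI.

    The path and the `doi` query parameter are assumptions; verify
    them in the redoc documentation linked in the thread.
    """
    return f"{FATCAT_API}/release/lookup?{urlencode({'doi': doi})}"


def scholar_search_url(query: str) -> str:
    """Build a Scholar search-page URL (assumed `/search?q=` form)."""
    return f"{SCHOLAR}/search?{urlencode({'q': query})}"


if __name__ == "__main__":
    # The DOI below is a placeholder, not a real paper.
    print(release_lookup_url("10.1000/xyz123"))
    print(scholar_search_url("cardiovascular genetics"))
```

Keeping URL construction separate from fetching makes it easy to swap
in whichever HTTP client is preferred and to confirm the exact endpoint
shape against the documentation first.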
___________________________________________________________________
(page generated 2021-03-09 23:00 UTC)