[HN Gopher] Internet Archive Scholar
___________________________________________________________________
Internet Archive Scholar
Author : nabla9
Score : 389 points
Date : 2022-12-09 10:35 UTC (12 hours ago)
(HTM) web link (scholar.archive.org)
(TXT) w3m dump (scholar.archive.org)
| zozbot234 wrote:
| Not from the IA, but see https://scholia.toolforge.org for an
| especially nice presentation of freely-available scholarly
| metadata.
| jboynyc wrote:
| You may also be interested in OpenAlex.org which also uses
| wikidata (along with DOIs, ORCIDs, ISSNs and a few other
| standard identifiers) to classify publications.
| photochemsyn wrote:
| After a little testing, this looks like a good information
| source, although the combination of Google Scholar and Sci-Hub
| is probably still the best option: I couldn't find anything
| with Internet Archive Scholar that wasn't available through the
| other options, and the quality of search results is somewhat
| higher with Google Scholar (perhaps because Google Scholar uses
| citation count as a ranking signal, which Internet Archive
| Scholar doesn't seem to do).
|
| The Internet Archive is a great resource, and it should get
| state funding, as it provides a fundamentally important
| archival service. It's too bad it has to rely so heavily on
| private philanthropic donations. (Although state support comes
| with the risk of political interference, i.e. censorship of
| material some politician doesn't like; maybe that's less of a
| problem with private donations, although then you could have
| some billionaire doing the same thing.)
| macrolime wrote:
| I tried to search for scientific authors from the 1800s and
| they're there.
|
| Google Scholar on the other hand brings me to paywalls, even
| though the articles are so old they should be out of copyright.
| bnewbold wrote:
| The content in scholar.archive.org has been indexed into
| Google Scholar (and other indices are likewise welcome to crawl
| the sitemap). There was some content "only" in
| scholar.archive.org, but now it should basically all be in
| Google Scholar. We haven't gotten around to describing this
| publicly, but it was an explicit decision and partnership
| between the organizations.
|
| Indeed scholar.archive.org does not currently use citation
| count in search rankings. We have a decent citation graph,
| which we are working to expose in scholar (it is visible in
| fatcat.wiki today). We would probably only ever use citation
| count as a weak boost in search rankings (e.g. "any citations
| at all" or "more than 25 citations" as boosts, nothing beyond
| that); we don't want to create too strong a feedback loop
| influencing future citations.
|
| scholar.archive.org specifically was partially funded by the
| Mellon Foundation (and partially through donations and other
| service revenue). IA overall has diverse funding, including
| grants and service revenue from the USA (Library of Congress,
| IMLS, etc); other national governments (paid crawl services);
| foundation grants; universities and libraries (crawl,
| preservation, and digitization services); and of course general
| donations. The last category of course has the fewest strings
| and lets us pursue new projects which might be hard to get
| traditional funding for. Remember that the whole premise of web
| archiving was considered radical and quixotic at the beginning!
|
| (source: I work at IA on scholar)
| veqq wrote:
| Some days, nearly half the links I click are dead, so I've
| found myself relying on the Wayback Machine more and more over
| the past few months. It's really shocking just how fast digital
| obsolescence reared its ugly head. Of course Angelfire,
| GeoCities, etc. were a clear early blow, but nowadays...
|
| I've started saving the HTML of every interesting article I
| find online (including the CSS seems like too much overhead,
| and it's often incomplete or still relies on external
| downloads; screenshots aren't searchable), and I'm downloading
| quite a few videos too with yt-dlp. I'd long copy-pasted all
| interesting comments into a txt file, but now it seems like
| data hoarding is the way to go, at least in moderation,
| focusing on things I'll actually refer back to.
|
| I remember 15 years ago, discovering pdf dumps on random sites
| like a kid in a candy store. Perhaps it'll be like that again,
| with people presenting museums of their favorite old pages.
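The save-the-HTML habit described above can be sketched in a few lines of Python. The filename scheme is an assumption, and real use would want error handling plus something like SingleFile to inline CSS and images.

```python
# Minimal sketch: fetch a page and save its raw HTML under a dated,
# URL-derived filename. Fetching and naming details are illustrative.

import datetime
import re
import urllib.request
from typing import Optional

def save_page(url: str, html: Optional[bytes] = None) -> str:
    """Save a page's raw HTML; returns the filename written."""
    if html is None:  # allow passing bytes directly (e.g. for testing)
        html = urllib.request.urlopen(url).read()
    # Turn the URL into a filesystem-safe slug.
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")
    name = f"{datetime.date.today().isoformat()}-{slug[:80]}.html"
    with open(name, "wb") as f:
        f.write(html)
    return name
```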
| b1zguy wrote:
| I've been wanting to run my own search engine sorta thingy that
| indexes websites I feed it. I sometimes find little nooks of
| the net that post resources I may need in the future. Like my
| own mini-Google that indexes a list of sites.
|
| How can I go about creating this? Are there off-the-shelf
| solutions, or will I need to, say, combine Scrapy with
| Elasticsearch? The links in this thread look promising.
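Before reaching for Scrapy or Elasticsearch, the core of such a "mini-Google" is just an inverted index. A hedged sketch (all URLs and page text here are placeholders):

```python
# Minimal sketch of a personal search index: word -> set of URLs.
# A real setup would swap the dict for SQLite FTS5, Whoosh, or
# Elasticsearch, and fetch pages with a crawler such as Scrapy.

import re
from collections import defaultdict

class MiniIndex:
    def __init__(self):
        self.index = defaultdict(set)  # token -> {url, ...}

    def add(self, url: str, text: str) -> None:
        """Index a page's text under its URL."""
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.index[token].add(url)

    def search(self, query: str) -> set:
        """AND-query: return pages containing every query term."""
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        if not tokens:
            return set()
        result = self.index[tokens[0]].copy()
        for t in tokens[1:]:
            result &= self.index[t]
        return result

idx = MiniIndex()
idx.add("https://example.com/a", "Rust memory safety without garbage collection")
idx.add("https://example.com/b", "Garbage collection tuning in the JVM")
print(sorted(idx.search("garbage collection")))  # both example pages
print(sorted(idx.search("memory safety")))       # only the first page
```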
| graderjs wrote:
| Some people use this tool (of mine) for saving web content from
| either bookmarks or just everything you browse:
| https://github.com/crisdosyago/Diskernet
|
| There's also plenty of other similar tools:
|
| - https://github.com/ArchiveBox/ArchiveBox
|
| - https://github.com/gildas-lormeau/SingleFile
| randomguy12 wrote:
| Zotero also saves snapshots of pages if you already cite
| academic pages
| staringback wrote:
| zote wrote:
| How have I never seen your tool before.
|
| >22120 archives content exactly as it is received and
| presented by a browser, and it also replays that content
| exactly as if the resource were being taken from online.
|
| I've been looking for this for a really long time.
| arminiusreturns wrote:
| Beware the very strange and bad license for Diskernet, which
| is "Polyform Strict License 1.0.0"
| tetris11 wrote:
| For people looking for more info on these strange licenses:
|
| https://www.reddit.com/r/linux/comments/coazye/what_does_rl
| i...
| martyvis wrote:
| That Polyform licence looks truly awful. Lack of mention of
| how I could use it while working.
| detaro wrote:
| > _how I could use it while working._
|
| The readme clearly links where to buy a license for that.
| leslielurker wrote:
| I'd never heard of SingleFile before, but it looks excellent.
| It would be great if Firefox could incorporate it into its
| save function too. Firefox's save-page feature works, but as
| shown in the SingleFile demo video, it's not really what a
| user would expect: it's often incomplete, and splitting the
| result across multiple files/directories isn't ideal either.
| Springtime wrote:
| It's unfortunate, as Firefox used to have excellent MHTML
| support (which similarly achieves an all-in-one file) via
| addons, particularly the feature-rich _UnMHT_. Chromium and
| its derivatives, meanwhile, support MHTML saving natively (as
| did Opera Presto and IE in the past).
|
| If they brought back MHTML saving support it'd be a great
| win.
| tetris11 wrote:
| > Coming to a future release, soon!: The ability to publish
| your own search engine that you curated with the best
| resources based on your expert knowledge and experience.
|
| This would be fantastic, being able to browse a curated
| internet made of accumulated lists from other trusted users
| on the net, similar to how ad blocking lists work today.
|
| You are genuinely trying to steer the internet into what it
| used to be: a museum of knowledge and expert discussion.
|
| Edit: Ah but wow, Polyform license. Huh.
| msravi wrote:
| I use the markdownload extension[1] on firefox and move the .md
| file into my notes folder (notable[2]). Works very well.
|
| 1. https://addons.mozilla.org/en-US/firefox/addon/markdownload/
|
| 2. https://notable.app/
| dspillett wrote:
| _> I 've started saving_
|
| I do similar. I've had https://github.com/ArchiveBox/ArchiveBox
| bookmarked for a while as something to help better organise all
| that, but like a great many things I haven't gotten around to
| it yet.
| circustaco wrote:
| I use Raindrop for this. It's a pretty great bookmark manager
| made by an indie dev, but it also can create archives of
| bookmarked pages.
|
| https://help.raindrop.io/backups#permanent-library
| dspillett wrote:
| "Only available in Pro plan". No information on whether
| that is a general limitation or if these features are
| present if self-hosted, so I assume the former. And there
| seems to be little obvious information for self-hosting.
|
| So probably not one for my use case.
| dspillett wrote:
| Thanks, I'll add that to the list of things to try out.
| patrickdward wrote:
| I use Raindrop but didn't know about that feature. Thanks.
| gwbas1c wrote:
| I just save the pdf of a site that's _really_ important to me.
|
| When I did a major college project in 2003, I made sure to make
| PDFs of any academic article I referenced. It actually saved
| me, because some articles had disappeared by the time I went to
| revise my references.
| 082349872349872 wrote:
| While the author(s) are still alive, they are often a
| productive contact.
|
| (In one case I was able to give back: bundling up the several
| scans an author had of a half-century old paper from their
| student days into a single, hopefully cromulent, PDF)
|
| Edit: recall also that accepting that links are one-way and
| might be dead was _the key_ simplification that allowed HTTP to
| take off after prior attempts at hypermedia had failed.
| mrybczyn wrote:
| Good (Edit) point! It's good that the web accepts dead links by
| design; we can't expect perfection from our distributed
| information. But the bitrot of information seems too high
| compared to the storage technologies available.
|
| Two spinning-rust drives can store the Library of Congress; a
| few thousand drives would store the indexed web (1). How many
| millions of these drives get manufactured per year? Our
| technology systems are failing us: all those words are being
| lost, like tears in rain.
|
| (1) Back-of-the-envelope estimate:
| https://www.worldwidewebsize.com/ estimates ~50 billion indexed
| pages. At, say, 1 MB per page on average that's ~50 PB, so on
| the order of 2,000-3,000 large drives would store the indexed
| web.
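The footnote's arithmetic, redone under stated assumptions (worldwidewebsize.com counts indexed pages; ~1 MB per page and ~20 TB per drive are rough round-number figures, not measured values):

```python
# Back-of-the-envelope check of the drive-count estimate above.
# All inputs are assumptions: ~50 billion indexed pages, ~1 MB per
# page, ~20 TB per spinning-rust drive.

pages = 50e9
bytes_per_page = 1e6      # 1 MB
drive_capacity = 20e12    # 20 TB

total_bytes = pages * bytes_per_page      # 5e16 bytes = 50 PB
drives = total_bytes / drive_capacity
print(f"{total_bytes / 1e15:.0f} PB, ~{drives:.0f} drives")  # 50 PB, ~2500 drives
```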
| brewdad wrote:
| Even if that estimate is off by an order of magnitude,
| which given the weight of modern web pages it easily could
| be, 20,000 drives to store the entire web seems way more
| doable than I ever would have imagined.
| 8bitsrule wrote:
| 2,000 drives x $100 = $200,000. Double that for backup:
| $400,000. Admin and maintenance... let's say $1M/year total.
| So Wikipedia could end its own dead-link problem (IF the
| reference sources would agree).
|
| But stuff goes missing at the Wayback Machine because people
| don't agree to their pages being backed up. Copyright,
| whatever. So it's like global heating: the tech is there, but
| people just can't agree. So 'pirate' backer-uppers go to jail,
| and island nations and expensive oceanside properties are
| being submerged. So it goes.
| toomuchtodo wrote:
| The Internet Archive has a bot that updates dead
| Wikipedia references to point to archived content.
|
| https://meta.wikimedia.org/wiki/InternetArchiveBot
| leephillips wrote:
| I find the SingleFile extension superb for this.
| [deleted]
| riskable wrote:
| This seems like the type of thing that will become the search
| engine of _first_ resort in the future as AI-generated propaganda
| and nonsense pollutes the spectrum of websites and search
| results.
| [deleted]
| maggu123 wrote:
| 082349872349872 wrote:
| cf https://gallica.bnf.fr/accueil/en/content/accueil-
| en?mode=de...
| schmudde wrote:
| I've already made my donation to IA this year but I might need to
| make another.
|
| Somehow it's the IA's job to fix problems that we all know are
| problems, sadly.
| wnkrshm wrote:
| Wikipedia reminded me multiple times to donate to the Internet
| Archive this year.
| xattt wrote:
| Is this so Wikipedia can be archived by the IA?
| shepherdjerred wrote:
| It's a joke. Hacker News doesn't like donating to
| Wikipedia; many choose to donate to Internet Archive
| instead.
| wnkrshm wrote:
| I always take the wikipedia donation drive as a reminder to
| donate to archive.org instead.
| nix23 wrote:
| Haha, i know what you mean...my mind works the same way ;)
| password4321 wrote:
| Another chance to upvote the donation link to the top of the
| thread on an IA story, since a direct submission got swallowed
| by the dupe detector! They are doing so many amazing things.
|
| https://archive.org/donate
|
| _Your Donation Will Be Matched 2-to-1! [...] Right now, we
| have a 2-to-1 Matching Gift Campaign, tripling the impact of
| every donation._ (from the home page)
| brookst wrote:
| Same. Oh hey these scammers are asking for money again? Wait,
| I haven't given to IA in a while.
| coldpie wrote:
| If you're able/comfortable, please consider setting up a
| recurring donation. For long-term planning reasons, it's
| helpful for organizations to have a consistent recurring
| revenue stream that they can use to project assets further into
| the future. One-off donations are good, too! But if you're
| going to consistently send them money anyway, you may as well
| do it in a predictable manner to help their accounting.
| pabs3 wrote:
| I wonder if this includes anything from Sci-Hub, or if they are
| unrelated.
| toomuchtodo wrote:
| Artifacts that are of questionable legality due to copyright
| are archived but not made public, for obvious reasons (this is
| typically referred to as being "darked").
| aaroninsf wrote:
| PSA the Internet Archive is a 501c3 non-profit (library!) and
| survives on donations and grants.
|
| A huge percentage of the operating budget comes from small
| donors. The funding is preposterously small compared to other
| public-interest services such as Wikipedia.
|
| A lot of us take it for granted and assume there is e.g. support
| from FAANG companies proportionate to the degree they lean on it.
|
| This is 100% NOT THE CASE.
|
| Please advocate for recurring institutional donations from your
| firm. The audience reading this has a lot of voice in a lot of
| organizations that could, without a thought, sign up to make
| annual 10K, 100K, or 1M donations...
|
| ...and essentially, none do.
|
| Please help change that!!!
|
| https://archive.org/donate/
| doc_gunthrop wrote:
| Anyone who uses amazon.com can set the Internet Archive as
| their preferred charity and shop via smile.amazon.com. A
| percentage of your purchase amount will go to the IA.
| sfusato wrote:
| archive.org is an alternative, good internet in the form of a
| giant library, as dreamed of in the early '90s: web archive,
| film archive, software archive, media archive... and now a
| research-papers archive.
| dang wrote:
| Related:
|
| _Internet Archive Scholar_ -
| https://news.ycombinator.com/item?id=26419782 - March 2021 (3
| comments)
|
| _Internet Archive Scholar: Search Millions of Research Papers_ -
| https://news.ycombinator.com/item?id=26401568 - March 2021 (47
| comments)
___________________________________________________________________
(page generated 2022-12-09 23:00 UTC)