[HN Gopher] Sci-Hub statistics and database
___________________________________________________________________
Sci-Hub statistics and database
Author : NmAmDa
Score : 317 points
Date : 2022-02-12 17:40 UTC (5 hours ago)
(HTM) web link (sci-hub.ru)
(TXT) w3m dump (sci-hub.ru)
| gw67 wrote:
| How are they able to store the data without it being seized?
| modeless wrote:
| Did Sci-Hub start working again? Last time I checked it wasn't
| adding new papers because of some legal thing going on in India.
| phoe-krk wrote:
| Yes - see
| https://twitter.com/ringo_ring/status/1492419986291408898
| Ansil849 wrote:
| The site is working in the sense of you can download old
| papers, but I don't believe any new papers from the last year
| are accessible.
| DoItToMe81 wrote:
| I accessed a paper from late last year not so long ago. I
| think it's working fine.
| Ansil849 wrote:
| > I accessed a paper from late last year not so long ago.
| I think it's working fine.
|
| It is not. A large batch of new papers was added
| manually, but the old service of typing in a DOI and
| having a paper be retrieved automatically is not working.
| Pick 10 random DOIs from 2022 and see how many Scihub
| will return.
| lamontcg wrote:
| AFAIK it's not really working again? I think there was an upload
| of a bunch of papers in a batch recently, but not ones that I
| was hoping for. I'm sort of worried about the past-tense
| language in this page suggesting that it isn't starting back up
| again.
| nefitty wrote:
| Here's a notebook that fetches Sci-Hub mirrors from Wikidata and
| tests them. I also included an iOS Shortcut to add to your Share
| screen. When you're on a site that Sci-Hub recognizes and you use
| the shortcut it will try to fetch the paper.
|
| https://observablehq.com/@iz/sci-hub
| raziel2701 wrote:
| Alexandra Elbakyan is a titan and a saint. I wouldn't have been
| able to finish my research without access to papers my
| institution wasn't subscribed to.
| [deleted]
| OmicronCeti wrote:
| I snuck her into my own dissertation acknowledgements:
| https://imgur.com/bDgtBAE
| jdrc wrote:
| Now I feel foolish for not acknowledging her, especially in
| Elsevier papers.
| [deleted]
| p1esk wrote:
| Interesting. Medical field dominates research in terms of
| publications. Chemistry produces double the papers compared to
| physics, and humanities are smaller than biology but larger than
| physics. I wonder where machine learning papers fit in - CS or
| Math or both?
| sgillen wrote:
| I would imagine it depends on the particular paper, the more
| experimental ones in CS, the more theory ones in math.
|
| Do note though that most math and ML practitioners use arxiv
| over sci-hub.
| remuskaos wrote:
| It is noteworthy that most of physics (at least high energy
| physics) is published on arxiv.org and open access.
|
| I don't know if sci hub bothers with publications that are
| available freely from an official source.
| p1esk wrote:
| Good point. If so, this data is a lot less interesting :(
| philipkglass wrote:
| Sci-hub will grab and serve anything with a DOI (or at least
| used to; I don't know if they have started ingesting papers
| again after turning it off a while ago). I have found open
| access papers there before. It's simpler to just paste the
| DOI into sci-hub than to check to see if it's one of the few
| open access articles in a mostly paywalled journal.
| anon_123g987 wrote:
| > published on arxiv.org and open access
|
| Don't use the term "open access" like this. A paper published
| on arXiv is free to read, and was freely published. "Open
| access" is a scam by the big publishers, where they don't
| take money from the _readers_ , but make the _authors_ pay.
| Or, putting it another way, anyone can pay their way in those
| journals and publish (sometimes subpar) papers.
| nicoburns wrote:
| No, "open access" means that the paper is available to
| readers for free. Making the authors pay is typically
| termed "gold open access".
| mNovak wrote:
| I've never heard the term "gold open access", but I know
| plenty of "open access" journals that charge a fee to
| authors.
| remuskaos wrote:
| I wasn't aware that there were different distinct forms
| of "open access", so I had to read up on it on Wikipedia.
| From what I understand, publications on arxiv are either
| gratis or libre open access.
|
| Either way, we don't pay anyone any fee to publish on
| arxiv.
| 13415 wrote:
| Not that I want to defend open access fees but the way you
| describe it is incorrect. Paying open access fees to
| large publishers like Springer is an option that is
| separate from the review system; you can only choose it
| once your paper has been reviewed and accepted.
| remuskaos wrote:
| As I wrote on another comment, I wasn't aware that there
| are multiple forms of open access. Since it appears that
| arxiv (again, at least high energy physics) employs mostly
| either gratis or libre open access, and since the Wikipedia
| article explicitly calls it an open access archive, I see
| no harm in calling it that either.
|
| "arXiv (pronounced "archive"--the X represents the Greek
| letter chi [χ])[1] is an open-access repository of
| electronic preprints and postprints[...] "
| The_rationalist wrote:
| The machine learning publication rate is small, at least
| assuming that paperswithcode contains most of the publications.
| [deleted]
| ok123456 wrote:
| What's the most popular paper on all of scihub? By field?
| mmettler wrote:
| Alexandra should get the Nobel prize.
| iqanq wrote:
| na85 wrote:
| I mean, even if you limited yourself to just the Peace prize
| (arguably the most controversial), you'd still have to
| reconcile your statement with the fact that people like
| Malala Yousafzai have won.
| iqanq wrote:
| allisdust wrote:
| With the rent seeking companies being from Europe? Not a
| chance.
|
| Nobel is a political tool that's mostly there to make a point
| (especially that peace prize).
| anon_123g987 wrote:
| He said "should", not "will". Both of you are right.
| [deleted]
| 2Gkashmiri wrote:
| i asked this question here and at many places before. why do
| people "rely" on an organization that sifts through hundreds of
| thousands of papers and then charge exorbitant prices for
| providing this service? if we use the amazon analogy, is amazon
| with millions of products worse than a boutique cat food seller
| that specializes in a specific cat food for a specific cat breed?
| maybe. but what about the "rest" of products?
|
| why are our scientists made to rely on elsevier et al to sift
| through the junk and find for them the perfect paper instead of
| doing it themselves? is science now such a cutthroat quick
| competition that it requires you to give a company the privilege
| to work for you so that you don't have to do your own due
| diligence?
|
| in india, we have a lot of local research that is done on open
| databases like shodh ganga and many more. but if you have to
| access foreign research material, you'd better hope your university
| has an agreement with elsevier and others to pay them millions for a
| login. the alternative, go to scihub and find what you need.
|
| i understand the whole quality/delivery debate but doesn't the
| average user already know who the big players in the specific
| domain are and who are trusted? or you want discoverability at
| the hands of a "trusted third party" without doing the legwork
| yourself.
|
| then at the other end you have non-academics like me. I might
| have heard of a research paper in some article and i cannot read
| it without paying an arm and a leg. why? if we use the whole
| ebook/book argument that compensation is commensurate to the
| sales so more popular book means more money to the author but
| here authors aren't compensated but elsevier so why should i pay
| elsevier? because they filtered through 1000 papers to provide 10
| and for that privilege, they require unlimited royalty for ever?
| why?
| Hendrikto wrote:
| > why are our scientists made to rely on elsevier et al to sift
| through the junk and find for them the perfect paper instead of
| doing it themselves?
|
| Scientists do do that themselves. That's why it is called peer
| review. Journals take scientists' work for free; they just
| pre-select papers, but don't do the review.
| slater wrote:
| There's some truth to "publish or perish". Scientists are
| expected to publish in prestigious journals.
| Qem wrote:
| Not only expected, but actually forced. In many places, a
| streak of a few years with no publications in prestigious
| journals can irrecoverably sink a researcher's career.
| OmicronCeti wrote:
| A typical PhD dissertation these days is 3 publications in
| high-quality journals. It is explicitly required at most
| schools.
| f6v wrote:
| Because otherwise you'll have to sift through tons of garbage
| "research". It's already common knowledge that many articles
| coming from certain countries are fraudulent. There's a lower
| chance of having those in journals like Nature Medicine.
| aurizon wrote:
| If only the Nobel Committee would say:- The Nobel Committee will
| only consider research published under an Open Source Access
| repository in reviewing published papers for consideration for
| the Nobel Prize after ~~ June 30, 2022. This would unleash a
| horde of hungry cats among those fat pigeons that are the
| paywalled journals. There would be a crying and wailing - ending
| with piles of feathers (and purring cats), and researchers all
| over the world, and especially in the many 'third world'
| Universities whose minds are currently held hostage to budgets
| and local politics. The world would gain immeasurably by this
| simple act!
| [deleted]
| gumby wrote:
| 100 TB is pretty small. I wonder if she will start torrenting it
| so people can back it up and share the load.
| logifail wrote:
| > 100 TB is pretty small. I wonder if she will start torrenting
| it so people can back it up and share the load
|
| This has been ongoing for a while now:
|
| _Rescue Mission for Sci-Hub and Open Science: We are the
| library_
| https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...
| gumby wrote:
| Excellent, thanks!
| intunderflow wrote:
| Remember to donate to sci-hub to keep it going! Even a small
| donation helps and is way less than the extortionate prices we'd
| all have to pay without it :D
| [deleted]
| The_rationalist wrote:
| How come we don't have extensive software to help doctors make
| decisions, e.g. by using Bayesian inference, while feeding on the
| available superintelligence that those 24 million papers enable?
| Expert systems passed the hype curve long ago and it's time for
| them to cycle up again!
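A minimal sketch of the kind of Bayesian update the comment above has in mind, assuming a single binary test. The function name `posterior` and the prevalence, sensitivity, and specificity values are hypothetical illustration values, not figures from the thread or from any paper.

```python
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    """Return P(disease | positive test) using Bayes' rule.

    prior:       assumed prevalence of the disease, P(disease)
    sensitivity: P(positive test | disease)
    specificity: P(negative test | no disease)
    """
    true_pos = sensitivity * prior                    # P(positive and disease)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # P(positive and no disease)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical numbers: 1% prevalence, 90% sensitivity, 95% specificity.
    p = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)
    print(f"P(disease | positive) = {p:.3f}")
```

With a rare condition the posterior stays low even for a fairly accurate test, which is one reason a decision-support tool would need reliable base rates from the literature before it could be trusted.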
| f6v wrote:
| Because research can be controversial. There're papers in my
| field saying patients have increased frequency of certain
| cells. There're other papers saying they don't. Go figure.
| Qem wrote:
| Nailed it. With publish or perish incentivizing shenanigans
| like "p-hacking", many of those papers are the research
| equivalent of spam.
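One simple way to see why "p-hacking" yields spurious findings: if a study runs many independent tests at the usual 0.05 threshold and no real effect exists, the chance that at least one test comes out "significant" grows quickly. The sketch below assumes independent tests; the function name `familywise_error` is made up for illustration.

```python
def familywise_error(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests, each run at significance level `alpha`,
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Show how the false-positive chance grows with the number of tests.
    for k in (1, 5, 20):
        print(k, round(familywise_error(0.05, k), 3))
```

Under these assumptions, twenty tests give roughly a 64% chance of at least one false positive, so reporting only the test that "worked" can look like a discovery.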
| [deleted]
| nefitty wrote:
| I think Watson does something like this.
| roywiggins wrote:
| It didn't seem to actually work though.
|
| https://slate.com/technology/2022/01/ibm-watson-health-failu...
| monkeybutton wrote:
| I wonder how many years that sets back the field. Who will
| want to invest in something that could end up being Watson
| 2.0?
| dagw wrote:
| Did Watson fail because they were bad at their job or
| because the problem is much harder than people assumed?
| nefitty wrote:
| I think the marketing got ahead of the tech. I would
| classify that as a business failure.
| kilburn wrote:
| An older comment of mine
| https://news.ycombinator.com/item?id=30049522 fits well here.
| I'll adapt it to your question ;)
|
| Basically: medicine as a whole is already some sort of expert
| system.
|
| - Data collection and cleanup: Researchers conduct experiments
| to produce meaningful data and extract conclusions from that
| data.
|
| This part isn't more automated because we have strict rules
| that prevent medical data collection and analysis without a
| clear purpose. Otherwise we'd be able to collect a lot more
| information to try and extract results from it using more
| inference-oriented techniques (deep learning and the like).
|
| - Modeling & training: Expert panels produce guidelines from
| the results of that research. These panels are the "training
| part" of the system.
|
| As a sibling comment said, replacing these panels with ML-based
| techniques isn't trivial because the data produced in the
| previous step is fairly noisy (p-value hacking, difficulty of
| capturing all the variables, etc.). Furthermore, the techniques
| that yield best results nowadays also produce them without
| clear explanations on why they hold, which is not something we
| are prepared to accept in medicine.
|
| - Execution: Doctors diagnose and treat following said
| guidelines. In fact, they use decision flows that they
| themselves call... algorithms!
|
| The main reason why execution is not automated is that we do
| not have the technology for machines to capture the contextual
| and communication nuances that doctors pick up on. There can be
| a world of difference between the exact same statement given by
| two different patients or even the same patient in two
| different situations. Likewise, the effect of a doctor's
| statement can be quite literally the opposite depending on who
| the patient is and their state of mind. One of the most
| important aspects of the GP's job is to handle these
| differences to achieve the best possible outcomes for their
| patients.
|
| All that being said, there are companies trying to produce
| expert systems to help doctors diagnose. See
| https://infermedica.com/product/infermedica-api for instance.
| [deleted]
| [deleted]
| belter wrote:
| It looks like the torrents have all subjects. Anybody aware if
| there are torrents only for Math or Comp.sci ?
| Ansil849 wrote:
| Is there any word about when sci hub is going to start adding new
| articles again? It's currently only useful as an archive of old
| research articles. New papers from the last year are not
| available. I never understood the rationale for stopping new
| content, though I believe it had some relation to some court case
| in India...but I don't understand why that was a reason to stop
| adding articles, and why it hasn't been restarted yet.
| derbOac wrote:
| What I read was that the Indian judicial system tends to be
| favorable to things like Sci Hub in its interpretation of
| copyright, and Sci Hub wanted to act in good faith with regard
| to that court, so as to have a fairly solid basis in
| international law for operating, should it rule in Sci Hub's
| favor. I might be off in this understanding, but that's what I
| understood.
| Ansil849 wrote:
| Yeah, I have heard this reasoning, but it seems muddled. How
| is keeping the site online so old articles are available but
| no new articles are added acting in "good faith"? It's not
| like the old articles are any less copyrighted than the new
| articles, so this doesn't make sense to me.
|
| The court case has also been delayed for over a year now, so
| if it is delayed indefinitely, like it seems to be, then we
| will also not get access to new articles, also indefinitely?
| That's ridiculous. The last update from the court proceedings
| claimed that there would be a new update over a month ago,
| which in turn got delayed yet again to a few days ago, and
| there's been nothing [1].
|
| [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list.asp?pn...
| baybal2 wrote:
| In India, courts have famously few remedies against no-show
| from plaintiffs.
| joshuaissac wrote:
| They had resumed adding articles after receiving legal advice
| that the Delhi High Court injunction only applied for a few
| months.
|
| https://mobile.twitter.com/ringo_ring/status/143435621720862...
| Ansil849 wrote:
| I saw that tweet, but it doesn't change the material reality:
| try plugging in some DOIs from recent articles from the last
| year, and they will not be there.
|
| Scihub used to be a great resource, now it's only a resource
| for old research. Still useful for background material, but
| not for current work.
|
| I also don't understand why the Indian court case has any
| impact on new article availability. The owner is not Indian.
| The servers and domains are not Indian. There doesn't seem to
| be any actual reason to stop adding new articles, other than
| some idiotic halfbaked point that only hurts the people who
| need the articles, like when Project Gutenberg banned anyone
| from a German IP, except this is much worse since there is no
| way around it for people who need new papers.
| [deleted]
| generationP wrote:
| I have a hunch that the downfall of the "Plato" real-time
| downloader wasn't the Indian court case but rather the fact
| that it helped publishers trivially identify the university
| accounts through which the downloads were happening. Even
| if the appearance of papers were delayed by a random number
| of days, there are other pitfalls now, and most
| importantly, publishers started caring. In particular,
| Elsevier now slaps UUIDs onto all PDFs you download from
| them, and no, I'm not just talking of visible watermarks.
| Other publishers seem to be doing similar things (there was
| a recent twitter thread on this, retweeted by @textfiles,
| which I can't find). The rational solution for Sci-Hub
| seems to be to buffer their uploads and release them in
| yearly batches, maybe programmatically removing various
| kinds of watermarks and diffing against the same paper
| downloaded from a second IP. If this is what they are
| doing, I'm not surprised. Not sure how much of a winning
| strategy they have in the long run, though.
|
| Guys: post your papers on the arXiv.
| mohammad_ali85 wrote:
| This might be the twitter thread you're referring to?
| https://twitter.com/json_dirs/status/1486120144141123584
| generationP wrote:
| Yep, thank you!
| joshuaissac wrote:
| > I also don't understand why the Indian court case has any
| impact on new article availability. The owner is not
| Indian. The servers and domains are not Indian.
|
| Because Sci-Hub has a good chance of winning the case. The
| court in question has previously backed a very broad
| definition of what constitutes fair dealing.
|
| https://en.m.wikipedia.org/wiki/University_of_Oxford_v._Rame...
| Ansil849 wrote:
| > Because Sci-Hub has a good chance of winning the case.
|
| I understand that this is the party line that is parroted
| whenever this issue comes up, but it does not make any
| sense as a rationale for keeping new articles off the
| site. How is not adding any new articles (but, for
| example, keeping old articles accessible) assisting the
| possible winning of the case? And more to the point, why
| does it matter at all if it wins or loses the case? As
| stated, neither the owner nor the infrastructure is
| Indian, so of what relevance is this jurisdiction?
|
| And further still, the case appears to have been delayed
| indefinitely. The last update claims that there was
| going to be an update a few days ago, but there was not.
| The proceedings are just now a list of one postponement
| after another [1]. Given that new articles are being held
| hostage, it thus very obviously benefits the legal system
| and the prosecution to continue to delay the case
| indefinitely.
|
| [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list.asp?pn...
| sa1 wrote:
| The owner might not be Indian, but she's actively
| defending the case (through lawyers) in India. Not
| following the injunction would lead her to losing the
| case, which is why she followed through. She didn't have
| to fight the case in India, but she chose to. Why keep
| old papers and stop adding new papers - that probably
| depends on the terms of the injunction.
| Ansil849 wrote:
| > Why keep old papers and stop adding new papers - that
| probably depends on the terms of the injunction.
|
| As per the official tweet that has already been mentioned
| in this thread [1]:
|
| > how about the lawsuit in India you may ask: our lawyers
| say that restriction is expired already
|
| So according to the owner's official Twitter, this is no
| longer a valid reason, and yet new papers are still not
| accessible. Why is that?
|
| [1] https://mobile.twitter.com/ringo_ring/status/143435621720862...
| sa1 wrote:
| Haven't got around to adding yet?
| Ansil849 wrote:
| That is not how scihub used to function. Scihub used to
| have an engine, named Plato, which would fetch papers
| automatically if not already in their database. For the
| last year now, this essential service has not been
| operational. This is what the issue I am raising is
| about.
| sa1 wrote:
| It's clear what you're talking about. Software bitrots
| over time. Plato might need fixes, might have a huge
| backlog, lots of stuff can happen.
| pmoriarty wrote:
| It's interesting how sci-hub's papers on medicine dwarf those in
| many other fields like comp-sci, math, and physics. I wonder if
| that reflects the number of papers in those fields, or if sci-hub
| just has a non-representative sample. If the latter, why?
| p1esk wrote:
| It does appear to be the latter. I just searched for several
| famous ML papers (attention is all you need, lottery ticket
| hypothesis, capsules, etc) and they are not there. I think if
| someone counted all papers that have been ever published
| anywhere, the picture would be a lot different.
| pmoriarty wrote:
| So does that mean that vastly more people in medicine use
| sci-hub than do people from other fields?
|
| Or is there some other reason for the discrepancy?
| p1esk wrote:
| Could be. I've been an ML researcher for 8 years and I
| haven't used sci-hub until today. Ironically one of my
| (very obscure) papers is available there.
| Qem wrote:
| I guess today it is much easier to find new noteworthy,
| publishable facts in medicine than in physics. New diseases are
| discovered every year, and old diseases are poorly understood
| (e.g. Alzheimer's disease), and the treatments for many of them
| are still sub-optimal, or even nonexistent. Every patient is
| different, individual cases are research-worthy. We only got
| antibiotics in the 1940s. On the other hand, most big
| breakthroughs of physics happened between the 17th century and
| the first decades of the 20th century. After the general case
| is cracked in physics, individual cases have very little
| publishing value.
| _Wintermute wrote:
| I think it's due to the sheer number of biomedical papers
| published each year, coupled with comp-sci, maths and physics
| papers being less likely to be behind a paywall.
___________________________________________________________________
(page generated 2022-02-12 23:00 UTC)