[HN Gopher] Pirate Library Mirror: Preserving 7TB of books (that...
___________________________________________________________________
Pirate Library Mirror: Preserving 7TB of books (that are not in
Libgen)
Author : ValentineC
Score : 243 points
Date : 2022-07-03 20:39 UTC (2 hours ago)
(HTM) web link (pilimi.org)
(TXT) w3m dump (pilimi.org)
| krick wrote:
| So, if I get it right. First there was Libgen, which is
| mirrorable. Then, some Z-Library copied Libgen and added some
| more books, without making it mirrorable. The goal is to make
| these new books, which are not mirrorable -- mirrorable (i.e. to
| "preserve" them).
|
| So, why not just re-upload them to Libgen, then? I guess somebody
| will do that now anyway, but you could easily have done it in the
| first place, without making your own mirror, which is not a
| mirror of Libgen. Just upload them to Libgen and make a mirror of
| Libgen.
| facethewolf wrote:
| From their FAQ:
|
| > _Q: Should the Z-Library collection be added to Library
| Genesis?_
|
| > _A: Yes! However, it is tricky. Library Genesis splits out
| its collection between non-fiction and fiction. They also have
| relatively high quality standards. If you are interested in
| organizing all the books to meet their requirements, let us
| know._
| krick wrote:
| Oh, it didn't occur to me to navigate "backwards". Thanks.
| Actually, their whole FAQ is a good deal more enlightening than
| just the linked page.
|
| http://pilimi.org/faq.html
| [deleted]
| sacrosanct wrote:
| This really needs to be hosted on a Tor hidden service. Clearnet
| sites are easy to take down.
| fsflover wrote:
| Or, better on torrents inside I2P: http://geti2p.net.
| progman32 wrote:
| The homepage has a link to the onion site.
| sacrosanct wrote:
| So why does it have a clearnet address? To have more reach?
| What's their threat model such that a clearnet presence could
| possibly out the people behind this?
| progman32 wrote:
| I don't know. I'd guess reach. Probably wouldn't be on HN
| if it were darknet only.
| ipaddr wrote:
| There is no SSL, so there is no certificate fingerprint that could
| leak the server through Shodan matching.
| metadat wrote:
| Please don't flag it this time. Folks deserve the option to at
| least be aware of it and decide for themselves if they wish to
| pursue it.
| [deleted]
| AdmiralAsshat wrote:
| HTTP-only makes me weary of visiting a self-professed piracy
| site.
|
| They couldn't even spring for a Let's Encrypt cert?
| 1vuio0pswjnm7 wrote:
| https://web.archive.org/web/20220701204054if_/http://pilimi....
| mkl wrote:
| *wary
|
| For some reason I'm seeing this mistake more and more lately.
| https://en.wiktionary.org/wiki/weary vs
| https://en.wiktionary.org/wiki/wary
| nerdponx wrote:
| I've seen it too, and it's a weird mistake to me because I am
| certain that at least one of my friends who makes the mistake
| did not make it in the past.
|
| I wonder if there's a kind of memetic effect happening
| online, where people who lack confidence in their English
| spelling ability see somebody make this mistake and somehow
| think that it's correct, so they switch how they write it.
| gjvc wrote:
| cf. revert / reply
| jamiek88 wrote:
| People have mixed up leery and wary and it has memetically
| become 'weary'.
|
| I notice it more and more too.
| InCityDreams wrote:
| No room for auto-correct?
| brewdad wrote:
| Even so, there are enough ESL readers on here, along with
| native speakers who may not understand the difference, that
| it makes sense to point it out once in a while.
|
| Otherwise we end up in a lose/loose situation where I see
| more people use it wrongly than correctly.
| mkl wrote:
| Seems unlikely, as "wary" appears to be the more common
| word (71M Google results vs 55M). Maybe some keyboard
| mistakenly does it though.
| bbarnett wrote:
| Not sure why the count is relevant here. Both are valid
| words.
| wccrawford wrote:
| I've heard people _say_ it. People who I think should know
| better. It's really frustrating.
| quazar wrote:
| It's a read-only blog with 2 pages. What do you gain from
| putting this over HTTPS?
| ac0lyte wrote:
| Trust?
| na85 wrote:
| Malicious sites can still use letsencrypt
| SquareWheel wrote:
| Do you really want your ISP to know which piracy sites you
| frequent? This is all being sent in plain text. Or they could
| change the content, insert a redirect, or inject ads without
| your knowledge. TLS is needed on all websites - not just
| those with interaction.
| cookiengineer wrote:
| I hate to break it to you, but why do you think ISPs
| override the DNS responses with TTL set to 0?
|
| TLS itself is only useful if you also rely on DNS over
| HTTPS/TLS. Well, setting the issues with TLS 1.2 and
| earlier aside.
| SquareWheel wrote:
| Those are problems too, but they aren't exploited nearly
| as often as MITMing cleartext has been historically. The
| solution you mention is already becoming widely-
| supported, as are newer protocols like QUIC that
| discourage snooping.
|
| There's no reason to ignore a good solution just because
| it's not 100% perfect.
| ars wrote:
| They still know which sites you visit even with https.
| kenniskrag wrote:
| only the destination IP. TLS encryption is inside tcp and
| around the http protocol.
| hombre_fatal wrote:
| You don't need to copy and paste your reply everywhere
| it's relevant on HN. Even us flea brains can carry your
| remarks in our head and apply them to similar comments.
| geoffeg wrote:
| And, unless you set up an appropriate DNS server instead of using
| the default from your ISP, they also know that you
| looked up the site's hostname(s).
| generalizations wrote:
| https won't keep your ISP from knowing you visited the
| site. And the rest of those? For a text-only blog, they
| seem kinda trivial.
| simlevesque wrote:
| > https won't keep your ISP from knowing you visited the
| site
|
| If you use DoH, yes it does. Unless I'm mistaken. They
| only know the IP address of the remote server.
| kevin_thibedeau wrote:
| And nobody would ever think of keeping a reverse DNS
| index.
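| A minimal sketch of such a reverse index, assuming the observer
| already holds a list of forward lookups. The hostnames and
| addresses below are hypothetical placeholders from documentation
| ranges, not real data.

```python
from collections import defaultdict


def build_reverse_index(forward_records):
    """Map each IP address to the sorted hostnames seen resolving to it."""
    index = defaultdict(set)
    for hostname, ip in forward_records:
        index[ip].add(hostname)
    return {ip: sorted(names) for ip, names in index.items()}


# Hypothetical (hostname, IP) pairs collected from forward DNS lookups.
records = [
    ("blog.example.org", "192.0.2.10"),
    ("shop.example.net", "192.0.2.10"),
    ("library.example.com", "198.51.100.7"),
]
reverse_index = build_reverse_index(records)
```

| An IP shared by several hostnames stays ambiguous, while an IP
| that maps to a single hostname identifies the site outright.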
| Anunayj wrote:
| And the SNI. Until ECH is widely adopted, SNI is leaked
| in plaintext when connecting to a server; it needs to be,
| because how else will the server know which TLS cert to reply
| with?
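| A small sketch of the plaintext SNI point, using only Python's
| standard ssl module and no network connection. It builds the first
| handshake message in memory and checks whether the hostname shows
| up unencrypted. The hostname is a placeholder, and this assumes a
| Python/OpenSSL build without ECH, which is the usual case.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Return the raw bytes a TLS client would send first for `hostname`."""
    ctx = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # bytes from the (absent) server
    outgoing = ssl.MemoryBIO()   # bytes the client wants to send
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the ClientHello has been written and the client is
        # now waiting for a server reply that never arrives.
        pass
    return outgoing.read()


hello = client_hello_bytes("example.org")
sni_visible = b"example.org" in hello
```

| Anyone on the path who sees those bytes can read the hostname
| without breaking any encryption.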
| cookiengineer wrote:
| > They only know the IP address of the remote server.
|
| It's the internet. Everyone can scrape links and
| measure/correlate which assets were on them to correlate
| likely visited websites.
|
| Especially if every web page these days is pretty unique
| in terms of what kind of assets (network streams) with
| what kind of byte size were loaded at which point in the
| document loading timeline.
|
| Now include the TLS fingerprint of your web browser and
| well, privacy went to shit.
|
| HTTP needs an upgrade with scattering and rerouting on
| the fly, otherwise these deanonymization techniques can
| never be fixed.
| 1vuio0pswjnm7 wrote:
| Not disagreeing but presenting a hypothetical:
|
| If the user requests the page from Internet Archive,
| Common Crawl or even Google Cache, how does the ISP know
| what the user requested? (NB. Neither IA nor Google Cache
| require sending SNI,^1 so the ISP may only see IP
| addresses).
|
| With IA, the IP address alone does not reveal which IA
| site or page the user is requesting. There is more to IA
| than only Wayback Machine.
|
| With Common Crawl, the user can send the Cloudfront
| domain name instead of a commoncrawl.org domain. Are all
| ISPs going to know that this is Common Crawl? Even if
| they expend the effort to learn, what benefit is
| achieved?
|
| With Google Cache, the IP address alone does not reveal
| which Google site the user is accessing. Needless to say,
| there are many, many domains using these IP addresses.
|
| There is nothing that requires any web user to retrieve
| web pages from a given host. The page may be mirrored at
| a number of hosts. Some of those hosts might offer HTTPS,
| support TLS1.3 and not require plaintext SNI/offer
| encrypted ClientHello.
|
| Even assuming an ISP can determine what domain name a
| customer is sending in a Host header or ClientHello
| packet, it would still be necessary to subpoena the
| archive/CDN/cache to figure out precisely what pages were
| being requested.
|
| 1. The same party is controlling all the server
| certificates. IA controls the certificates for all IA
| domains, Amazon (issues and) controls all the
| certificates for Cloudfront customers and Google controls
| all the certificates for Google domains. Perhaps there
| are web users commenting on HN who believe that
| ingress/egress traffic for a site saved/hosted/cached at an
| archive/CDN/cache is somehow private as against the
| company running the archive/CDN/cache in a meaningful
| way. I am not one of them.
|
| As for the question of an ISP modifying the contents of
| web pages, this is an issue that could be addressed
| contractually in a subscriber agreement. It stands to
| reason that if this was a serious issue and not merely a
| hypothetical one raised by nerds debating the merits of
| TLS then it would be addressed in such agreements.
|
| As for the "injection of advertising" issue as an argument
| in favour of the way TLS^2 is being administered on the
| web, IMO this is a bit silly since (a) it is trivial to
| filter such advertising (e.g., Javascript in the examples
| I saw) out of the page and/or block it from
| running/connecting/loading and (b) the amount of "tech"
| company-mediated advertising that web users endure in
| spite of using TLS is enormous. More likely than being
| seen as a threat to web users, the injection of
| advertising by ISPs was seen as a threat to the
| advertising revenue of "tech" companies. The latter are
| responsible for facilitating the injection of advertising
| (by their customers, not their competitors, i.e., ISPs),
| not preventing it.
|
| 2. By "TLS administration" I do not mean encryption as a
| concept nor certificates as a concept. I mean TLS
| administration measures designed to support "tech"
| companies first and web users second, if at all. A system
| where the questions of "threat model" and "trust" are
| both decided by "tech" companies not users.
| robonerd wrote:
| > _For a text-only blog_
|
| If somebody MITMs it, they can serve you anything they
| want.
| enriquto wrote:
| > they can serve you anything they want.
|
| Great. More books!
|
| No really, I don't understand this argument. A static
| site served by plain http is perfectly appropriate. It's
| like a poster hanging on the wall for all to see. Of
| course people can paint over it, but it doesn't really
| matter.
| robonerd wrote:
| They could serve you javascript that exploits your
| browser. At the very least, they could replace that
| bitcoin donation address with their own. That's a
| tempting target if nothing else.
| SquareWheel wrote:
| And "they" isn't just your ISP. It's also that free wifi
| hotspot you connected to, or the hotel service, or your
| company's network. Even if you trust your ISP (and you
| probably shouldn't), there are other bad actors to be
| aware of.
| criddell wrote:
| HTTP connections can be used as a weapon against others.
| One example is China's Great Cannon.
|
| https://citizenlab.ca/2015/04/chinas-great-cannon/
| kenniskrag wrote:
| only the destination IP. TLS encryption is inside tcp and
| around the http protocol.
| Nextgrid wrote:
| Until ESNI becomes mainstream (and browsers offer the
| ability to enforce it), the domain name is also sent out
| in plaintext.
| syntheticcorp wrote:
| ESNI has been dropped; a new spec alters how it works and
| renames it Encrypted Client Hello (ECH).
|
| https://blog.mozilla.org/security/2021/01/07/encrypted-clien...
| btdmaster wrote:
| ECH looks quite interesting, but isn't it quite easy to
| do a reverse DNS lookup for most domains?
| notriddle wrote:
| The same thing you always get. Assurance that your free Wi-Fi
| hotspot isn't tampering with the page.
| krick wrote:
| This, and also a MITM (like an ISP) needs to make their own
| request to this site, to know what I read. And,
| technically, they cannot really be sure it's what I've
| read, since nothing says that this site is static.
|
| I'm not that offended, and torrents are only available via
| TOR anyway, but I do actually appreciate the sentiment.
| There's no reason not to be using TLS.
| uniqueuid wrote:
| It's really funny to think about how the advances of technology
| keep changing how we perceive books.
|
| 7TB even fits on a commodity disk these days. And it's a lot less than
| the torrent of scientific papers that floated around some time
| ago (that was ~18TB IIRC).
| nonrandomstring wrote:
| I foresee storage density reaching the point that for most
| ordinary people "online" becomes rather unimportant. What would
| be the effects of technology when computers behave as in early
| science fiction, as stand-alone oracles? [1]
|
| [1]
| https://www.timeshighereducation.com/opinion/2048-informatio...
| themodelplumber wrote:
| How would the appeal of streamers and live data/content
| settle out in that case? Sometimes context is available in
| the moment that makes it easier for all parties to consume
| and analyze in that moment as well.
|
| Since transient, ethereal meme culture is also basically
| emergent culture now, it's difficult not to also foresee a
| greater cultural divide in such a case. This is saying
| nothing of live data tools as well, even weather data...
| ars wrote:
| It's 7TB compressed. If it's text you'd need about 70TB to
| decompress it. It's probably mostly images though, so probably
| not quite that bad.
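| The ratio is easy to measure rather than guess. A minimal sketch
| with Python's zlib follows. The sample string is an assumption
| chosen for illustration and is highly repetitive, so it compresses
| far better than real prose would, and already-compressed images
| would barely shrink at all.

```python
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return uncompressed size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, level))


# Illustrative sample only: one sentence repeated many times.
sample = ("It was the best of times, it was the worst of times. " * 200).encode("utf-8")
ratio = compression_ratio(sample)
print(f"{len(sample)} bytes of sample text compress {ratio:.1f}x")
```

| Running the same function over a few actual book files would give
| a far better basis for the estimate than this sample does.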
| emj wrote:
| I've tried to do lossy compression of epubs with some lines
| of bash scripts; i.e. removing the images and fonts that were
| not needed. Many epubs could be downsized to a third of their
| size, but then I found a book that needed the supplied fonts
| and gave up. When doing lossy compression, you cannot have those
| kinds of bugs.
|
| What I also found was that many of the images in the epubs
| were already unusable and nothing like their counterparts
| in physical books.
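| The comment used bash; a rough Python equivalent of the same idea
| is sketched below, using only the standard zipfile module. The
| suffix list and the demo archive are assumptions for illustration.
| It does not rewrite the EPUB manifest, so a book that actually
| depends on its fonts or images will break in exactly the way
| described above.

```python
import io
import zipfile

# Assumed list of image and font suffixes to drop.
STRIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".ttf", ".otf", ".woff")


def strip_media(epub_bytes: bytes) -> bytes:
    """Copy an EPUB (a zip archive), skipping image and font entries."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w") as dst:
        for info in src.infolist():
            if info.filename.lower().endswith(STRIP_SUFFIXES):
                continue
            # EPUB expects the "mimetype" entry to be stored uncompressed.
            method = (zipfile.ZIP_STORED if info.filename == "mimetype"
                      else zipfile.ZIP_DEFLATED)
            dst.writestr(info.filename, src.read(info.filename),
                         compress_type=method)
    return out_buf.getvalue()


def make_demo_epub() -> bytes:
    """Build a tiny, hypothetical EPUB-like archive for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip")
        z.writestr("OEBPS/chapter1.xhtml", "<p>text</p>")
        z.writestr("OEBPS/cover.jpg", b"\xff\xd8" + b"\x00" * 1000)
        z.writestr("OEBPS/fonts/body.ttf", b"\x00" * 1000)
    return buf.getvalue()


stripped = strip_media(make_demo_epub())
remaining = zipfile.ZipFile(io.BytesIO(stripped)).namelist()
```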
| solarmist wrote:
| I don't understand this. Are they epubs of comics or
| something? Epubs are already compressed (zip).
| [deleted]
| robonerd wrote:
| It's not terribly uncommon to find an epub with several
| megabytes of cover art and a few hundred kilobytes of
| text.
| generalizations wrote:
| I wonder if there are any search engines dedicated to indexing
| these kinds of libraries. I know there's a decent one just for
| scihub, but it would be awesome if I could do a Google-style
| search that returned the contents of books, magazines and journal
| articles instead of just websites.
| delusional wrote:
| Wasn't that what google books was supposed to be?
| voisin wrote:
| I assume Google abandoned this along with all their earlier
| mission statements in favour of building another chat app.
| bbarnett wrote:
| I can't imagine a life at google. So much promise, all
| turned to ash.
| lupire wrote:
| Could you take a moment to check if Google Books search
| still exists?
|
| I'll give you a hint: https://books.google.com/?hl=en
|
| > Search the world's most comprehensive index of full-text
| books.
| voisin wrote:
| I know it exists, but it has appeared to languish for
| years. They rely on third parties now for inclusion of
| books, whereas in the early days they innovated on their
| own with specialized scanning technology. They seemed
| quite proud of it a decade ago. When was the last time
| Google has touted their books project? Have they even
| integrated searching books into their main search (which
| was supposed to catalogue and make searchable all the
| world's information)?
|
| It hasn't been killed but it is clearly a zombie.
| londons_explore wrote:
| It was paralyzed by legal disputes with book publishers.
|
| In the years the lawsuits were going on, nearly everyone
| left the project. And then the lawyers have put in so
| many red lines that it's nearly impossible to make any
| changes to it.
| zozbot234 wrote:
| Book metadata is widely available via sites like e.g. Open
| Library. With good metadata, full text search is not as
| relevant.
| sacrosanct wrote:
| There is the Imperial Library of Trantor: https://trantor.is/
|
| They offer a clearnet site and a .onion hidden service in case you
| don't want ISPs blocking access to it.
| lupire wrote:
| books.google.com
| 2OEH8eoCRo0 wrote:
| Your movie recommendations suck, ValentineC ;)
| ars wrote:
| 7TB of compressed text? I don't think humanity has generated that
| many written words in its entire existence. Although it would be
| an interesting Fermi Problem to estimate (and don't forget just
| how well text compresses).
|
| This has to be a lot of duplicates or bad formats (images). This
| would be far more useful to people with some curating.
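| The Fermi estimate can be sketched in a few lines. Only the 7 TB
| figure comes from the thread; the per-book sizes are loudly
| labeled assumptions picked for illustration, not measurements of
| this collection.

```python
def books_that_fit(archive_bytes: float, avg_book_bytes: float) -> int:
    """How many books of a given average size fit in the archive."""
    return round(archive_bytes / avg_book_bytes)


ARCHIVE_BYTES = 7e12               # 7 TB from the title, decimal units assumed
ASSUMED_TEXT_BOOK_BYTES = 500e3    # assumption: compressed text-only book
ASSUMED_SCANNED_PDF_BYTES = 20e6   # assumption: scanned, image-heavy PDF

text_books = books_that_fit(ARCHIVE_BYTES, ASSUMED_TEXT_BOOK_BYTES)
pdf_books = books_that_fit(ARCHIVE_BYTES, ASSUMED_SCANNED_PDF_BYTES)
print(f"text-only: {text_books:,} books, scanned PDFs: {pdf_books:,} books")
```

| Under these assumptions the answer swings by a factor of 40
| depending on format, which is why the file types matter more than
| the headline size.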
| khazhoux wrote:
| These are most likely PDFs, not plaintext.
| [deleted]
| Larrikin wrote:
| Is there an index of what's included? 7 TB is a lot to ask for
| simply upholding an ideal.
| robonerd wrote:
| It apparently comes with an index in the form of a MySQL
| database that contains title, author, description, and
| filetype.
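| A sketch of what querying such an index might look like. This
| uses SQLite from Python's standard library as a stand-in for
| MySQL, and the table name, column names and rows are all
| hypothetical, based only on the four fields mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real dump's layout is not known here.
conn.execute(
    "CREATE TABLE books (title TEXT, author TEXT, description TEXT, filetype TEXT)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("Example Title A", "Author One", "A placeholder description.", "epub"),
        ("Example Title B", "Author Two", "Another placeholder.", "pdf"),
        ("Example Title C", "Author One", "Yet another placeholder.", "pdf"),
    ],
)


def count_by_filetype(connection: sqlite3.Connection) -> dict:
    """Count indexed books per file type."""
    rows = connection.execute(
        "SELECT filetype, COUNT(*) FROM books GROUP BY filetype ORDER BY filetype"
    )
    return dict(rows.fetchall())


counts = count_by_filetype(conn)
```

| A breakdown like this would also answer the question of whether
| the collection is mostly PDFs or mostly text formats.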
| robonerd wrote:
| > _We will release the data in stages, as we are still processing
| the files. Right now the metadata file and a few of the torrents
| are available. Note that the torrent files are only available
| through our TOR mirror._
|
| Presently, only the first four of several dozen parts are
| available.
| dkjaudyeqooe wrote:
| To be fair, Z-Library doesn't charge unless you want to download
| more than 10 books per 24 hour period. That's per account and
| although they ask you not to open multiple accounts they don't
| seem to do anything to stop you.
| PostOnce wrote:
| What kind of fairness can there be in charging for stolen
| books?
|
| I believe in free access to education, but charging for these
| books they have no rights to is a whole other thing.
| mdp2021 wrote:
| Running costs.
| Sparkle-san wrote:
| It would depend on how they use the funds. I wouldn't be
| surprised if bandwidth expenses made up a majority of what it
| cost to run Z-Library and that money has to come from
| somewhere.
| II2II wrote:
| At this point it is worth noting that there are reputable
| sources for free books, such as public libraries and Project
| Gutenberg.
|
| I realize that neither source will satisfy many of the people
| on HN, simply because there is a need for current technical
| books.
| fancyfredbot wrote:
| You don't have to pay them if you go steal them yourself. If
| you find it more convenient to pay another thief to do it for
| you then I don't think that's significantly less fair.
| UmbertoNoEco wrote:
| All those India/Malaysia/Guatemala kids stealing a 200 USD
| PDE book from Pearson should go to jail... in America...
| Assange style
| KMnO4 wrote:
| They're not charging for the books; they're charging for the
| bandwidth.
| dkjaudyeqooe wrote:
| The "to be fair" is against the claim in the article that
| they charge for books.
|
| It's a figure of speech, not a comment on what they do.
| jl6 wrote:
| Yeah, grey hat shenanigans easily become black hat as soon as
| money is involved.
| mgaunard wrote:
| Technically, they're not charging for the books, they're
| charging for the bandwidth.
| krick wrote:
| I don't quite agree. I mean, they provide useful service, and
| it costs money to run it. It's ok that they earn (even if
| it's actually making a profit, not just covering the costs).
|
| That being said, 10 downloads/day feels a bit restrictive to
| me. I'd get it if it was 100, or 50, heck, maybe even 20. I
| mean, I don't appreciate that it's not mirrorable in the
| first place, but maybe they cannot afford it, I don't know...
| But 10 feels less than somebody researching a new topic might
| need to access in a day, even if he won't read them all
| immediately.
|
| ...That being said as well, it has some really nice UI. I
| wish somebody did it for Libgen.
| lupire wrote:
| You read more than 10 books per day?
| krick wrote:
| Read -- no, I don't. Download to skim and see the
| contents -- yes (even though I don't do it everyday,
| obviously). In fact, I rarely download less than 4 books
| at once, except for occasions when it's a new book of my
| favorite writer (in which case I can as well just buy
| it). Instead, there is some topic, some reason why I need
| these books, and I somehow can gather a dozen of
| recommendations, maybe more, then I need to actually get
| a look inside of them, to see what I'll be reading (if
| anything). It also happens that I kinda know the book,
| but not precisely enough, because some authors really
| like to milk the topic by publishing 5 books kinda the
| same as the first successful one, and if they are
| technical they can have 5 revisions each. I may not read
| them at all, or I may be reading them during the whole
| next year, but I'll need to get them all at once at
| first.
|
| And if we also count papers, which this site provides too
| -- easily.
| swayvil wrote:
| >What kind of fairness...?
|
| Well they are providing a great public service and their
| system takes money to maintain. It's just a small fee.
___________________________________________________________________
(page generated 2022-07-03 23:00 UTC)