[HN Gopher] LibGen's Bloat Problem
___________________________________________________________________
LibGen's Bloat Problem
Author : liberalgeneral
Score : 226 points
Date : 2022-08-21 12:22 UTC (10 hours ago)
(HTM) web link (liberalgeneral.neocities.org)
(TXT) w3m dump (liberalgeneral.neocities.org)
| idealmedtech wrote:
| I'd love to see that distribution at the end with a log-axis for
| the file size! Or maybe even log-log, depending. Gives a much
| better sense of "shape" when working with these sorts of
| exponential distributions.
| Retr0id wrote:
| In an ideal world, every book could be given an "importance"
| score, for some arbitrary value of importance. For example, how
| often it is cited. This could be customised on a per-user basis,
| depending on which subjects and time periods you're interested
| in.
|
| Then you can specify your disk size, and solve the knapsack
| problem to figure out the optimal subset of files that you should
| store.
|
| Edit: Curious to see this being downvoted. Is it really that bad
| of an idea? Or just off-topic?
| ssivark wrote:
| Seems like a perfectly good idea to me! Basically you're
| proposing that we decide caching by some score, and the details
| of the score function can be tweaked to handle the different
| aspects we care about.
|
| I wonder whether this idea is already used for locating data in
| distributed systems -- from clusters all the way to something
| like IPFS.
| DiggyJohnson wrote:
| Not to sound blunt, but to answer your question about the
| downvotes (which you probably didn't deserve, especially
| without a reply).
|
| The concept of an importance score feels very centralized and
| against the federated / free nature of the site. Towards what
| end?
|
| If the "importance score" impacts curation, I am strongly
| against it. Not only is it icky, but how is it different than a
| function of popularity?
| Retr0id wrote:
| I'm not suggesting reducing the size of the LibGen
| collection, I'm thinking along the lines of "I have 2TB of
| disk space spare, and I want to fill it with as much
| culturally-relevant information as possible".
|
| If the entire collection were available as a torrent (maybe it
| already is?), I could select which files I wish to download,
| and then seed.
|
| Those who have 52TB to spare would of course aim to store
| everything, but most people don't.
|
| Just as the proposal in the OP would result in the remaining
| 32.59 TB of data being less well replicated, my approach has
| the problem that less "popular" files would be poorly
| replicated, but you could solve that by _also_ selecting some
| files at random. (e.g. 1.5TB chosen algorithmically, 0.5TB
| chosen at random).
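|
| A rough sketch of that selection step in Python (the field
| names are hypothetical; it assumes a metadata dump that gives
| each file's size in bytes and some precomputed "importance"
| score):
|
|     import random
|
|     def pick_files(files, budget_bytes, random_fraction=0.25):
|         """files: dicts like {"md5", "size", "score"}; sizes in bytes."""
|         scored_budget = budget_bytes * (1 - random_fraction)
|         chosen, used = [], 0
|
|         # Greedy knapsack approximation: best score per byte first.
|         def density(f):
|             return f["score"] / max(f["size"], 1)
|
|         for f in sorted(files, key=density, reverse=True):
|             if used + f["size"] <= scored_budget:
|                 chosen.append(f)
|                 used += f["size"]
|
|         # Spend the rest at random so unpopular files still get
|         # some replication.
|         picked = {f["md5"] for f in chosen}
|         rest = [f for f in files if f["md5"] not in picked]
|         random.shuffle(rest)
|         for f in rest:
|             if used + f["size"] <= budget_bytes:
|                 chosen.append(f)
|                 used += f["size"]
|         return chosen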
| liberalgeneral wrote:
| I don't think you deserved the downvotes, and I don't
| think it's a bad idea either; indeed some coordination as
| to how to seed the collection is really needed.
|
| For instance, phillm.net maintains a dynamically updated
| list of LibGen and Sci-Hub torrents with fewer than 3
| seeders so that people can pick some at random and start
| seeding: https://phillm.net/libgen-seeds-needed.php
| [deleted]
| wishfish wrote:
| Has anyone ever stumbled across an executable on LibGen? The
| article mentioned finding them but I've never seen one.
|
| I agree with the other comments that LibGen shouldn't purge the
| larger books. But, in terms of mirrors, it would be nice to have
| a slimmed down archive I could torrent. 19 TB would be
| manageable. And would be nice to have a local copy of most of the
| books.
| liberalgeneral wrote:
| > Has anyone ever stumbled across an executable on LibGen? The
| article mentioned finding them but I've never seen one.
|
| Here is a list of .exe files in LibGen:
| https://paste.debian.net/hidden/1c82739a/
|
| And a breakdown of file extensions:
| https://paste.debian.net/hidden/579e319c/
|
| > And would be nice to have a local copy of most of the books.
|
| Yes! That was my intention--I wasn't advocating for a purge of
| content but a leaner and more practical version would be
| amazing.
| marcosdumay wrote:
| So, 1000 exes and 500 isos (that may be problematic, but most
| probably aren't). Everything else seems to be what one would
| expect.
|
| That's way cleaner than I could possibly expect. Do people
| manually review suspect files?
| macintux wrote:
| > Yes! That was my intention--I wasn't advocating for a purge
| of content but a leaner and more practical version would be
| amazing.
|
| Your piece doesn't make that obvious at all, and given how
| many people here are misunderstanding that point, you might
| want to update it.
| liberalgeneral wrote:
| You are right, added a paragraph at the end.
| wishfish wrote:
| Thanks for the lists. I was genuinely curious about the exes.
| Nice to know where they originate. Interesting that over half
| of them have titles in Cyrillic. I guess not so many English
| language textbooks (with included CDs) have been uploaded
| with the data portion intact.
| pnw wrote:
| You can search by extension - there's a lot of .exe files,
| mostly Russian AFAIK.
| hoppyhoppy2 wrote:
| I saw a book on antenna design on libgen that originally
| included a CD with software, and that disk image had been
| uploaded to the site.
| johndough wrote:
| That graph of file size vs. number of files would be much easier
| to read if it were logarithmic. I guess OP is using matplotlib.
| In this case, use plt.loglog instead of plt.plot. Also, consider
| plt.savefig("chart.svg") instead of png.
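|
| A minimal sketch, with placeholder numbers standing in for the
| real (file size, file count) data:
|
|     import matplotlib.pyplot as plt
|
|     # Placeholder data only, not the actual LibGen distribution.
|     sizes_mib = [1, 2, 5, 10, 30, 100, 300]
|     counts = [900_000, 700_000, 400_000, 150_000, 40_000, 5_000, 400]
|
|     plt.loglog(sizes_mib, counts)
|     plt.xlabel("File size (MiB)")
|     plt.ylabel("Number of files")
|     plt.savefig("chart.svg")  # vector output stays crisp when zoomed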
| liberalgeneral wrote:
| Here is the raw data if you are interested:
| https://paste.debian.net/hidden/77876d00/
| johndough wrote:
| Thanks. Here is a logarithmic plot as SVG:
| https://files.catbox.moe/zbf35r.svg
|
| On second thought, a logarithmic histogram might convey
| even more information, but that would require all the
| individual file sizes in order to compute the bins.
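|
| Something like this, with synthetic sizes standing in for the
| real per-file data:
|
|     import numpy as np
|     import matplotlib.pyplot as plt
|
|     # Synthetic stand-in for the per-file sizes (in MiB).
|     file_sizes = np.random.lognormal(mean=1.5, sigma=1.2, size=10_000)
|
|     bins = np.logspace(np.log10(file_sizes.min()),
|                        np.log10(file_sizes.max()), 50)
|     plt.hist(file_sizes, bins=bins)
|     plt.xscale("log")
|     plt.xlabel("File size (MiB)")
|     plt.ylabel("Number of files")
|     plt.savefig("histogram.svg")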
| kqr wrote:
| Huh, this distribution is not the power law I would have
| expected. Maybe because it's limited to one media type
| (books)?
| cbarrick wrote:
| This is a complete nit, but s/an utopia/a utopia/
|
| Even though "utopia" is spelled starting with a vowel, it is
| pronounced as /ju:'toUpi@/, like "yoo-TOH-pee-@", with a
| consonant sound at the start.
|
| Since the word starts with a consonant sound, the proper
| indefinite article is "a".
| kevin_thibedeau wrote:
| Now you have to convince the intelligentsia to use the
| proper article with "history".
| boarush wrote:
| I don't think OP takes into account that there are often
| multiple editions of the same book that people need to refer
| to. Not everyone wants the latest edition when the class
| you're in is using some old edition.
| liberalgeneral wrote:
| If you are referring to my duplication comments, sure (but even
| then I believe there are duplicates of the exact same edition
| of the same book). Though the filtering by filesize is
| orthogonal to editions etc., so it has nothing to do with that.
| xenr1k wrote:
| I agree. There are duplicates. I have seen them.
|
| I have found the same book as multiple PDFs of different sizes,
| with the same content. Maybe someone uploaded a poorly scanned
| PDF when the book was first released and someone else later
| uploaded an OCRed version, but the first one just stayed,
| hogging a large amount of storage.
| MichaelCollins wrote:
| How do you automate the process of figuring out which
| version is better? It's not safe to assume the smaller
| versions are always better, nor the inverse. Particularly
| for books with images, one version of the book may have
| passable image quality while the other compressed the
| images to jpeg mush. And there are considerations that are
| difficult to judge quantitatively, like the quality of
| formatting. Even something seemingly simple like testing
| whether a book's TOC is linked correctly entails a huge
| rat's nest of heuristics and guesswork.
| macintux wrote:
| I don't think anyone is arguing it can be fully
| automated, but automating the selection of books to
| manually review is certainly viable.
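|
| A crude sketch of that triage step in Python (the field names
| are made up, not the actual LibGen schema; the idea is just to
| group by a rough key and hand the suspicious groups to a human):
|
|     from collections import defaultdict
|
|     def flag_for_review(records):
|         """records: dicts with title/author/edition keys (made up)."""
|         groups = defaultdict(list)
|         for r in records:
|             key = (r["title"].strip().lower(),
|                    r["author"].strip().lower(),
|                    r.get("edition", ""))
|             groups[key].append(r)
|         # Only groups with more than one file need a human look.
|         return {k: v for k, v in groups.items() if len(v) > 1}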
| rinze wrote:
| As the previous reply said, I've also seen duplicates while
| browsing. Would it be possible to let users flag duplicates
| somehow? It involves human unreliability, which is like
| automated unreliability, only different.
| generationP wrote:
| In practice, it's more often the same file with minor edits
| such as a PDF table of contents added or page numbers
| corrected. Say, how many distinct editions of this standard
| text on elementary algebraic geometry are in the following
| list?
|
| http://libgen.rs/search.php?req=cox+little+o%27shea+ideals&o...
|
| Fun fact: the newest one (the 2018 corrected version of the
| 2015 fourth edition) is not among them.
| ZeroGravitas wrote:
| I notice they have a place to store the OpenLibrary ID,
| though I've not seen one filled in as yet.
|
| OpenLibrary provides both Work and Edition ids, which helps
| connect different versions.
|
| Their database is not perfect either, but it might make more
| sense to keep the bibliographic data separate from the
| copyright contents anyway.
|
| https://openlibrary.org/works/OL1849157W/Ideals_varieties_an.
| ..
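|
| As a sketch of what that connection could look like (using the
| public editions endpoint; the fields printed here are just the
| ones I'd expect to be useful):
|
|     import requests
|
|     work_id = "OL1849157W"  # the work linked above
|     url = f"https://openlibrary.org/works/{work_id}/editions.json"
|     resp = requests.get(url, timeout=30)
|     for ed in resp.json().get("entries", []):
|         # Each edition record carries its own identifiers.
|         print(ed.get("key"), ed.get("publish_date"), ed.get("isbn_13"))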
| boarush wrote:
| I like to think that LibGen also serves as a historical
| database wherein there is a record that a book of a specific
| edition had its errors corrected. (Although it would be
| better if errata could be appended to the same file if
| possible)
|
| Yes, for very minor edits, those files should obviously not
| exist, but for that there would need to be someone who
| verifies this, which is such an enormous task that likely no
| one would take it up.
| keepquestioning wrote:
| Can we put LibGen on the blockchain?
| FabHK wrote:
| In case that was not a joke:
|
| No. LibGen is not a trivial amount of data (it fills a few hard
| disks). A blockchain can only handle tiny amounts of data, very
| slowly.
| rolling_robot wrote:
| The graph should be in logarithmic scale to be readable,
| actually.
| remram wrote:
| https://news.ycombinator.com/item?id=32540202
| bagrow wrote:
| > by filtering any "books" (rather, files) that are larger than
| 30 MiB we can reduce the total size of the collection from 51.50
| TB to 18.91 TB
|
| I can see problems with a hard cutoff in file size. A long
| architectural or graphic design textbook could be much larger
| than that, for instance.
| mananaysiempre wrote:
| While it's a bit of an extreme case, the file for a single
| 15-page article on Monte Carlo noise in rendering[1] is over
| 50M (as noise should specifically not be compressed out of the
| pictures).
|
| [1] https://dl.acm.org/doi/10.1145/3414685.3417881
| TigeriusKirk wrote:
| I was just checking my PDFs over 30M because of this post and
| was surprised to see the DALL-E 2 paper is 41.9M for 27
| pages. Lots of images, of course; it was just surprising to
| see it clock in alongside a group of full textbooks.
| elteto wrote:
| If I remember correctly, images in PDFs can be stored at full
| resolution but are then rendered at their final size, which in
| double-column research papers more often than not ends up
| being tiny.
| _Algernon_ wrote:
| My main issue with libgen is its awful search. Can't search by
| multiple criteria, shitty fuzzy search, and can't filter by file
| type.
| MichaelCollins wrote:
| Have you ever used a card catalogue?
| _Algernon_ wrote:
| No. Your point being?
| MichaelCollins wrote:
| In that case your expectations are understandable. People
| in your generation are accustomed to finding anything in
| mere seconds. Not very long ago, if it took you a few
| minutes to find a book in the catalogue you would count
| yourself lucky. And if your local library didn't have the
| book you're looking for, you could spend weeks waiting for
| the book to arrive from another library in the system.
|
| Libgen's search certainly isn't as good as it could be, but
| it's _more_ than good enough. If you can't bear spending a
| few minutes searching for a book, can you even claim to
| want that book in the first place? It's hard for me to even
| imagine being in such a rush that a few minutes searching
| in a library is too much to tolerate. But then again, I
| wasn't raised with the expectations of your generation.
| liberalgeneral wrote:
| Z-Library has been innovating a great deal in that regard.
| Sadly they are not as open/sharing as LibGen mirrors in giving
| back to the community (in terms of database dumps, torrents,
| and source code).
| Invictus0 wrote:
| Storage space is not a problem, especially not on the order of
| terabytes. If you want to download all of libgen on a cheap
| drive, perhaps limit yourself to epub files only. No one needs
| all of libgen anyway except archivists and data hoarders.
| liberalgeneral wrote:
| https://news.ycombinator.com/item?id=32540854
| Invictus0 wrote:
| Yes, that makes you a data hoarder. Normal people would just
| use one of the many other methods of getting free books, like
| legal libraries, googling it on Yandex, torrents, asking a
| friend, etc. Or just actually pay for a book.
| liberalgeneral wrote:
| My target audience is not normal people though, and I don't
| mean this in the "edgy" sense. The fact that we are having
| this discussion is very abnormal to begin with, and I think
| it's great that there are some deviants from the norm who
| care about the longevity of such projects.
|
| I can imagine many students and researchers hosting a
| mirror of LibGen for their fellows for example.
| Invictus0 wrote:
| In that case, just pay whatever it costs to store the
| data. With AWS glacier it would cost $50 a month.
| [deleted]
| RcouF1uZ4gsC wrote:
| > I chose 30 MiB somewhat arbitrarily based on my personal e-book
| library, thinking "30 MiB ought to be enough for anyone"
|
| There are books on art and photography and pathology that have
| multiple high resolution photographs.
|
| I don't think limiting by file size is a good idea.
| c-fe wrote:
| This is a bit anecdotal, but I did upload a book to libgen. I am
| an avid user of the site, and during my thesis research I was
| looking for a specific book and could not find it on there. I did
| however find it on archive.org. I spent the better part of one
| afternoon extracting the book from archive.org with some Adobe
| software, since I had to circumvent some DRM and other things,
| and all of this was also novel to me. In the end I got a scanned
| PDF, which was several hundred MB. I managed to reduce it to 47
| MB, but further reduction was not easily possible, at least
| not with the means I knew or had at my disposal. I uploaded this
| version to libgen.
|
| I do agree that there may be some large files on there, however I
| don't agree with removing them. I spent some hours to put this
| book on there so others who need it can access it within seconds.
| Removing it because it is too large would void all this effort
| and require future users to go through a similar process to the
| one I did just to browse through the book.
|
| Also any book published today is most likely available in some
| ebook format, which is much smaller in size, so I don't think that
| the size of libgen will continue to grow at the same pace as it
| is doing now.
| culi wrote:
| I've always wanted to contribute to LibGen. It got me through
| college and has powered my Wikipedia editing hobby.
|
| Are there any good guides out there for best practices for
| minimizing files, scanning books, etc?
| generationP wrote:
| There's a bunch. Here's what I do (for black-and-white text;
| I'm not sure how to deal with more complex scenarios):
|
| Scan with 600dpi resolution. Nevermind that this gives huge
| output files; you'll compress them to something much smaller
| at the end, and the better your resolution, the stronger
| compression you can use without losing readability.
|
| While scanning, periodically clean the camera or the scanner
| screen, to avoid speckles of dirt on the scan.
|
| The ideal output formats are TIF and PNG; use them if your
| scanner allows. PDF is also fine (you'll then have to extract
| the pages into TIF using pdfimages or using ScanKromsator).
| Use JPG only as a last resort, if nothing else works.
|
| Once you have TIF, PNG or JPG files, put them into a folder.
| Make sure that the files are sorted correctly: IIRC, the
| numbers in their names should match their order (i.e.,
| blob030 must be an earlier page than blah045; it doesn't
| matter whether the numbers are contiguous or what the non-
| numerical characters are). (I use the shell command mmv for
| convenient renaming.)
|
| Import this folder into ScanTailor (
| https://github.com/4lex4/scantailor-advanced/releases ), save
| the project, and run it through all 6 stages.
|
| Stage 1 (Fix Orientation): Use the arrow buttons to make sure
| all text is upright. Use Q and W to move between pages.
|
| Stage 2 (Split Pages): You can auto-run this using the |>
| button, but you should check that the result is correct. It
| doesn't always detect the page borders correctly. (Again, use
| Q and W to move between pages.)
|
| Stage 3 (Deskew): Auto-run using |>. This is supposed to
| ensure that all text is correctly rotated. If some text is
| still skew, you can detect and fix this later.
|
| Stage 4 (Select Content): This is about cutting out the
| margins. This is the most grueling and boring stage of the
| process. You can auto-run it using |>, but it will often cut
| off too much and you'll have to painstakingly fix it by hand.
| Alternatively (and much more quickly), set "Content Box" to
| "Disable" and manually cut off the most obvious parts without
| trying to save every single pixel. Don't worry: White space
| will not inflate the size of the ultimate file; it compresses
| well. The important thing is to cut off the black/grey parts
| beyond the pages. In this process, you'll often discover
| problems with your scan or with previous stages. You can
| always go back to previous stages to fix them.
|
| Stage 5 (Margins): I auto-run this.
|
| Stage 6 (Output): This is important to get right. The
| despeckling algorithm often breaks formulas (e.g., "..."s get
| misinterpreted as speckles and removed), so I typically
| uncheck "Despeckle" when scanning anything technical (it's
| probably fine for fiction). I also tend to uncheck "Savitzky-
| Golay smoothing" and "Morphological smoothing" for some
| reason; don't remember why (probably they broke something for
| me in some case). The "threshold" slider is important:
| Experiment with it! (Check which value makes a typical page
| of your book look crisp. Be mindful of pages that are paler
| or fatter than others. You can set it for each page
| separately, but most of the time it suffices to find one
| value for the whole book, except perhaps the cover.) Note the
| "Apply To..." buttons; they allow you to promote a setting
| from a single page to the whole book. (Keep in mind that
| there are two -- the second one is for the despeckling
| setting.)
|
| Now look at the tab on the right of the page. You should see
| "Output" as the active one, but you can switch to "Fill
| Zones". This lets you white-out (or black-out) certain
| regions of the page. This is very useful if you see some
| speckles (or stupid write-ins, or other imperfections) that
| need removal. I try not to be perfectionistic: The best way
| to avoid large speckles is by keeping the scanner clean at
| the scanning stage; small ones aren't too big a deal; I often
| avoid this stage unless I _know_ I got something dirty. Some
| kinds of speckles (particularly those that look like
| mathematical symbols) can be confusing in a scan.
|
| There is also a "Picture Zones" tab for graphics and color;
| that's beyond my paygrade.
|
| Auto-run stage 6 again at the end (even if you think you've
| done everything -- it needs to recompile the output TIFFs).
|
| Now, go to the folder where you have saved your project, and
| more precisely to its "out/" subfolder. You should see a
| bunch of .tif files, each one corresponding to a page. Your
| goal is to collect them into one PDF. I usually do this as
| follows:
|
|     tiffcp *.tif ../combined.tif
|     tiff2pdf -o ../combined.pdf ../combined.tif
|     rm -v ../combined.tif
|
| Thus you end up with a PDF in the folder in which your
| project is.
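|
| (An alternative I haven't battle-tested: the img2pdf Python
| package can bundle the page images into a PDF in one go, and it
| claims to avoid re-encoding the images where possible.)
|
|     import glob
|     import img2pdf  # pip install img2pdf
|
|     tiffs = sorted(glob.glob("out/*.tif"))  # ScanTailor output pages
|     with open("combined.pdf", "wb") as f:
|         f.write(img2pdf.convert(tiffs))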
|
| Optional: add OCR to it; add bookmarks for chapters and
| sections; add metadata; correct the page numbering (so that
| page 1 is actual page 1). I use PDF-XChangeLite for this all;
| but use whatever tool you know best.
|
| At that point, your PDF isn't super-compressed (don't know
| how to get those), but it's reasonable (about 10MB per 200
| pages), and usually the quality is almost professional.
|
| Uploading to LibGen... well, I think they've made the UI
| pretty intuitive these days :)
|
| PS. If some of this is out of date or unnecessarily
| complicated, I'd love to hear!
| crazygringo wrote:
| > _At that point, your PDF isn't super-compressed (don't
| know how to get those)_
|
| As far as I know, it's making sure your text-only pages are
| monochrome (not grayscale) and using Group4 compression for
| them, which is actually what fax machines use (!) and is
| optimized specifically for monochrome text. Both TIFF and
| PDF support Group4 -- I use ImageMagick to take a scanned
| input page and run grayscale, contrast, Group4 monochrome
| encoding, and PDF conversion in one fell swoop, which
| generates one PDF per page, and then "pdfunite" to join the
| pages. Works like a charm.
|
| I'm not aware of anything superior to Group4 for regular
| black and white text pages, but would love to know if there
| is.
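|
| Roughly, the pipeline looks like this (filenames made up;
| assumes ImageMagick's convert and poppler's pdfunite are
| installed):
|
|     import glob
|     import subprocess
|
|     pages = sorted(glob.glob("scan_*.png"))
|     pdfs = []
|     for page in pages:
|         out = page.replace(".png", ".pdf")
|         # Grayscale, hard threshold to monochrome, then Group4
|         # (CCITT fax) compression into a one-page PDF.
|         subprocess.run(["convert", page, "-colorspace", "Gray",
|                         "-threshold", "55%", "-compress", "Group4",
|                         out], check=True)
|         pdfs.append(out)
|
|     # Join the per-page PDFs into one file.
|     subprocess.run(["pdfunite", *pdfs, "book.pdf"], check=True)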
| generationP wrote:
| Oh, I should have said that I scan in grayscale, but
| ScanTailor (at stage 6) makes the output monochrome;
| that's what the slider is about (it determines the
| boundary between what will become black and what will
| become white). So this isn't what I'm missing.
|
| I am not sure if the result is G4-compressed, though. Is
| there a quick way to tell?
| liberalgeneral wrote:
| Thank you for your efforts!
|
| To be clear, I am not advocating for the removal of any files
| larger than 30 MiB (or any other arbitrary hard limits). It'd
| be great of course to flag large files for further review, but
| the current software doesn't do a great job at crowdsourcing
| these kinds of tasks (another one being deduplication) sadly.
|
| Given the very small amount of volunteer-power, I'm suggesting
| that a "lean edition" of LibGen can still be immensely useful
| to many people.
| ssivark wrote:
| Files are a very bad unit to elevate in importance, and
| number of files or file size are really bad proxy metrics,
| especially without considering the statistical distribution
| of downloads (let alone the question of what is more
| "important"!). Eg: Junk that's less than the size limit is
| implicitly being valued over good content that happens to be
| larger in size. Textbooks & reference books will likewise get
| filtered out with higher likelihood -- and that would screw
| students in countries where they cannot afford them (which
| might arguably be a more important audience to some, compared
| to those downloading comics). Etc.
|
| After all this, the most likely human response from people
| who really depend on this platform would be to slice a big
| file into volumes under the size limit. That seems like a
| horrible UX downgrade in the medium to long term for no other
| reason than satisfying some arbitrary metric of
| legibility[1].
|
| Here's a different idea -- might it be worthwhile to convert
| the larger files to better-compressed versions, e.g. PDF ->
| DJVU? This would lead to duplication in the medium term,
| but if one sees a convincing pattern that users switch to the
| compressed versions without needing to come back to the
| larger versions, that would imply that the compressed version
| works and the larger version could eventually be garbage
| collected.
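|
| A sketch of that conversion for one file (pdf2djvu is a real
| tool; the filenames and DPI here are made up):
|
|     import subprocess
|
|     subprocess.run(
|         ["pdf2djvu", "--dpi=300", "-o", "book.djvu", "big_scan.pdf"],
|         check=True,
|     )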
|
| Thinking in an even more open-ended manner, if this corpus is
| not growing at a substantial rate, can we just wait out a
| decade or so of storage improvements before this becomes a
| non-issue? How long might it take for storage to become 3x,
| 10x, 30x cheaper?
|
| [1]: https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-
| call...
| didgetmaster wrote:
| > can we just wait out a decade or so of storage
| improvements before this becomes a non-issue?
|
| I'm not sure that there is anything on the horizon which
| would make duplicate data a 'non-issue'. Capacities are
| certainly growing, so within a decade we might see 100TB
| HDDs available and affordable 20TB SSDs. But that does not
| solve the bandwidth issues. It still takes a long, long
| time to transfer all the data.
|
| The fastest HDD is still under 300MB/s which means it takes
| a minimum of 20 hours to read all the data off a 20TB HDD.
| That is if you could somehow get it to read the whole thing
| at the maximum sustained read speed.
|
| SSDs are much faster, but it will always be easier to
| double the capacity than it is to double the speed.
| fragmede wrote:
| The problem isn't the technology, it's the cost. Given a
| far larger budget, you wouldn't run the hard drives at
| anywhere near capacity, in order to gain a read speed
| advantage by running a ton in parallel. That'll let you
| read 20 TB in an hour if you can afford it. Put it this
| way: Netflix is able to do 4k video, and that's far more
| intensive.
| jtbayly wrote:
| Agreed. Deduplication should be the bigger goal, in my opinion.
| CamperBob2 wrote:
| Have to be careful there. A jihad against duplication means
| that poor-quality scans will drive out good ones, or prevent
| them from ever being created. Especially if you're misguided
| enough to optimize for minimum file size.
|
| I agree with samatman's position below: as long as the format
| is the slightest bit lossy -- and it always will be --
| aggressive deduplication has more downsides than upsides.
| willnonya wrote:
| While I'm inclined to agree, the duplicates need to be easily
| identifiable, and preferably filterable by quality, for bulk
| downloads.
| exmadscientist wrote:
| Deduplication doesn't have to mean removal. It might be
| just tagging. It would be very nice to be able to fetch the
| "best filesize" version of the entire collection, then pull
| down the "best quality" editions of only a few things I'm
| particularly interested in.
| signaru wrote:
| Probably only safe in cases where the files in question are
| exactly the same binaries (if binary diffing can be automated
| somehow).
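|
| That exact-match case is easy to automate by hashing; a minimal
| sketch (paths are made up):
|
|     import hashlib
|     from collections import defaultdict
|     from pathlib import Path
|
|     def file_digest(path, chunk=1 << 20):
|         h = hashlib.sha256()
|         with open(path, "rb") as f:
|             while block := f.read(chunk):
|                 h.update(block)
|         return h.hexdigest()
|
|     def exact_duplicates(root):
|         """Group files that are byte-for-byte identical."""
|         by_hash = defaultdict(list)
|         for p in Path(root).rglob("*"):
|             if p.is_file():
|                 by_hash[file_digest(p)].append(p)
|         return [v for v in by_hash.values() if len(v) > 1]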
| DiggyJohnson wrote:
| Even then, I wouldn't want a file with text + illustrations
| to be considered a dupe of a text-only copy of the same work.
| ajsnigrutin wrote:
| Plus there are a lot of books where one version is a high
| quality scan but has no OCR, and the other is an OCRed scan
| (with a bunch of errors, but searching works 80% of the time)
| with horrible scan quality.
|
| Also, some books include appendices that are scanned in
| some versions but not in others, plus large posters that
| are shrunk to A4 size in one version, split onto multiple
| A4 pages in another, and kept as one huge page in a third
| version.
|
| Then there are zips of books, containing 1 PDF plus e.g.
| example code, libraries, etc. (e.g. programming books).
| samatman wrote:
| IMHO a process which is lossy should never be described as
| deduplication.
|
| What would work out fairly well for this use case is to
| group files by similarity, and compress them with an
| algorithm which can look at all 'editions' of a text.
|
| This should mean that storing a PDF with a (perhaps badly,
| perhaps brilliantly) type-edited version next to it would
| 'weigh' about as much as the original PDF plus a patch.
| duskwuff wrote:
| > IMHO a process which is lossy should never be described
| as deduplication.
|
| Depends. There are going to be some cases where files
| aren't _literally_ duplicates, but the duplicates don't
| add any value -- for example, MOBI conversions of EPUB
| files, or multiple versions of an EPUB with different
| publisher-inserted content (like adding a preview of a
| sequel, or updating an author's bibliography).
| samatman wrote:
| Splitting those into two cases: I think getting rid of
| format conversions (which can, after all, be performed
| again) is worthwhile, but it isn't deduplication; that's
| more like pruning.
|
| Multiple versions of an EPUB with slightly different
| content are exactly the case where a compression algorithm
| with an attention span, and some metadata to work with,
| can get the multiple copies down enough in size that
| there's no point in disposing of the unique parts.
| ad404b8a372f2b9 wrote:
| That's funny, I did the same analysis with sci-hub. Back when
| there was an organized drive to back it up.
|
| I downloaded parts of it and wanted to figure out why it was so
| heavy, seeing as you'd expect articles to be mostly text and very
| light.
|
| There was a similar distribution of file sizes. My immediate
| instinct was also to cut off the tail-end, but looking at the
| larger files I realized it was a whole range of good articles
| that included high quality graphics that were crucial to the
| research being presented, not poor compression or useless bloat.
| dredmorbius wrote:
| It can be illuminating to look at the size of ePub documents.
| This is in general an HTML container (and compressed), such
| that file sizes tend to be quite small. A book-length text
| (~250 pp or more) might be from 0.3 to 5 MB, and often at the
| lower end of the scale.
|
| Books with a large number of images or graphics, however, can
| still bloat to 40-50 MB or even more.
|
| Otherwise, generally, text-based PDFs (as opposed to scans) are
| often in the 2--5 MB range, whilst scans can run 40--400 MB.
| The largest I'm aware of in my own collection is a copy of
| Lyell's _Geography_, sourced from Archive.org. It is of course
| scans of the original 19th century typography. Beautiful to
| read, but a bit on the weighty side.
| liberalgeneral wrote:
| I think Sci-Hub is the opposite, since 1 DOI = 1 PDF in its
| canonical form (straight from the publisher), so neither
| duplication nor low quality is an issue.
| dredmorbius wrote:
| It does depend on when the work was published. Pre-digital
| works scanned in without OCR can be larger in size. That's
| typically works from the 1980s and before.
|
| Given the explosion of scientific publishing, that's likely a
| small fraction of the archive by _work_ though it may be
| significant in terms of _storage_.
| aaron695 wrote:
| > by filtering any "books" (rather, files) that are larger than
| 30 MiB we can reduce the total size of the collection from 51.50
| TB to 18.91 TB, shaving a whopping 32.59 TB
|
| Books greater than 30 MiB are all the textbooks.
|
| You are killing the knowledge.
|
| Also killing a lot of rare things.
|
| If you want to do something amazing and small, OCR them.
|
| As an example of greater than 30 meg, I grabbed a short story by
| Greg Bear the other day that is not available digitally; it was
| in a 90 meg copy of a 1983 Analog Science Fiction and Fact.
|
| Side note: de-duping is an incredibly hard project. How will you
| diff a mobi and an epub and then make a decision? Or decide
| between a mobi and a mobi?
|
| Books also change with time. Even in the 90's, kids books from
| the 60's had been 'edited'. These can be hidden gems to
| collectors. Cover art also.
| gizajob wrote:
| One of my favourite places on the internet too. The thing is, you
| just search for what you want and spend 10 seconds finding the
| right book and link. While I'd love to mirror the whole archive
| locally, it would really be superfluous because I can only read a
| couple of quality books at a time anyway, so building my own
| small archive of annotated PDFs (philosophy is my drug of choice)
| is better than having the whole thing. I think it's actually
| remarkably free of bloat and cruft considering, but maybe I'm not
| trawling the same corners as you are. I do kind of wish they'd
| clear out the mobi and djvu versions and make it unified,
| however.
| sitkack wrote:
| > djvu versions
|
| This would be disastrous for preservation. Often the djvu
| versions have no other digital source, the books are not in
| print and the publisher isn't around. The djvu archives often
| exist specifically because some old book _really_ has and had
| value to people.
| crazygringo wrote:
| Yeah, I always convert DJVU to PDF (pretty easy) but it never
| compresses quite as nicely.
|
| DJVU is pretty clever in how it uses a mask layer for more
| efficient compression, and as far as I know, converting to
| PDF is always done "dumb" -- flattening the DJVU file into a
| single image and then encoding that image traditionally in
| PDF.
|
| I wonder if it's possible to create a "lossless" DJVU to PDF
| converter, or something close to it, if the PDF primitives
| allow it? I'm not sure if they do, if the "mask layer" can be
| efficiently reproduced in PDF.
| sitkack wrote:
| If you smoke enough algebra, you could use the DJVU
| algorithm to implement DJVU in PDF with layers. Or heck you
| could do it in SVG.
| napier wrote:
| Is there a torrent available that would allow straightforward
| setup of a locally storable and accessible LibGen library? For
| the storage rich but internet connection reliability poor,
| something like this would be a godsend.
| mdaniel wrote:
| They have a dedicated page where they offer torrents, so pick
| one of the currently available hostnames:
| https://duckduckgo.com/?q=libgen+torrent&ia=web
|
| Obviously, folks can disagree on the "straightforward" part
| of your comment given the overwhelming number of files we're
| discussing
| liberalgeneral wrote:
| > While I'd love to mirror the whole archive locally, it would
| really be superfluous because I can only read a couple of
| quality books at a time anyway, [...]
|
| I'd love to agree but as a matter of fact LibGen and Sci-Hub
| are (forced to be) "pirates" and they are more vulnerable to
| takedowns than other websites. So while I feel no need to
| maintain a local copy of Wikipedia, since I'm relatively
| certain that it'll be alive in the next decade, I cannot say
| the same about those two with the same certainty (not that I
| think there are any imminent threats to either, just reasoning
| a priori).
| jart wrote:
| Well when a site claims it's for scientific research
| articles, and you search for "Game Of Thrones" and find this:
|
| https://libgen.is/search.php?req=game+of+thrones&lg_topic=li.
| ..
|
| Someone's going to prison eventually, like The Pirate Bay
| founders. It's only a matter of time.
| contingencies wrote:
| First, SciHub != LibGen. They are allied projects that clearly
| share a support base, but they are not identical.
|
| Second, please provide a citation for the assertion that
| sharing copies of printed fiction erodes sales volume. At
| this point, one may assume that anything that helps to sell
| computer games and offline swag is cash-in-bank for content
| producers. Whether original authors get the same royalties
| is an interesting question.
|
| Third, the former Soviet milieu probably isn't currently in
| the mood to cooperate with western law enforcement.
| BossingAround wrote:
| Speaking of mirroring, is there a way to download one big
| "several-hundred-GB" blob with the full content of the sites
| for archival purposes?
|
| Surely that would act as a failsafe to your problem.
| charcircuit wrote:
| I think it's split into several different torrents since
| it's so big.
| scott_siskind wrote:
| Why would they clear out djvu? It's one of the best/most
| efficient storage formats for scanned books.
| nsajko wrote:
| I'm not for clearing out djvu, but it sure is frustrating
| when a PDF isn't available.
|
| It's not just about laziness preventing one from installing
| the more obscure ebook readers which support djvu. It's about
| security: I only trust PDFs when I create them myself with
| TeX or similar, otherwise I need to use the Chromium PDF
| reader to be (relatively) safe. I don't trust the readers
| that support Djvu to be robust enough against maliciously
| malformed djvu files, as I'm guessing the readers are
| implemented in some ancient dialect of C or C++ and I doubt
| they're getting much if any scrutiny in the way of security.
| crazygringo wrote:
| It's super easy to convert a DJVU file to PDF though.
| There's an increase in filesize but it's not the end of the
| world.
|
| And since you're creating the PDF yourself, it seems like you
| can trust it? Nothing malicious could survive the DJVU to PDF
| conversion, since it's just a "dumb" bitmap-based process.
| xdavidliu wrote:
| djvu is really quite a marvellous format, but I'm only able
| to read them on Evince (the default pdf reader that comes
| with Debian, Fedora, and probably a bunch of other distros).
| For my macbook I need to download a Djvu reader, and for my
| ipad, I didn't even bother trying because the experience
| would likely be much worse than Preview / Ibooks.
| eru wrote:
| Apparently you can install Evince on MacOS as well. But I
| haven't tried it there.
|
| Evince doesn't come by default with Archlinux (my desktop
| distribution of choice), but I still install it everywhere.
| nsajko wrote:
| > Evince doesn't come by default with Archlinux (my
| desktop distribution of choice)
|
| This doesn't make sense; nothing comes "by default" on
| Arch, but evince _is_ in the official repos as far as I
| see.
| dredmorbius wrote:
| DJVU is supported by numerous book-reading applications,
| including (in my experience) FB Reader (FS/OSS),
| Pocketbook, and Onyx's Neoreader.
|
| As a format for preserving full native scan views (large,
| but often strongly preferable for visually-significant
| works or preserving original typesetting / typography),
| DJVU is highly useful.
|
| I _do_ wish that it were more widely supported by both
| toolchains and readers. That will come in time, I suspect.
| MichaelCollins wrote:
| Calibre supports djvu on any platform. Deleting djvu books
| just because Microsoft and Apple don't see fit to support
| it by default would be a travesty.
| gizajob wrote:
| My comment about djvu was mostly just about my own laziness,
| because (kill me if you need to) I like using Preview on the
| Mac for reading and annotating, and it doesn't read them, and
| once they have to live in a djvu viewer, I tend not to read
| them or mark them up. Same goes for Adobe Acrobat Reader when
| I'm on Windows on my University's networked PCs.
| repple wrote:
| This book has a great overview of the origins of library genesis.
|
| Shadow Libraries: Access to Knowledge in Global Higher Education
|
| https://libgen.is/search.php?req=shadow+libraries
| gmjoe wrote:
| Honestly, it's not a big problem.
|
| First of all, bloat has nothing to do with file size -- EPUBs
| are often around 2 MB, typeset PDFs are often 2-10 MB (depending
| on quantity of illustrations), and scanned PDFs are anywhere
| from 10 MB (if reduced to black and white) to 100 MB (for color
| scans, like where necessary for full-color illustrations).
|
| The idea of a 30 MB cutoff does nothing to reduce bloat; it just
| removes many of the most essential textbooks. :( Also it's very
| rare to see duplicates of 100 MB PDFs.
|
| Second, file duplication is there, but it's not really an
| unwieldy problem right now. Probably the majority of titles have
| only a single file, many have 2-5 versions, and a tiny minority
| have 10+. But they're often useful variants -- different editions
| (2nd, 3rd, 4th) plus alternate formats like reflowable EPUB vs
| PDF scan. These are all genuinely useful and need to be kept.
|
| Most of the unhelpful duplication I see tends to fall into three
| categories:
|
| 1) There are often 2-3 versions of the identical typeset PDF
| except with a different resolution for the cover page image. That
| one baffles me -- zero idea who uploads the extras or why. My
| best guess is a bot that re-uploads lower-res cover page
| versions? But it's usually like original 2.7 MB becoming 2.3 MB,
| not a big difference. Feels very unnecessary to me.
|
| 2) People (or a bot?) who seem to take EPUBs and produce PDF
| versions. I can understand how that could be done in a helpful
| spirit, but honestly the resulting PDFs are so abysmally ugly
| that I really think people are better off producing their own
| PDFs using e.g. Calibre, with their own desired paper size,
| font, etc. Unless there's no original EPUB/MOBI on the site, PDF
| conversions of them should be discouraged IMHO.
|
| 3) A very small number of titles do genuinely have like 5+
| seemingly identical EPUB versions. These are usually very popular
| bestselling books. I'm totally baffled here as to why this
| happens.
|
| It does seem like it would be a nice feature to be able to leave
| some kind of crowdsourced comments/flags/annotations to help
| future downloaders figure out which version is best for them
| (e.g. is this PDF an original typeset, a scan, or a conversion?
| -- metadata from the uploader is often missing or inaccurate
| here). But for a site that operates on anonymity, it seems like
| this would be too open to abuse/spamming. Being able to delete
| duplicates opens the door to accidental or malicious deleting of
| anything. I'd rather live with the "bloat", it's really not an
| impediment to anything at the moment.
| titoCA321 wrote:
| When you look at movie pirates, there are still uploads of Xvid
| in 2022. Crap goes in as PDF, mobi, epub, txt and comes out as
| PDF, mobi, DOCX, txt.
| agumonkey wrote:
| There are classes of books that are significantly larger than the
| rest, like medical / biology books. I don't know if they embed
| vector-based images of the whole body or maybe hundreds of
| images, but it's surprising how big they are.
|
| Who's in to do some large-scale data gathering about unoptimized
| books and potentially redundant ones? Or maybe trim PDFs (qpdf
| can optimize a PDF's structure to an extent).
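|
| For example (structural optimization only, it does not touch
| the images; the filenames are made up):
|
|     import subprocess
|
|     subprocess.run(
|         ["qpdf", "--object-streams=generate", "--compress-streams=y",
|          "--recompress-flate", "in.pdf", "slimmer.pdf"],
|         check=True,
|     )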
| liberalgeneral wrote:
| Database dumps are available here if you are interested:
| http://libgen.rs/dbdumps/
|
| libgen_compact_* is what you are probably looking for, but they
| are all SQL dumps so you'll need to import them into MySQL
| first. :/
| agumonkey wrote:
| The dumps are not enough; one has to scan the actual file
| content to assess the quality.
|
| Are you alone in your analysis, or are there groups who try to
| improve LG?
| [deleted]
| Synaesthesia wrote:
| >"30 MiB ought to be enough for anyone"
|
| Sometimes you have, e.g., a history book with a lot of high
| quality photos, and then it can be quite large.
| spiffistan wrote:
| I've been dreaming of a book decompiler that would use some
| newfangled AI/ML to produce a perfectly typeset copy of an older
| book; in the same font or similar, recognizing multiple languages
| and scripts within the work.
| copperx wrote:
| In the same vein, I would like an e-reader that has TeX or
| InDesign quality typesetting. I'd settle for Knuth-Plass line
| breaking with decent justification (and hyphenation).
|
| At the very least, make it so that headings do not appear at
| the bottom of a page. Who thought that was OK?
| signaru wrote:
| I have experience scanning personal books, and I also try to
| reduce their size since I'm concerned about bloat on my (older)
| mobile reading devices. Unfortunately, there are reasons I
| cannot upload those, but the procedures might still be helpful
| for existing scans.
|
| Use ScanTailor to clean them up. If there is no need for
| color/grayscale, have the output strictly black and white.
|
| OCR them with Adobe Acrobat ClearScan (or something else, that is
| what I have).
|
| Convert to black and white DJVU (Djvu-Spec).
|
| Dealing with color is another thing, and takes some time. I find
| that using G'MIC's anisotropic smoothing can help with the ink-
| jet/half-tone patterns. But it's too time consuming to be used
| for books.
| pronoiac wrote:
| I like ScanTailor! I've used ocrmypdf for the OCR and
| compression steps. It uses lossless JBIG2 by default, at 2 or
| 3k per page; I'm curious how that compares to DJVU. (And my
| mistake, pdf and DJVU are competing container formats.)
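|
| Something like this via its Python API (filenames made up;
| lossless JBIG2 needs jbig2enc installed):
|
|     import ocrmypdf
|
|     ocrmypdf.ocr(
|         "scanned_input.pdf",
|         "searchable_output.pdf",
|         deskew=True,
|         optimize=1,      # lossless optimizations only
|         language="eng",
|     )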
| signaru wrote:
| If the PDF is from a scanned source, converting it to DJVU
| with equivalent DPI typically results in about half the file
| size (figures can vary depending on the specifics of the PDF
| source).
| powera wrote:
| Curation is hard, particularly for a "community" project.
|
| Every file is there for a reason, and much of the time, even if
| it is a stupid reason, removing it means there is one more person
| opposed to the concept of "curation".
| Hizonner wrote:
| Um, if the goal is to fit what you can onto a 20TB hard drive at
| home, then nobody is stopping you from choosing your own subset,
| as opposed to deleting stuff out of the main archive based on
| ham-handed criteria...
| mjreacher wrote:
| I think one of the problems is the lack of a good open source PDF
| compressor. We have good open source OCR software like ocrmypdf
| which I've seen used before, but some of the best compressed
| books I've seen on libgen used some commercial compressor while
| the open source ones I've used were generally quite lackluster.
| This applies doubly so when people are ripping images from
| another source, combining them into a PDF, then uploading it as
| a high resolution PDF which inevitably ends up being between
| 70 and 370 MB.
|
| How to deal with duplication is also a very difficult problem
| because there are loads of reasons why things could be duplicated.
| Take a textbook: I've seen duplicates which contain either one or
| several of the following: different editions, different printings
| (of any particular edition), added bookmarks/table of contents
| for the file, removed blank white pages, removed front/end cover
| pages, removed introduction/index/copyright/book information
| pages, LaTeX'd copies of pre-TeX textbooks, OCR'd, different
| resolution, other kinds of optimization by software that reduces
| to wildly different file sizes, different file types (eg .chm,
| PDFs that are straight conversions from epub/mobi), etc. Some of
| this can be detected by machines, e.g. usage of OCR, but some of
| other things aren't easy at all to detect.
| crazygringo wrote:
| What commercial compressor/performance are you talking about?
|
| AFAIK the best compression you see is monochrome pages encoded
| in Group4, which the open-source ImageMagick will do, for
| example, and which ocrmypdf happily works on top of.
|
| Otherwise it's just your choice of using underlying JPG, PNG,
| or JPEG 2000, and up to you to set your desired lossy
| compression ratio.
___________________________________________________________________
(page generated 2022-08-21 23:00 UTC)