[HN Gopher] Reducing the Size of Large PDFs
___________________________________________________________________
Reducing the Size of Large PDFs
Author : ingve
Score : 52 points
Date : 2022-01-30 13:45 UTC (1 days ago)
(HTM) web link (leancrew.com)
(TXT) w3m dump (leancrew.com)
| Gys wrote:
| I use this to also reduce the color info by converting to grey:
| gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4
| -dPDFSETTINGS=/ebook -dColorConversionStrategy=/Gray
| -dProcessColorModel=/DeviceGray -dNOPAUSE -dQUIET -dBATCH
| -sOutputFile=file-out.pdf file-in.pdf
| gwern wrote:
| Curious. I wonder what Matplotlib does that's so bad? If it's
| merely being pathological, then compressing the PDFs with JBIG2
| would make them potentially even smaller. (I usually do that with
| 'ocrmypdf', which doesn't have to OCR your PDF.)
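The no-OCR use of ocrmypdf mentioned here could look roughly like the sketch below. It assumes ocrmypdf is installed (plus jbig2enc if JBIG2 encoding is wanted). The function name is illustrative, and the flag combination follows what I understand to be ocrmypdf's optimize-only recipe, so check `ocrmypdf --help` for your version. It only affects raster images inside the PDF.

```shell
# Hypothetical helper: optimize the images in a PDF with ocrmypdf, skipping OCR.
# --tesseract-timeout 0 gives OCR no time to run, --skip-text leaves pages
# that already contain text alone, and --optimize 3 selects the most
# aggressive image optimization level.
optimize_without_ocr() {
    ocrmypdf --tesseract-timeout 0 --optimize 3 --skip-text "$1" "$2"
}
```

Usage would be `optimize_without_ocr file-in.pdf file-out.pdf`.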
| mkl wrote:
| Depends on your PDF and your goals, I think. JBIG2 is for
| raster data, but Matplotlib outputs vector images by default
| when producing PDFs. Rasterising may help with things like huge
| numbers of points that hide each other, but at the cost of
| sharpness and scalability.
| ldjb wrote:
| In the past, I've used pdf2ps to convert the PDF to a PostScript
| file, then used ps2pdf to convert it back to a PDF. This often
| reduces the file size substantially.
|
| I probably wouldn't recommend doing this for anything serious
| (since it might negatively affect the a11y of the file), but in
| the right situations it can be a quick and simple way to
| massively reduce a PDF's filesize without noticeable side-
| effects.
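A minimal sketch of the round trip described above, assuming the `pdf2ps` and `ps2pdf` wrapper scripts that ship with Ghostscript are on the PATH. The function name and temporary-file handling are illustrative, not part of the original comment.

```shell
# Hypothetical helper: PDF -> PostScript -> PDF round trip.
roundtrip_pdf() {
    src=$1
    dst=$2
    tmp=$(mktemp "${TMPDIR:-/tmp}/roundtrip.XXXXXX") || return 1
    # Convert to PostScript, then back to PDF; remove the intermediate
    # file whether or not the conversion succeeded.
    if pdf2ps "$src" "$tmp" && ps2pdf "$tmp" "$dst"; then
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would be `roundtrip_pdf file-in.pdf file-out.pdf`. Going through PostScript typically drops tagging and similar structure, which is where the accessibility caveat comes from.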
| tingletech wrote:
| I came across this trick once that worked to shrink a bunch of
| PDF files I needed to get up to meet a grant deadline. Convert it
| to postscript with poppler, and back to PDF with ghostscript.
| YMMV. http://tingletech.github.io/pdftrick/
| tambourine_man wrote:
| It's a good trick, but I occasionally get errors such as:
| GPL Ghostscript 9.10: Missing glyph CID=84, glyph=0054 in the
| font Helvetica-Light
| boramalper wrote:
| For scans, DjVu[0] is incredibly good.
|
| [0] https://en.wikipedia.org/wiki/DjVu
| mkl wrote:
| I do this too. Be careful relying on it in an automated way,
| though: it will make some PDFs bigger.
| jbay808 wrote:
| Wouldn't a good automation process compare the output and input
| filesizes and return whichever was smaller?
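The size check suggested here can be sketched as a small wrapper. The name `shrink_or_keep` and its calling convention are assumptions made for illustration: any command that takes an input path and an output path as its last two arguments can be plugged in.

```shell
# Hypothetical wrapper: run any "compress IN OUT" command, then fall back to
# a copy of the original if the result is not actually smaller.
shrink_or_keep() {
    src=$1
    dst=$2
    shift 2
    # The remaining arguments are the compressor command; the input and
    # output paths are appended to it.
    "$@" "$src" "$dst" || return 1
    # Arithmetic expansion strips the padding some wc implementations emit.
    before=$(($(wc -c < "$src")))
    after=$(($(wc -c < "$dst")))
    if [ "$after" -ge "$before" ]; then
        cp "$src" "$dst"
    fi
}
```

With a function such as `gs_compress() { gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dQUIET -sOutputFile="$2" "$1"; }`, the call would be `shrink_or_keep file-in.pdf file-out.pdf gs_compress`.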
| thehappypm wrote:
| This is so true. You can't really win in general.
|
| High res images are going to look pretty good in PDF/print. SVG
| is going to look amazing, infinitely zoomable and sharp at
| every level.
|
| You'd think raster images are bigger, right? Well, SVGs of
| graphs can be huge. Imagine a scatter plot of a million data
| points. As a PNG, it's roughly the same size as a scatter plot
| with 1000 points at the same resolution. But as an SVG, a million
| points is going to be both slow to draw and take a lot of space!
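To make the scaling argument concrete, here is a rough sketch that emits a bare-bones SVG scatter plot with one `<circle>` element per point and reports its size in bytes. The function names, canvas size, and marker radius are arbitrary choices for illustration. A raster image of the same plot is bounded by its pixel dimensions instead, which is the contrast being drawn above.

```shell
# Hypothetical generator: write an SVG scatter plot with $1 random points.
svg_scatter() {
    awk -v n="$1" 'BEGIN {
        srand(1)
        print "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"640\" height=\"480\">"
        # One element per data point, so the file grows linearly with n.
        for (i = 0; i < n; i++)
            printf "<circle cx=\"%.1f\" cy=\"%.1f\" r=\"2\"/>\n", rand() * 640, rand() * 480
        print "</svg>"
    }'
}

# Size in bytes of the SVG for $1 points.
svg_scatter_bytes() {
    svg_scatter "$1" | wc -c | tr -d ' '
}
```

Comparing `svg_scatter_bytes 1000` with `svg_scatter_bytes 1000000` shows the vector file growing in step with the point count.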
| heavymark wrote:
| They mention not wanting to upload to some sketchy site (very
| understandable), but wanted to note for the average user, Adobe
| has a free web service for compressing PDFs that works very well.
| Far better than the compression that comes with Preview or even
| Adobe Acrobat oddly. When typing "compress pdf" into Google it's
| the second result that pops up.
| kccqzy wrote:
| Just compressing PDFs is not informative enough to tell the
| user what it is doing under the hood. Is it subsetting fonts?
| That's usually harmless. Is it rasterizing the vector graphics?
| That's generally not fine by me. Is it downscaling the raster
| graphics? It's probably fine but would depend on the intended
| audience of the PDF and the DPI. Is it merely optimizing the
| compression used for compressing the content stream? That's
| also fine. Is it using some thresholding to reduce grayscale to
| black-and-white? That's only fine if the document is scanned.
| Etc, etc.
|
| I can never trust this kind of "simple tool" that magically
| reduces file sizes without telling me what it's doing.
| Personally, I think PDF is too complicated a format to have
| this kind of simple file-size-reduction tool.
| jamiedamien wrote:
| Oh man, I've been down the rabbit hole of reducing matplotlib PDF
| sizes too many times. Ghostscript is great most of the time, but
| as mkl points out, it can make some PDFs bigger.
|
| In particular, matplotlib plots that use points (markers) blow up
| in size quite a bit after processing through ghostscript, due to
| the way matplotlib re-uses spline information to draw e.g. the
| circles, whereas ghostscript seemingly cannot / chooses not to
| (?). I recall something to do with xobject re-use...
|
| I've also found that if you use type 42 fonts (helpful if
| submitting to a conference where the submission system doesn't
| accept type 3 fonts), matplotlib will not subset the font,
| resulting in increased file sizes.
|
| So I use a similar ghostscript script, but one that also checks
| if the resulting file is actually smaller. If it's bigger, it
| just uses the original PDF. For files with lots of points, I've
| found that rasterizing just the points artist is a good solution
| (everything else in the plot is still vector), which allows for
| ghostscript to subset the type 42 fonts without the file-size
| explosion due to the points. Still, I wish there was a good way
| or script to e.g. just subset fonts in a PDF file, as well as
| processing a PDF to remove redundant fonts.
|
| When including many PDF plots into a large LaTeX document, each
| PDF still comes with embedded fonts, which can increase the file
| size of the final PDF. Most of the fonts end up being duplicates.
| For this, I use a custom matplotlib backend that creates a PDF
| file with no text, together with a PGF file that specifies the
| position of each text. LaTeX then handles all the text rendering
| (which results in nice looking figures!), so each font is only
| included once in the final PDF.
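One way to check whether fonts in a given PDF were embedded without subsetting (the type 42 situation described above) is poppler's `pdffonts`. The sketch below assumes its usual table layout: two header lines, then one row per font ending in the columns emb, sub, uni, and a two-field object ID. The helper name is made up for illustration.

```shell
# Hypothetical helper: print the names of fonts that are embedded in full
# (emb = yes) but not subsetted (sub = no).
# Fields are counted from the end of each row because the type column can
# contain spaces (for example "Type 1C" or "CID TrueType").
unsubsetted_fonts() {
    pdffonts "$1" | awk 'NR > 2 && $(NF-4) == "yes" && $(NF-3) == "no" { print $1 }'
}
```

Running `unsubsetted_fonts plot.pdf` before and after a Ghostscript pass shows whether the pass subsetted anything. Running plain `pdffonts` on the final LaTeX output shows how many duplicate fonts were carried in by the included figures.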
| PythonNut wrote:
| Wow, this sounds really cool! Out of curiosity, did you get bad
| results with the pure PGF backend? (And can you link to your
| script by any chance?) I'm always amazed that including
| matplotlib plots in LaTeX documents is so fraught since it's
| such a common use case.
|
| Also I read that matplotlib 3.5 has some sort of improved
| support for type 42 subsetting. I haven't had a chance to try
| it out yet but this could be a welcome improvement!
| jamiedamien wrote:
| Oh didn't know about the improved type 42 font support in the
| new matplotlib! That's good to know and I should check it
| out.
|
| And good point, the PGF works just as well (results should be
| identical), but since all the plot information has to be
| compiled by latex, it ends up ballooning the compilation time
| of the tex doc and the matplotlib PGF page suggests that you
| can run into memory issues as well. I was doing this for a
| thesis with 50+ plots and so still wanted compilation to be
| fast.
|
| I've suggested this as an improvement to matplotlib, but
| unlikely to be merged since maybe it's a bit hacky (although
| it's very similar to what Inkscape's export to LaTeX option
| does): https://github.com/matplotlib/matplotlib/issues/22297
| (the backend file can be found here:
| https://github.com/matplotlib/matplotlib/files/7921801/backe...)
|
| And the gs script is below:
|     #!/bin/bash
|     set -e
|     set -o pipefail
|     # Require at least an input file.
|     if [ -z "$1" ]; then
|         echo "Supply input output"
|         exit 1
|     fi
|     # Default the output name to <input>-small.pdf; refuse to overwrite it.
|     if [ -z "$2" ]; then
|         outfile="$(basename "${1}" .pdf)-small.pdf"
|         if [ -f "$outfile" ]; then
|             echo "WARNING ${outfile} already exists."
|             echo "Supply input output"
|             exit 1
|         fi
|     else
|         outfile="${2}"
|     fi
|     gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH \
|         -dQUIET -dDetectDuplicateImages=true -r150 \
|         -sOutputFile="${outfile}" "${1}"
|     # Keep the original if Ghostscript did not make the file smaller.
|     pre_b=$(wc -c < "${1}")
|     post_b=$(wc -c < "${outfile}")
|     if (( pre_b <= post_b )); then
|         echo "Original is smaller ($pre_b -> $post_b). copying..."
|         cp "${1}" "${outfile}"
|     fi
___________________________________________________________________
(page generated 2022-01-31 23:01 UTC)