[HN Gopher] Reducing the Size of Large PDFs
       ___________________________________________________________________
        
       Reducing the Size of Large PDFs
        
       Author : ingve
       Score  : 52 points
       Date   : 2022-01-30 13:45 UTC (1 day ago)
        
 (HTM) web link (leancrew.com)
 (TXT) w3m dump (leancrew.com)
        
       | Gys wrote:
       | I use this to also reduce the color info by converting to grey:
       | gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4
       | -dPDFSETTINGS=/ebook -dColorConversionStrategy=/Gray
       | -dProcessColorModel=/DeviceGray -dNOPAUSE -dQUIET -dBATCH
       | -sOutputFile=file-out.pdf file-in.pdf
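
For use from a script, the same invocation can be built as an argument list. This is a minimal sketch, assuming a `gs` binary is on the PATH; the helper names `gray_gs_command` and `convert_to_gray` are hypothetical, and the flags are exactly the ones in the command above.

```python
import subprocess


def gray_gs_command(src, dst):
    """Build the Ghostscript argument list for the grayscale conversion."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook",
        "-dColorConversionStrategy=/Gray",
        "-dProcessColorModel=/DeviceGray",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def convert_to_gray(src, dst):
    # Assumption: `gs` is installed. check=True raises CalledProcessError
    # if Ghostscript exits with a non-zero status.
    subprocess.run(gray_gs_command(src, dst), check=True)
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.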
        
       | gwern wrote:
       | Curious. I wonder what Matplotlib does that's so bad? If it's
       | merely being pathological, then compressing the PDFs with JBIG2
       | would make them potentially even smaller. (I usually do that with
       | 'ocrmypdf', which doesn't have to OCR your PDF.)
        
         | mkl wrote:
         | Depends on your PDF and your goals, I think. JBIG2 is for
         | raster data, but Matplotlib outputs vector images by default
         | when producing PDFs. Rasterising may help with things like huge
         | numbers of points that hide each other, but at the cost of
         | sharpness and scalability.
        
       | ldjb wrote:
       | In the past, I've used pdf2ps to convert the PDF to a PostScript
       | file, then used ps2pdf to convert it back to a PDF. This often
       | reduces the file size substantially.
       | 
       | I probably wouldn't recommend doing this for anything serious
       | (since it might negatively affect the a11y of the file), but in
       | the right situations it can be a quick and simple way to
       | massively reduce a PDF's filesize without noticeable side-
       | effects.
        
       | tingletech wrote:
       | I came across this trick once that worked to shrink a bunch of
       | PDF files I needed to get up to meet a grant deadline. Convert it
       | to postscript with poppler, and back to PDF with ghostscript.
       | YMMV. http://tingletech.github.io/pdftrick/
        
       | tambourine_man wrote:
       | It's a good trick, but I occasionally get errors such as:
       | GPL Ghostscript 9.10: Missing glyph CID=84, glyph=0054 in the
       | font Helvetica-Light
        
       | boramalper wrote:
       | For scans, DjVu[0] is incredibly good.
       | 
       | [0] https://en.wikipedia.org/wiki/DjVu
        
       | mkl wrote:
       | I do this too. Be careful relying on it in an automated way,
       | though: it will make some PDFs bigger.
        
         | jbay808 wrote:
         | Wouldn't a good automation process compare the output and input
         | filesizes and return whichever was smaller?
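
The size check described here might look like the following. It is a minimal sketch under the assumption that the recompressed file has already been written; the function name `keep_smaller` is hypothetical.

```python
import os
import shutil


def keep_smaller(original, candidate):
    """Make sure `candidate` is never larger than `original`.

    If the recompressed file did not shrink, it is overwritten with a
    copy of the original. Returns True when the candidate was kept.
    """
    if os.path.getsize(candidate) < os.path.getsize(original):
        return True
    shutil.copyfile(original, candidate)
    return False
```

The comparison only looks at byte counts; it says nothing about whether the smaller file lost fonts, resolution, or accessibility data along the way.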
        
         | thehappypm wrote:
         | This is so true. You can't really win in general.
         | 
         | High res images are going to look pretty good in PDF/print. SVG
         | is going to look amazing, infinitely zoomable and sharp at
         | every level.
         | 
         | You'd think raster images are bigger, right? Well, SVGs of
         | graphs can be huge. Imagine a scatter plot of a million data
         | points. As a PNG, it's roughly the same size as a scatter plot
         | with 1000 points at the same resolution. But as an SVG, a
         | million points are going to be both slow to draw and take a
         | lot of space!
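
The scaling argument can be checked with a rough sketch. It assumes a hand-rolled SVG with one `<circle>` element per point (not what matplotlib actually emits) and compares it against an uncompressed 500x500 RGB raster; the helper `scatter_svg` is hypothetical.

```python
import random


def scatter_svg(n, size=500, seed=0):
    """Return a minimal SVG scatter plot with one <circle> per point."""
    rng = random.Random(seed)
    circles = "".join(
        f'<circle cx="{rng.uniform(0, size):.1f}" '
        f'cy="{rng.uniform(0, size):.1f}" r="1"/>'
        for _ in range(n)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">{circles}</svg>'
    )


small = len(scatter_svg(1_000))
large = len(scatter_svg(100_000))

# The vector size grows roughly linearly with the number of points, while
# an uncompressed raster at a fixed resolution is always
# width * height * channels bytes, however many points were drawn.
raster_bytes = 500 * 500 * 3
print(small, large, raster_bytes)
```

A real PNG would be compressed and so smaller than `raster_bytes`, which only makes the gap to the vector file wider at high point counts.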
        
       | heavymark wrote:
       | They mention not wanting to upload to some sketchy site (very
       | understandable), but wanted to note for the average user, Adobe
       | has a free web service for compressing PDFs that works very well.
       | Far better than the compression that comes with Preview or even
       | Adobe Acrobat, oddly. When typing "compress pdf" into Google it's
       | the second result that pops up.
        
         | kccqzy wrote:
         | Just compressing PDFs is not informative enough to tell the
         | user what it is doing under the hood. Is it subsetting fonts?
         | That's usually harmless. Is it rasterizing the vector graphics?
         | That's generally not fine by me. Is it downscaling the raster
         | graphics? It's probably fine but would depend on the intended
         | audience of the PDF and the DPI. Is it merely optimizing the
         | compression used for compressing the content stream? That's
         | also fine. Is it using some thresholding to reduce grayscale to
         | black-and-white? That's only fine if the document is scanned.
         | Etc, etc.
         | 
         | I can never trust this kind of "simple tool" that magically
         | reduces file sizes without telling me what it's doing.
         | Personally, I think PDF is too complicated a format to have
         | this kind of simple file-size-reduction tool.
        
       | jamiedamien wrote:
       | Oh man, I've been down the rabbit hole of reducing matplotlib PDF
       | sizes too many times. Ghostscript is great most of the time, but
       | as mkl points out, it can make some PDFs bigger.
       | 
       | In particular, matplotlib plots that use points (markers) blow up
       | in size quite a bit after processing through ghostscript, due to
       | the way matplotlib re-uses spline information to draw e.g. the
       | circles, whereas ghostscript seemingly cannot / chooses not to
       | (?). I recall something to do with xobject re-use...
       | 
       | I've also found that if you use type 42 fonts (helpful if
       | submitting to a conference where the submission system doesn't
       | accept type 3 fonts), matplotlib will not subset the font,
       | resulting in increased file sizes.
       | 
       | So I use a similar ghostscript script, but one that also checks
       | if the resulting file is actually smaller. If it's bigger, it
       | just uses the original PDF. For files with lots of points, I've
       | found that rasterizing just the points artist is a good solution
       | (everything else in the plot is still vector), which allows for
       | ghostscript to subset the type 42 fonts without the file-size
       | explosion due to the points. Still, I wish there was a good way
       | or script to e.g. just subset fonts in a PDF file, as well as
       | processing a PDF to remove redundant fonts.
       | 
       | When including many PDF plots into a large LaTeX document, each
       | PDF still comes with embedded fonts, which can increase the file
       | size of the final PDF. Most of the fonts end up being duplicates.
       | For this, I use a custom matplotlib backend that creates a PDF
       | file with no text, together with a PGF file that specifies the
       | position of each text. LaTeX then handles all the text rendering
       | (which results in nice looking figures!), so each font is only
       | included once in the final PDF.
        
         | PythonNut wrote:
         | Wow, this sounds really cool! Out of curiosity, did you get bad
         | results with the pure PGF backend? (And can you link to your
         | script by any chance?) I'm always amazed that including
         | matplotlib plots in LaTeX documents is so fraught since it's
         | such a common use case.
         | 
         | Also I read that matplotlib 3.5 has some sort of improved
         | support for type 42 subsetting. I haven't had a chance to try
         | it out yet but this could be a welcome improvement!
        
           | jamiedamien wrote:
           | Oh didn't know about the improved type 42 font support in the
           | new matplotlib! That's good to know and I should check it
           | out.
           | 
           | And good point, the PGF backend works just as well (results
           | should be identical), but since all the plot information has
           | to be compiled by LaTeX, it ends up ballooning the compilation
           | time of the tex doc, and the matplotlib PGF page suggests
           | that you can run into memory issues as well. I was doing this
           | for a thesis with 50+ plots and so still wanted compilation
           | to be fast.
           | 
           | I've suggested this as an improvement to matplotlib, but it's
           | unlikely to be merged since it may be a bit hacky (although
           | it's very similar to what Inkscape's export to LaTeX option
           | does): https://github.com/matplotlib/matplotlib/issues/22297
           | (the backend file can be found here:
           | https://github.com/matplotlib/matplotlib/files/7921801/backe...)
           | 
           | And the gs script is below:
           |
           |     #!/bin/bash
           |     set -e
           |     set -o pipefail
           |
           |     if [ -z "$1" ]; then
           |       echo "Supply input output"
           |       exit 1
           |     fi
           |     if [ -z "$2" ]; then
           |       outfile="$(basename "${1}" .pdf)-small.pdf"
           |       if [ -f "$outfile" ]; then
           |         echo "WARNING ${outfile} already exists."
           |         echo "Supply input output"
           |         exit 1
           |       fi
           |     else
           |       outfile="${2}"
           |     fi
           |
           |     gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE \
           |       -dBATCH -dQUIET -dDetectDuplicateImages=true -r150 \
           |       -sOutputFile="${outfile}" "${1}"
           |
           |     # Read sizes via stdin so wc prints only the byte count.
           |     pre_b=$(wc -c < "${1}")
           |     post_b=$(wc -c < "${outfile}")
           |     if (( pre_b <= post_b )); then
           |       echo "Original is smaller ($pre_b -> $post_b). copying..."
           |       cp "${1}" "${outfile}"
           |     fi
        