[HN Gopher] Show HN: Beautiful PDFs from HTML
       ___________________________________________________________________
        
       Show HN: Beautiful PDFs from HTML
        
       Author : abhinav22
       Score  : 345 points
       Date   : 2021-04-04 18:16 UTC (4 hours ago)
        
 (HTM) web link (pdf.math.dev)
 (TXT) w3m dump (pdf.math.dev)
        
       | kartoshechka wrote:
       | It should be the other way. PDF ought to die and let documents be
       | parsable
        
       | synergy20 wrote:
       | PDF is elegant to my eyes. To make it a web surfing experience
       | for me it lacks one thing: a TOC on the left or right side, so I
       | can click and jump to various sections quickly.
        
         | abhinav22 wrote:
         | The table of contents is itself hyperlinked. And at the bottom
         | of each page is a link to return to the table of contents.
         | 
         | Let me know if that works well?
        
       | raosl wrote:
       | Is there any option to automatically trigger saving such PDF as a
       | file?
        
       | hda111 wrote:
       | I would prefer if it was a single page like iOS does it when a
       | screenshot of a webpage is made where it offers to export a PDF
       | with everything.
        
       | raosl wrote:
       | Is there an option to automatically trigger saving such PDF using
       | JS?
        
       | GrumpyNl wrote:
       | There should be a button on there that converts the site to a
       | pdf.
        
       | hu3 wrote:
       | Are there any report generation libraries built on top of
       | paged.js?
       | 
       | I've been looking for a report generation solution based on
       | frontend technology.
       | 
       | Btw. This is great. Thanks for sharing.
        
         | dreix wrote:
         | I started generating PDF before pagedjs and use my own electron
         | based solution. Search for schild.report on github to see how
         | it works. Basically reports are created via svelte templates
         | and then printed via the electron print api. Works extremely
         | well.
        
         | nojito wrote:
         | Yup. I use pagedown all the time.
         | 
         | https://pagedown.rbind.io/
        
         | abhinav22 wrote:
         | Thanks for your kind words.
         | 
         | Would you be able to expand on what you mean by "report
         | generation libraries"?
         | 
         | For example, I am building (in Common Lisp, but it's trivial
         | and can be done in any software) a tool to read content from a
         | database and auto generate the HTML markup for producing pdf
         | reports. This allows me to reuse content across reports and
         | also leverage the full power of databases (text search in
         | particular). As another example, I have many monthly financial
         | metrics - I will store these in a database then use my lisp
         | markup tool to generate the necessary HTML to produce the pdf
         | report (via paged.js).
         | 
         | In addition, one can use headless chrome to automate the full
         | workflow so that the reports are generated directly from your
         | program and not via File > Print in your browser.
         | 
         | Was that what you were thinking of?
         | 
         | You can also add charts via Chart.js.
         | 
         | The beauty of paged.js is that you can leverage many of the
         | features of browsers and JavaScript libraries in your report
         | generation.
         | 
         | I wasn't able to get syntax highlighting for code blocks to
         | work however, need to dig into that a bit more.
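
A minimal sketch of the workflow described above: build report HTML from stored rows, then hand it to headless Chrome for printing. The function names and sample values are hypothetical, the rows stand in for a database query, and the paged.js script tag is left out. `--headless` and `--print-to-pdf` are real Chrome switches, but whether a plain command-line run waits for paged.js to finish paginating is an assumption to verify before relying on it.

```python
import html
from pathlib import Path


def render_report_html(title, rows):
    """Build a small HTML report from (metric, value) rows, escaping all text."""
    body_rows = "\n".join(
        f"<tr><td>{html.escape(str(name))}</td><td>{html.escape(str(value))}</td></tr>"
        for name, value in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<table>\n{body_rows}\n</table></body></html>\n"
    )


def chrome_pdf_command(chrome_binary, html_path, pdf_path):
    """Return the argv for a headless print-to-PDF run (this does not execute it)."""
    return [
        chrome_binary,
        "--headless",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),  # Chrome expects a URL, so use file://
    ]
```

Writing the HTML to disk and passing the returned list to `subprocess.run` would replace the manual File > Print step.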
        
           | hu3 wrote:
           | > Was that what you were thinking of?
           | 
           | Yes! This is great. Thanks for the pointers. I'll look into
           | that.
        
       | brightball wrote:
       | My question on these is always how it handles multi page
       | documents. Most DOM and CSS approaches to HTML to PDF require it
       | to fully render everything for placement before it can convert.
       | 
       | If you have to render something big, like a 300 page member
       | directory for example, the approach will blow up.
        
         | idreyn wrote:
         | I used paged.js in production and while it's definitely not
         | fast enough to run during an HTTP request, it can render a 300
         | page document reasonably quickly, definitely within the 120
         | second TTL of our worker tasks. It can be quite finicky though
         | and sometimes stalls on things like images that are taller than
         | a single page.
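
One way to keep a slow or stalled render inside a worker's time budget is to run it as a child process with a hard deadline. This is a sketch under assumptions: `run_with_deadline` is a hypothetical helper, the 120 second default only mirrors the TTL mentioned above, and the command to run is whatever renderer you already use.

```python
import subprocess


def run_with_deadline(argv, timeout_seconds=120):
    """Run a render command; return True on success, False on failure or timeout."""
    try:
        completed = subprocess.run(argv, timeout=timeout_seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return False
    return completed.returncode == 0
```

A stalled render then shows up as a `False` result that the task queue can retry or report, instead of a hung worker.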
        
       | runxel wrote:
       | For anybody working with pandoc you should try Weasyprint [0].
       | 
       | [0]: https://weasyprint.org/
        
         | sonthonax wrote:
         | Weasyprint is great, but be prepared to start hacking its
         | layout engine for more complex generations.
         | 
         | It's nowhere near as mature as PrinceXML.
        
       | gnicholas wrote:
       | FYI typo on "foooters"
        
       | exhilaration wrote:
       | Does anyone know how this compares to PrinceXML / DocRaptor? We
       | pay them big bucks for PDF generation (invoices specifically) and
       | so far we haven't found anything comparable.
        
         | abhinav22 wrote:
         | I was a huge supporter of PrinceXML / DocRaptor for precisely
         | the same reason - all the alternatives (that I knew at the
         | time) were not good.
         | 
         | Paged.js was a revelation (thanks to HN for telling me about
         | it!). It is based off CSS Print Specifications like PrinceXML
         | (as is my understanding - I'm about 95% sure), and to me it's
         | even better because it utilizes all the other front end
         | technologies directly from your browser - I think there are
         | some use cases where PrinceXML won't be able to get the same
         | functionality.
         | 
         | For invoices, I think you should be able to easily switch over.
         | Based on what I can see.
        
         | jitl wrote:
         | Have you tried using Chromium print-to-pdf by an API like
         | Puppeteer or Playwright? Combined with something like paged.js
         | and a decent print stylesheet, you can get pretty good quality
         | output.
        
         | cdolan wrote:
         | I also pay a lot per month to DocRaptor and they've been very
         | reliable to date, but unless I'm rendering charts with JS I
         | feel like I'm really overspending
        
         | Semaphor wrote:
         | Not sure about paged.js, but we use Weasyprint for invoices. It
         | isn't as advanced as some of the paid options (most annoyingly
         | page headers are somewhat hacky) but it works very well.
         | 
         | My go-to for everything CSS Paged Media is [0] which has a nice
         | comparison of supported features at [1]. They recently added
         | Weasyprint, PagedJS and Typeset.sh
         | 
         | [0]: https://css.paged.media/
         | 
         | [1]: https://css.paged.media/lessons
        
         | zulko wrote:
         | A few years ago I started an alternative to PrinceXML called
         | ReLaXed.js [1], it's always been sufficient for my reports but
         | it may lack some pagination/layout features that Paged.js may
         | have as they seem to have given this much more thought (still
         | wrapping my head around whether paged.js could be "plugged
         | into" Relaxed).
         | 
         | [1] https://github.com/RelaxedJS/ReLaXed
        
       | systemvoltage wrote:
       | There is something really pleasing about reading PDFs. It's
       | perhaps how static it is and it won't change on me. I can zoom in
       | or "operate" on it without some reorganization. It puts the mind
       | at ease. There is no reflowing. There are no columns shifting.
       | It just is. Like a piece of paper as an analog - the intent of
       | the author and the designer is retained and frozen in time. Fonts
       | are embedded and chosen by the creator. Haters of PDFs do not
       | understand the human aspects of it - they just see it as a
       | specification (which is convoluted).
        
         | agumonkey wrote:
         | I've been obsessed with dynamicity and having everything
         | adaptable.. but yeah static stuff has this weird feeling of
         | 'right'.
        
         | tambourine_man wrote:
         | Copy/pasting can be infuriating though. The web is catching up
         | fast however, but at least I can view source/inspect.
        
         | tyingq wrote:
         | _"It's perhaps how static it is and it won't change on me."_
         | 
         | You might be surprised that's not always true.
         | 
         | https://www.pdfscripting.com/public/FreeStuff/PDFSamples/Jav...
        
         | [deleted]
        
         | yawaworht1978 wrote:
         | Furthermore, even on mobile devices, everything just works.
         | Ctrl+f is non invasive, if I close the document in the middle
         | of 100s of pages and reopen the document, I do not need to
         | scroll, unlike many web apps it preserves the scroll location.
         | It does what it needs to do without any hiccups.
        
         | cryptoz wrote:
         | I have the opposite experience. To me, opening a PDF puts me on
         | edge. My computer is likely to slow to a crawl. It never
         | renders correctly - sometimes it will slowly render each page,
         | one at a time, flashing the screen as it goes. Or maybe the PDF
         | is using some features that my reader doesn't support so it
         | renders incomplete and incorrectly.
         | 
         | Links are often hard to pick out. What is a link and what
         | isn't? What happens when I click on something, is it going to
         | stay in the PDF or open a browser or something?
         | 
         | Don't get me started on moving around in PDFs. There are always
         | 2 sets of page numbers, one for the PDF and one of the
         | document. Extremely confusing.
         | 
         | Searching. Ugh, searching a PDF is a nightmare I don't want to
         | even think about right now. Ctrl+F is broken 99% of the time.
         | 
         | Or at least, that's my experience over the last 20 years. Sure,
         | it's gotten better recently, but, not enough to make my mind
         | 'at ease' exactly. Very stressful to open a PDF still, usually.
        
           | chmod775 wrote:
           | > To me, opening a PDF puts me on edge. My computer is likely
           | to slow to a crawl. It never renders correctly - sometimes it
           | will slowly render each page, one at a time, flashing the
           | screen as it goes.
           | 
           | That probably isn't the fault of the PDF, but the PDF reader
           | you're using.
           | 
           | > Searching. Ugh, searching a PDF is a nightmare I don't want
           | to even think about right now. Ctrl+F is broken 99% of the
           | time.
           | 
           | Now this is actually the fault of PDF and how it does
           | positioning of stuff within it - but about 50% of the blame
           | lies with whichever software generated a shit PDF.
        
             | LordDragonfang wrote:
             | >Now this is actually the fault of PDF and how it does
             | positioning of stuff within it - but about 50% of the blame
             | lies with whichever software generated a shit PDF.
             | 
             | Arguably, it's still the fault of the horribly
             | overcomplicated pdf spec -- html manages to do it just
             | fine, with a plain text format, to boot
        
           | diarrhea wrote:
           | > _Don't get me started on moving around in PDFs. There are
           | always 2 sets of page numbers, one for the PDF and one of the
           | document. Extremely confusing._
           | 
           | It's only a tiny subset of all PDFs in circulation, but the
           | LaTeX PDFs I produce using appropriate settings (mainly
           | KOMAScript class) always nail this. The current page number
           | always corresponds to what is _printed_ in the PDF. This can
           | be alphanumerical (e.g. page "a" / 300, where 300 is the
           | total number of all pages) or roman, for the frontmatter. The
           | PDF viewer will then literally show e.g. "Page XII / 300".
           | 
           | So in that sense, it's in the hands of the party producing
           | the PDF to get this right, not an inherent limitation in the
           | standard.
           | 
           | But now, new problems arise. If you're on the printed page
           | _XII_ but your viewer displays "Page 22 / 300", you know
           | where you are in total. "Page XII / 300" is "correcter" but
           | can be anything.
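
The two numbering schemes can be sketched as a mapping from physical page index to printed label. This is an illustration only: `page_label` and `frontmatter_pages` are hypothetical names, and it assumes roman numerals for the frontmatter followed by arabic numbers restarting at 1. PDF itself can carry such ranges as page labels in the document catalog, which is what viewers read.

```python
def to_roman(n):
    """Convert a positive integer to an upper-case roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
        (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_label(physical_page, frontmatter_pages):
    """Label for a 1-based physical page: roman in frontmatter, arabic afterwards."""
    if physical_page <= frontmatter_pages:
        return to_roman(physical_page)
    return str(physical_page - frontmatter_pages)
```

With 20 frontmatter pages, physical page 12 is labelled "XII" and physical page 22 is labelled "2", which is why a label alone does not tell you how far into the file you are.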
           | 
           | > _Searching. Ugh, searching a PDF is a nightmare I don't
           | want to even think about right now. Ctrl+F is broken 99% of
           | the time._
           | 
           | Don't share this experience. It's at the same level as in
           | browsers, where CTRL+F is also quite limited (I'd give a
           | kidney to have regex available everywhere---ripgrep-all gets
           | close on the desktop). The only different thing in PDFs is if
           | hyphenation occurs, which is arguably less common in browsers
           | (simply because of poorer typographical standards/people care
           | more in proper PDFs). Your search term will indeed be
           | invisible to CTRL+F. The only other time it breaks down in
           | PDFs is if the PDF is corrupt/poorly produced/bad OCR.
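
The hyphenation case can be shown on already-extracted text. A sketch, assuming the text layer comes out with a hyphen followed by a line break wherever a word was split; `dehyphenate` and `find_term` are hypothetical helpers, not part of any PDF viewer.

```python
import re


def dehyphenate(text):
    """Join words split by a hyphen at a line end: 'hyphen-\\nation' -> 'hyphenation'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def find_term(text, term):
    """Case-insensitive search that treats any whitespace run as a single space."""
    normalized = re.sub(r"\s+", " ", dehyphenate(text))
    return term.lower() in normalized.lower()
```

The rule is crude: a genuine compound that happens to break at its hyphen (for example "well-" at a line end followed by "known") would be joined too, so a real tool needs a dictionary or a hint from the producer.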
        
             | necovek wrote:
             | PDFs can be produced by mapping completely unrelated
             | characters to glyphs that appear as actual characters
             | (basically, they'd embed a "compacted" font that only has
             | the glyphs required for the document, and then map them to
             | ASCII or something). This was quite common with pdftex
             | documents not using ASCII characters in the past, thus
             | making text unsearchable (and even more so when going the
             | ps2pdf route). For example, you'd have a Cyrillic document
             | in which, when you selected text, the highlighted text would
             | be some jumbled ASCII characters.
             | 
             | The fact is that PDF can display one thing and have
             | underlying semantic text be something else entirely
             | (frequently used for OCRing: you show actual scanned images
             | of text, and add the invisible OCRed text as a searchable layer).
             | 
             | It works in the other direction too: you could solve the
             | hyphenation problem in the same way by having PDF include
             | invisible non-hyphenated word in place of the hyphenated
             | one for searching.
             | 
             | Still, PDF is mostly a laying-out format, and while tools
             | have evolved to provide some "meaning" to rendered content,
             | it is never going to be semantic in the sense markup
             | languages can be (i.e. there is no "emphasis", "quote" or
             | "header" command for PDF, instead, it just uses a different
             | font). To put things into perspective, TeX files can be
             | semantic (if a semantic TeX .fmt like LaTeX is used) like
             | HTML/ePub, but PDF is an output format, just like DVI is.
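
A toy model of the subsetted-font problem, under simplifying assumptions: each distinct character gets the next free code starting at ASCII "A", and the reverse table stands in for the ToUnicode map a well-behaved producer embeds. Real fonts and encodings are more involved, and the names here are hypothetical.

```python
def subset_encode(text, first_code=ord("A")):
    """Assign each distinct character a code in order of first use."""
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = first_code + len(code_for)
        codes.append(code_for[ch])
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def extract_text(codes, to_unicode=None):
    """Recover text from codes; without a mapping, codes are read as plain ASCII."""
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode[c] for c in codes)
```

The page renders the same either way because the glyph shapes are untouched; only selection, copy, and search see the jumbled codes when the mapping is missing.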
        
         | bqmjjx0kac wrote:
         | I don't have much experience with PDF as a spec, but I guess
         | I'm a "PDF hater."
         | 
         | It's the things that didn't need to be PDFs, but inexplicably
         | are, that annoy me. Like data dumps from local governments that
         | could have been machine-readable, or announcements that are
         | distributed in print and emailed as PDFs, rather than lifting
         | the content into the message body.
        
           | diarrhea wrote:
           | Agreed, but there exist solutions [0] to make even PDF tables
           | machine-readable (optionally making use of machine learning
           | techniques). It's incredibly backwards and much harder than,
           | say, CSV, but it might get the machine-reading job done.
           | 
           | 0: https://camelot-py.readthedocs.io/en/master/
        
           | BlueTemplar wrote:
           | Also pdf has very bad support of animation formats.
        
             | tonyedgecombe wrote:
             | That's a good thing, I don't want to see animations in a
             | PDF.
        
               | BlueTemplar wrote:
               | And I would like to have an actual portable document
               | format for electronic documents, but since mhtml support
               | was for some reason dropped except in e-mail clients (as
               | .eml), I'm stuck with this frankenformat that is pdf !
        
           | axiolite wrote:
           | > data dumps from local governments that could have been
           | machine-readable, or announcements that are distributed in
           | print and emailed as PDFs, rather than lifting the content
           | into the message body.
           | 
           | You should thank PDF for giving you any useful electronic
           | copies at all.
           | 
           | If it's scanned-in papers, sticking them loosely in an e-mail
           | or web page would be much more difficult to read through.
           | 
           | If it's text data, then perhaps it was primarily composed to
           | be printed, and PDF allows easy creation of readable
           | electronic copies with minimum of effort from any input.
           | Before PDF you might have gotten nothing at all, because most
           | people don't have readers for various obscure proprietary
           | input formats.
           | 
           | And PDF is far easier than other formats to convert into
           | another format for your own consumption. Do you have a
           | command-line tool which will extract the embedded images out
           | of a Microsoft Word document? Or one that will convert it to
           | plain text, preserving formatting? pdfimages and pdftotext
           | -layout are very widely available.
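
A sketch of turning layout-preserved text back into rows, assuming poppler's `pdftotext` is installed and that columns in its `-layout` output are separated by at least two spaces. The helper names are hypothetical, and the splitting rule is a heuristic, not something pdftotext guarantees.

```python
import re


def pdftotext_command(pdf_path, txt_path):
    """argv for a layout-preserving text dump (not executed here)."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def rows_from_layout_text(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r" {2,}", line.strip()))
    return rows
```

Empty cells and columns separated by a single space break this rule, which is part of why recovering tables from PDF is harder than reading CSV.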
        
             | Aeolun wrote:
             | > You should thank PDF for giving you any useful electronic
             | copies at all.
             | 
             | I think the point is that data dumps in PDF format are not
             | useful at all.
             | 
             | I take objection to your statement that it's easier to
             | convert. The only reason there are so many tools to do so
             | is because it's so hard/impossible in the first place.
        
           | systemvoltage wrote:
           | You're right - it's just the wrong format and it isn't
           | intended for that. Gov should be publishing
           | text/csv/parsable-formats not PDFs when it comes to data
           | dumps.
        
           | BostonEnginerd wrote:
           | Our city council posts PDF files which consist of scans of
           | documents someone printed out. Sometimes they turn on OCR,
           | but not often.
        
         | Lammy wrote:
         | Also they can be justified in ways browsers can't (yet) do.
         | Every web page has a ragged right.
        
           | jimmygrapes wrote:
           | I have been trying to place the justification format I have
           | seen in some print books. In most cases it will add spaces
           | between words to make up for sentences with words that can't
           | be split easily, but it also doesn't fully right-justify the
           | last letter of the line. It seems to be something like "add
             | inter-word spacing _unless_ spacing exceeds X"
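
The "add spacing unless it exceeds X" rule can be sketched for a single line in a fixed-width model. This is a guess at the behaviour described, not a known typesetting algorithm: `justify_line` and `max_gap` are hypothetical, widths are counted in characters, and line breaking is assumed to have happened already.

```python
def justify_line(words, width, max_gap=3):
    """Pad gaps between words to reach `width`, unless a gap would exceed `max_gap`."""
    if len(words) < 2:
        return " ".join(words)
    letters = sum(len(w) for w in words)
    gaps = len(words) - 1
    spaces = width - letters
    if spaces < gaps:
        return " ".join(words)  # does not fit; the caller has to re-break the line
    base, extra = divmod(spaces, gaps)
    widest = base + (1 if extra else 0)
    if widest > max_gap:
        return " ".join(words)  # too loose: fall back to ragged right
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost `extra` gaps each take one additional space.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)
```

Lines that would need very wide gaps stay ragged, which matches the "mostly justified" look; proportional fonts would need fractional spacing instead of whole spaces.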
        
             | notriddle wrote:
             | The gold standard for computer text justification is the
             | TeX algorithm:
             | https://en.wikipedia.org/wiki/TeX#Hyphenation_and_justificat...
        
           | chmod775 wrote:
           | Like this (the demo at the top is interactive)?
           | 
           | https://developer.mozilla.org/en-US/docs/Web/CSS/text-align
        
             | KMnO4 wrote:
             | Interesting, there's absolutely no difference between
             | justify and align left on iOS Safari. I wouldn't have
             | guessed that's an unsupported feature. I guess it's not
             | common enough for me to notice throughout the web.
        
               | abhinav22 wrote:
               | Justify works in iOS Safari (I just checked). I like to
               | use it in my designs sometimes :)
        
               | URSpider94 wrote:
               | Works fine for me.
        
             | Lammy wrote:
             | No, like 'text-justify: inter-character':
             | https://developer.mozilla.org/en-US/docs/Web/CSS/text-justif...
        
         | maga wrote:
         | Maybe it's simply a force of habit at this point, but I can
         | only read/study/memorize technical stuff from PDF/paginated
         | source--I memorize the overall "picture" of the page and its
         | location along with the bits I actually need, and it's really
         | tough to do with non-paginated sources.
        
           | imposterr wrote:
           | Yes! I totally get this. I think it's akin to the "mind
           | palace" method popularized in media.
        
             | maga wrote:
             | That's an apt observation, actually. I used to practice
             | mind palace when I had to memorize lists, and I do feel
             | more comfortable when information is "physically" placed
             | somewhere like a page, I guess it's all connected.
        
               | hikarudo wrote:
               | I wonder if source code could be read and understood more
               | effectively if it were paginated.
        
               | jhgb wrote:
               | I'm happy that someone came Forth with that idea here.
        
               | Jiocus wrote:
               | Source code often does facilitate its own method of
               | paginating, into different files. One could argue that is
               | the whole point of the practice.
               | 
               | Pagination or not, both pagination and files provide some
               | degree of spatial sense just as 'loci' and memory
               | palaces.
               | 
               | Edit addendum:
               | 
               | There are theories about which senses are our dominant
               | ones, and how they affect our learning processes. Some
               | may lean towards visual cues in their mental life, others
               | on kinetic or sound. Personally I experience my mental
               | models as spatial. Even abstract thoughts become situated
               | "somewhere", if not by itself, then by contrast of other
               | things on my mind.
               | 
               | "Everything is a Memory palace."
               | 
               | Needless to say, when I'm deep off in a terminal with
               | something, I don't think I'd describe it as text-based.
        
               | necovek wrote:
               | Source code has a bunch of other properties that make
               | pagination less useful.
               | 
               | I.e. we strive for short functions, we use indentation
               | heavily, it is commonly rendered in fixed-width fonts
               | (this helps with spatial memory/overview too), etc.
        
               | hikarudo wrote:
               | Good points. Also, code can change frequently, so the
               | "visual memory" reinforced by pagination becomes less
               | useful, maybe even a hindrance.
        
         | jwr wrote:
         | The great thing about PDFs is that you can open huge documents
         | and page through them quickly. I feel this is under-appreciated
         | in today's world, where scrolling is being forced on us
         | everywhere. Scrolling sucks for reading text. Every time you
         | scroll, you have to pay attention to how much you scrolled, and
         | then find your place in the text again.
         | 
         | As for paging speed, just try using GoodReader or PDF Expert on
         | an iPad. I can flip through thousand-page manuals and
         | datasheets as quickly as if it were a paper book. And a 12"
         | iPad shows an entire A4 page without the need for zooming and
         | panning.
         | 
         | In my experience, people who dislike reading PDFs have only
         | tried doing so in Acrobat Reader (which is hot garbage, and
         | slow), on a small screen that is wider than it is tall, zoomed
         | in so that only half a page is being shown. That is a sub-par
         | experience indeed.
        
           | BlueTemplar wrote:
           | Be it pdfs or html, I find my place through chapters, rather
           | than pages.
        
             | pvorb wrote:
             | Haven't you ever been forced to stop reading in the middle
             | of a chapter? It happens to me all the time.
        
               | BlueTemplar wrote:
               | Depends what you mean.
               | 
               | Temporary interruptions yes. But then the location is
               | kept.
               | 
               | Interruptions for a very long time ? I might have to
               | reread the whole chapter anyway...
        
           | Thrymr wrote:
           | > I feel this is under-appreciated in today's world, where
           | scrolling is being forced on us everywhere. Scrolling sucks
           | for reading text. Every time you scroll, you have to pay
           | attention to how much you scrolled, and then find your place
           | in the text again.
           | 
           | This is incredibly important, and something that dedicated
           | book readers like Kindles get right, but I've never seen done
           | well in long web pages. Discrete "pages" (that correspond to
           | "screens") make it much easier to find your place as you go
           | to the next page. Note that multipart web pages often have
           | you scroll through each "page" separately, and give you the
           | worst of both worlds. Sure, PDF isn't always best for reading
           | on a computer or phone screen, but infinite scrolling is
           | annoying too.
        
         | bobbylarrybobby wrote:
         | Isn't zooming in and having text reflow a feature, not a bug,
         | of HTML? PDFs are pretty much impossible to read on a phone
         | because of the endless amount of zooming in and out and
         | horizontal scrolling (unless they were designed for mobile --
         | and then they're hard to read on a desktop). Never mind users
         | on a desktop who just like their text large for ease of reading
         | -- their screen might not be wide enough to fit the text
         | without horizontally scrolling.
         | 
         | As an author, my intent is that the content be easily readable
         | to all readers. I don't see why I should want or get to dictate
         | the layout and aesthetics to my readers.
        
           | lolinder wrote:
           | What's a feature in one context can be a bug in another. I
           | get where OP is coming from: when I zoom in and the text
           | reflows, it's easy to lose my place. PDFs don't have that
           | problem.
           | 
           | Also, digital, free-flow media lose basically all sense of
           | space. PDFs are much better for finding a piece of content
           | again later, because I can remember the location on the page
           | and roughly how many pages into the document.
        
             | kwhitefoot wrote:
             | I wonder why browser bookmarks don't save the position.
        
               | necovek wrote:
               | The answer is likely obvious in that it depends on what
               | content is the visible content at the time of the
               | bookmark, which will further depend on the content itself
               | (it can change since this is a bookmark on an alive web
               | page), page styles, zoom level/scaling, window size etc.
               | 
               | Basically, for a bookmark to fully store a position, it
               | would have to store all of the above (and probably more),
               | and it would only be really usable on the same device as
               | long as the underlying content does not change.
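
A sketch of what such a position-aware bookmark would have to record, and when the saved offset could still be trusted. The field names are hypothetical and the list of layout inputs is deliberately incomplete (fonts and page styles are missing); no browser is known to store bookmarks this way.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionBookmark:
    url: str
    content_sha256: str
    viewport_width: int
    zoom_percent: int
    scroll_y: int


def make_bookmark(url, content, viewport_width, zoom_percent, scroll_y):
    """Record the scroll offset together with the inputs that determined the layout."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return PositionBookmark(url, digest, viewport_width, zoom_percent, scroll_y)


def restorable_scroll(bookmark, content, viewport_width, zoom_percent):
    """Return the saved offset only if content and layout inputs are unchanged."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    same = (
        digest == bookmark.content_sha256
        and viewport_width == bookmark.viewport_width
        and zoom_percent == bookmark.zoom_percent
    )
    return bookmark.scroll_y if same else None
```

Any change to the page text, window width, or zoom invalidates the offset, which is the fragility described above.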
        
             | Gehinnn wrote:
             | Usually, if you have reflow, you can disable it. However,
             | if you don't have reflow, you cannot usually enable it!
        
               | kwhitefoot wrote:
               | Opera does better, or at least it used to.
        
           | noxer wrote:
           | It would not be hard to read on desktop: you can simply show
           | multiple pages, like in a book where you usually see 2 pages.
           | The same concept works if you have room for more pages. It's
           | just not something people do (create PDFs for mobile). Almost
           | all PDFs are meant to be read at approximately the size of DIN
           | A4 for one page. At a time when everyone is, and should be,
           | discouraged from printing stuff, this is not really needed.
        
           | systemvoltage wrote:
           | I think there is a spectrum of commodity vs. artistic mediums
           | in all forms and we often talk past each other when debating
           | the finer points. If your goal is to send out a press release
           | to the public, perhaps layout/aesthetics isn't as important
           | (sometimes it is though so it's not a hard/fast rule). In
           | artistic media, especially in magazines and mixed-media
           | books, layout and aesthetics are an integral part of print
           | media. It is inseparable. Just as in music, you don't want to
           | add an equalizer ruining the original intent of the artist,
           | books created by artists in 1890 still are with us in print
           | format - exactly how they were intended to be published to
           | readers. But it is entirely different if the "music" is a
           | podcast - I want to use an equalizer to bring up the higher
           | frequencies for better audibility. Similarly, if I am reading
           | a novel on an epaper display and I want to increase the font-
           | size or type, we should allow that as you said.
        
             | BlueTemplar wrote:
             | I agree. What annoys me is when not very artistic mediums
             | like scientific articles force a fixed page layout. It gets
             | even worse when you have to hunt down the relevant figures
             | over the following pages because they couldn't be put on
             | the same page due to lack of space. Also opening a figure
             | in a separate window isn't much of a thing either for pdfs.
        
               | akhilpotla wrote:
               | I definitely prefer to read research papers in html. I
               | like to zoom in a lot when reading a long piece on my
               | computer since it helps me read faster and keeps me from
               | getting distracted. I've been thinking about working on a
               | side project where I convert pdfs to html for academic
               | papers.
        
               | pvorb wrote:
               | One benefit of PDFs for research papers is that you can
               | easily save them to your own computer, build up a library
               | of them, highlight lines with functionality built into
               | most PDF readers. I generally prefer HTML for reading,
               | but PDF has some benefits, too. Granted, most of these
               | features are also available for HTML. But for some reason
               | you need to look for browser plugins in order to
               | highlight HTML pages, whereas in PDF you can just use the
               | feature. And PDF is always about the content whereas HTML
               | also typically contains navigation and other distractors.
        
               | BlueTemplar wrote:
               | I _really_ don't get why mhtml has been discontinued by
               | browsers??
        
           | amelius wrote:
           | The necessity of zooming is a shortcoming of the device, not
           | of the text format.
        
             | samatman wrote:
             | Zooming seems like an inevitable consequence of how screens
             | and eyes work, what am I missing?
             | 
             | Forget text for a second, if I want to see fine details in
             | an enormous image I'm going to have to zoom in. I normally
             | adjust font size rather than zooming text but it's nice to
             | have both available.
        
         | michaelgrafl wrote:
         | First thing I do when opening a Word document at work is
         | convert it to PDF and reopen it in Sumatra Reader.
         | 
         | It just feels a lot better to me. It opens faster, it opens at
         | the position I left it at, zooming in and out is fast,
         | scrolling is smoother, and even if I wanted to, I couldn't
         | modify it on accident.
         | 
         | It just feels a lot more reified than something that is
         | responsive or editable.
         | 
         | Not great for mobile, but that's not what I care for at work.
        
       | webwanderings wrote:
       | This is cool. Exactly what I was looking for not too long ago
       | (sometimes markdown does not fit all the needs for
       | documentation).
       | 
       | How about images? How do you handle images; their layout, scaling
       | etc?
        
         | abhinav22 wrote:
         | Basically you can use CSS to manage their layout - dimensions,
         | positioning, scaling, etc. Should work pretty well
         | 
         | If you want floating images (e.g. text on the left, images on
         | the right), it may be a bit more difficult and not perfectly
         | possible. This guide will help:
         | https://www.pagedjs.org/page-floats/
         | 
         | One tricky part is if you want to have text within images and
         | have them the same size as your main text (eg in MS Word where
         | you can have shapes and text boxes). For that, you can probably
         | get close enough with a simple image load, and more precise by
         | using svg graphics, but it may result in a reasonable amount of
         | complexity to make perfect (if at all).
         | 
         | For charts, use Chart.js in my opinion.
        
       | axlee wrote:
       | What is your favourite library for printing HTML to PDF?
        
         | airstrike wrote:
         | https://pandoc.org/ + something like pdflatex
        
       | honzajde wrote:
       | Saved PDF (in Chrome) does not have a TOC as a side pane in
       | Acrobat Reader. Missing a pretty important feature.
        
         | abhinav22 wrote:
         | Yes, that would be an advanced feature and I think likely out
         | of the scope for paged.js. That said the table of contents page
         | is hyperlinked - you can jump to sections, and I put a return
         | to table of contents in the footer to aid with navigation.
         | Hopefully that helps?
        
       | anonu wrote:
       | This is one of the hardest problems we face as a financial
       | research platform. We have a lot of financial data in tables
       | along with line, bar and pie charts. Coming up with something
       | sensible and readable is a bit harder than expected. Ultimately
       | we have a "json to latex" converter we built but it's not
       | great...
        
         | LaundroMat wrote:
         | Maybe Vega and its Figures for Papers can help?
         | 
         | https://vega.github.io/vega-lite/tutorials/figures.html
        
         | TimTheTinker wrote:
         | May I recommend Prince? https://www.princexml.com/
         | 
         | I created a PDF exporter for a manual test tracking app using
         | this -- render to (pretty simple) HTML, pass to the prince
         | executable, and out comes a beautifully typeset PDF.
         | 
         | Prince has its own rendering engine that is purpose-built for
         | PDF rendering. It's actually very good - a lot of professional
         | books and documents have been typeset using Prince.
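
A minimal sketch of the "render to HTML, pass to the prince executable" pipeline described in the comment above, written in Python. It assumes a `prince` binary is installed and on the PATH and relies on Prince's `-o` output option; the function names and file names are hypothetical and not taken from the commenter's app.

```python
import shutil
import subprocess
from pathlib import Path


def build_prince_command(html_path, pdf_path, executable="prince"):
    """Return the argument list for converting one HTML file to a PDF."""
    return [executable, str(html_path), "-o", str(pdf_path)]


def html_to_pdf(html_path, pdf_path):
    """Run Prince on html_path; raises if Prince is missing or exits non-zero."""
    if shutil.which("prince") is None:
        raise FileNotFoundError("prince executable not found on PATH")
    subprocess.run(build_prince_command(html_path, pdf_path), check=True)
    return Path(pdf_path)
```

Calling `html_to_pdf("report.html", "report.pdf")` would hand the already-rendered HTML to Prince; keeping the command construction in its own function makes it easy to check without Prince installed.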
        
         | smt88 wrote:
         | We have been using PDF2XL[1] for this for years (used to be
         | called CogniView).
         | 
         | It's genuinely unbelievable. If the PDF isn't sufficiently
         | structured, it has OCR that seems to "just work".
         | 
         | You can also automate the extraction and integrate it into your
         | pipeline.
         | 
         | The UI is pretty old and ugly-looking, but it is one of the few
         | apps I've used in the last 10 years that made me feel genuine
         | delight.
         | 
         | 1. https://pdf2xl.com
        
         | burmanm wrote:
         | Anything wrong with XSL-FO? I know it's not the hottest thing
         | on earth, but it works. Apache FOP is still developed and it's
         | easy to add it to a pipeline.
        
         | abhinav22 wrote:
         | I work in financial reporting, doing very similar things to
         | what you mentioned here. Mind dropping me an email at
         | Ashok.khanna@hotmail.com and we could discuss further? Would
         | love to chat to others facing similar dilemmas :)
        
       | mywacaday wrote:
       | The section numbers in the table of contents are overwriting the
       | text, maybe not so beautiful.
        
         | abhinav22 wrote:
         | Thanks for flagging. It's aimed at desktop users and a
         | reasonable size screen (eg A4 or so), it's not really a
         | responsive design as it's meant to be a tool to generate PDFs.
         | So it won't work in some browser dimensions (the paged.js code
         | is reasonably complex).
         | 
         | That said it works in iOS as far as I can test. For some reason
         | page numbers in the table of contents are not working perfectly
         | in Safari, but Chrome works pretty well.
         | 
         | I guess the conclusion is that this is aimed somewhat at
         | desktop Chrome users as a specific tool for pdf generation.
        
           | Tomte wrote:
           | I'm seeing all page numbers as 0 in the TOC on iPadOS.
        
             | abhinav22 wrote:
             | Yep - looks like Chrome only gets those right. I will speak
             | to the paged.js guys and see where the bug is in the code.
             | 
             | Rest seems to work on Safari - let me know if any other
             | issues and I'll fix / update accordingly
        
       | sonthonax wrote:
       | By using a W3C standard and devolving the layout engine to the
       | browser, this solves a difficult problem the right way.
       | 
       | I would have loved to have something like this for a project
       | years ago.
        
         | abhinav22 wrote:
         | It really does. The team at paged.js are simply amazing and
         | deserve so much credit.
         | 
         | It's such a big deficiency in the modern web, we really need
         | Chrome / Safari / etc to implement the W3C standard or
         | something better
        
       | sunjester wrote:
       | mobile friendly?
        
         | abhinav22 wrote:
         | No, it's more for pdf production from desktop. The print
         | preview on the example web page works on iPhone and browsers
         | but the aim is really for pdf and not for consumption via
         | mobile etc
        
       | frongpik wrote:
       | Add to this a pdf to html converter, with a focus on official
       | forms (e.g. irs tax forms), ability to easily edit fields and add
       | signatures (similar to how the free android adobe app does it),
       | and you can charge money for it.
        
         | abhinav22 wrote:
         | Thanks - indeed there is likely a market for it. One of the
         | issues is that to get a commercial app, you have to solve for
         | most of the edge cases and make sure it has a good enough UI
         | for the non sophisticated.
         | 
         | I was thinking about doing it, but it would be a lot of work to
         | do right.
         | 
         | By right, I would want it to be the quality of sublime or emacs
         | / vim :) :)
        
           | frongpik wrote:
           | Actually, there's one more usecase for a html to pdf
           | converter: making a book-style copy of a multipage website.
           | I'm looking right now at a scientific site with content
           | spread over multiple pages and it's tiring to find and click
           | all the links to make sure I don't miss anything.
        
           | frongpik wrote:
           | You'd essentially have a pdf editor that can import a pdf,
           | edit it and export the html back to pdf. Working with
           | official forms is one usecase. Another is an iframe that can
           | preview pdfs without resorting to plugins.
        
             | abhinav22 wrote:
             | Indeed - that makes a lot of sense
        
       | buovjaga wrote:
       | I tried out paged.js recently for a genealogical report exported
       | from Gramps, but I had to use PrinceXML because counter-reset to
       | start at a page does not work:
       | https://gitlab.pagedmedia.org/tools/pagedjs/issues/91
       | 
       | Apart from this feature everything worked fine.
        
         | abhinav22 wrote:
         | Do you have a repo I could play around with? I had the same
         | issue, but for my use case I figured a work around - primarily
         | by using classes. I could spend a few minutes and see if I can
         | get it working for you.
         | 
         | But yes, it's a deficiency in the system currently
        
       | majkinetor wrote:
       | Awesome project.
       | 
       | Consider adding in the demo automatic anchors on headers so one
       | can quickly copy them for sharing. Currently they can only be
       | obtained from ToC but you need to scroll to it. On anything
       | larger than a few pages, this is a must. One problem there is that
       | current automatic IDs are generated sequentially and not really
       | user friendly for link sharing.
        
       | dukeofdoom wrote:
       | Adobe makes it harder and harder to use their PDF reader. I live
       | in Canada, and somehow I'm given forms I legally have to use,
       | that I can only print out in Adobe Reader v10. I need to go
       | through the hassle of installing and uninstalling their terrible
       | product a couple of times a year.
        
       | doersino wrote:
       | There's also Bindery, a JavaScript library for book creation
       | which also leverages the print-to-PDF feature built into modern
       | browsers: https://evanbrooks.info/bindery/
       | 
       | On top of that and the in-browser Markdown renderer Markdeep,
       | I've built a tool for typesetting undergraduate theses:
       | https://github.com/doersino/markdeep-thesis/
       | 
       | And, coincidentally, just a few days ago I've written a blog post
       | about controlling the settings in Chrome's "Print" dialogue with
       | CSS (other browsers don't support many of the relevant features):
       | https://excessivelyadequate.com/posts/print.html
        
       | wyck wrote:
       | The Achilles heel of PDFs is that they don't have responsive
       | layouts. It's so bad the Adobe team created an AI to resize
       | layouts, yes an AI in the cloud, available only on the Adobe app.
       | How insanely bad is your file format that you need an AI to
       | resize layouts in 2021? If anyone has had to handle layouts
       | programmatically I'd think they would agree that PDFs are the
       | most outdated ass backwards file format in existence.
        
         | necovek wrote:
         | PDF is an attempt at a non-Turing-complete, simpler PostScript
         | (PS). It comes from a time of paged media de facto ruling the
         | world. Changing layouts was never the goal, because PDF was the
         | output format.
         | 
         | In case of academic research papers typeset with LaTeX, the
         | source file is something you'd likely want to consider the
         | semantic equivalent of HTML. TeX should be able to render the
         | same document with different output constraints ("responsive
         | layout"), but because of the architecture (TeX itself is fully
         | Turing complete), it is pretty slow at re-rendering an entire
         | document.
         | 
         | Part of the allure of a static document format like PDF is that
         | you can, in theory, fetch just page 454 of a 6000-page document
         | and render that: with HTML, just like with TeX, you'd have to
         | get and render the entire document to be certain that the
         | layout won't change after you've processed the whole file.
        
       | TedDoesntTalk wrote:
       | I've been looking for a way to convert Wordpress blogs to PDF.
       | There are Wordpress plugins for this but I have not found any
       | that work well.
       | 
       | Can this be integrated with Wordpress?
        
         | abhinav22 wrote:
         | Should be able to - somebody with intermediate Wordpress
         | knowledge (unfortunately I don't know php) should be able to
         | integrate within a day in my opinion, based on my understanding
         | of web development
        
       | janandonly wrote:
       | I love "printing" webpages to PDF files. I've been doing this for
       | more than 15 years. I delete most of the images first so that I
       | end up with files of 50-500KB. I then store said file in
       | appropriately labeled directories.
       | 
       | Now 15 years later I have a private stash of websites and
       | wikipedia articles that I can consult by simply pressing
       | command+spacebar (the files are indexed in MacOS search).
       | 
       | To make a PDF file out of a website I currently use
       | Printfriendly.com, but it wasn't always this way.
       | 
       | Back in the day I loved to use Arc90's Readability (a Firefox
       | extension). I don't know what happened to that extension, but
       | there are plenty of old HN articles about that wonderful plug-in:
       | 
       | Post from 2010, probably I started using it right after finding
       | this post... https://news.ycombinator.com/item?id=1153343
       | 
       | https://news.ycombinator.com/item?id=3246081
       | 
       | https://news.ycombinator.com/item?id=3243097
       | 
       | My joy knows no bounds !!
       | 
       | I actually ducked for "What happend to arc90.com ?" and found as
       | the 7th item in the list this website:
       | https://ejucovy.github.io/readability/
       | 
       | It still hosts a working version !!!
       | 
       | Okay kids, use these settings and thank me later:
       | 
       | * Style: Athelas
       | * Size: small
       | * Margin: narrow
       | * Convert hyperlinks to footnotes
       | 
       | Whenever a page is worthy of saving, press the button for
       | Readability, then press Ctrl+P and save to PDF... that's it.
        
       | john-doe wrote:
       | Great paged.js tutorial, thank you for publishing it.
        
         | abhinav22 wrote:
         | Thank you! Paged.js is really such an awesome tool.
         | 
         | I was searching for many weeks for something like this, so I
         | really think the word needs to get out there more. It could
         | significantly improve the workflows of many people who are self
         | writing / self publishing as it opens up the power of CSS and
         | HTML (which allows you to nicely define formatting templates and
         | use code to automate content generation) to pdf reports (which
         | I think has its place).
         | 
         | I haven't used pandoc, but I think a HTML/CSS/Paged.js workflow
         | could challenge it.
         | 
         | At work I'm already converting many processes to it - I have a
         | database of content and then use SQL queries to extract data
         | and then generate beautiful PDFs through paged.js.
         | 
         | It also works well with mathematical typesetting (via MathJax).
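
A rough sketch of the "SQL query to paged.js-ready HTML" workflow mentioned in the comment above, in Python. Everything here is illustrative: the table, the sample rows, and the function names are made up, and the script URL is an assumption about where a paged.js polyfill build is hosted. The output is an HTML string that you would open in a browser and print to PDF.

```python
import sqlite3
from html import escape

# Assumption: the paged.js polyfill is loaded from this CDN path; swap in
# whatever copy of paged.js you actually use.
PAGEDJS_URL = "https://unpkg.com/pagedjs/dist/paged.polyfill.js"


def query_to_html(conn, sql, title):
    """Run sql and return a standalone HTML page with the result as a table."""
    cursor = conn.execute(sql)
    headers = [col[0] for col in cursor.description]
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in cursor
    )
    return (
        "<!DOCTYPE html><html><head>"
        f"<title>{escape(title)}</title>"
        f'<script src="{PAGEDJS_URL}"></script>'
        "<style>@page { size: A4; margin: 20mm; }</style>"
        "</head><body>"
        f"<h1>{escape(title)}</h1>"
        f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
        "</body></html>"
    )


def demo():
    """Build a tiny in-memory database (hypothetical data) and render it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue (quarter TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO revenue VALUES (?, ?)", [("Q1", 120), ("Q2", 150)]
    )
    return query_to_html(
        conn, "SELECT quarter, amount FROM revenue ORDER BY quarter", "Revenue"
    )
```

The `@page` rule is where paper size and margins would be set; values are escaped before being placed in the markup so database content cannot break the page structure.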
        
       | ChuckMcM wrote:
       | This is very nice. I keep wondering when we'll see a resurgence
       | of magazines which use a system like this.
        
       ___________________________________________________________________
       (page generated 2021-04-04 23:00 UTC)