[HN Gopher] Show HN: Beautiful PDFs from HTML
___________________________________________________________________
Show HN: Beautiful PDFs from HTML
Author : abhinav22
Score : 345 points
Date : 2021-04-04 18:16 UTC (4 hours ago)
(HTM) web link (pdf.math.dev)
(TXT) w3m dump (pdf.math.dev)
| kartoshechka wrote:
| It should be the other way. PDF ought to die and let documents be
| parsable
| synergy20 wrote:
| PDF is elegant to my eyes. To make it a web surfing experience
| for me, it lacks one thing: I need the TOC on the left or right
| side so I can click and jump to various sections quickly.
| abhinav22 wrote:
| The table of contents is itself hyperlinked. And at the bottom
| of each page is a link to return to the table of contents.
|
| Let me know if that works well?
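|
| A minimal sketch of a hyperlinked table of contents like the one
| described above, assuming the section titles are known up front.
| The names slugify and build_toc are hypothetical, and the "back to
| contents" link is placed under each heading here rather than in a
| page footer, which is a simplification.

```python
import re
from html import escape


def slugify(title):
    """Turn a heading into an anchor id, e.g. 'Page Layout' -> 'page-layout'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_toc(headings):
    """Return (toc_html, section_headers) for a list of heading titles."""
    items = []
    sections = []
    for title in headings:
        anchor = slugify(title)
        # Each TOC entry links to the id of its heading.
        items.append(f'<li><a href="#{anchor}">{escape(title)}</a></li>')
        # Each heading carries the id and a link back to the TOC.
        sections.append(
            f'<h2 id="{anchor}">{escape(title)}</h2>\n'
            '<p><a href="#toc">Back to contents</a></p>'
        )
    toc = '<ul id="toc">\n' + "\n".join(items) + "\n</ul>"
    return toc, sections


toc, sections = build_toc(["Page Layout", "Headers & Footers"])
print(toc)
```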
| raosl wrote:
| Is there any option to automatically trigger saving such a PDF as
| a file?
| hda111 wrote:
| I would prefer if it was a single page like iOS does it when a
| screenshot of a webpage is made where it offers to export a PDF
| with everything.
| raosl wrote:
| Is there an option to automatically trigger saving such a PDF
| using JS?
| GrumpyNl wrote:
| There should be a button on there that converts the site to a
| pdf.
| hu3 wrote:
| Are there any report generation libraries built on top of
| paged.js?
|
| I've been looking for a report generation solution based on
| frontend technology.
|
| Btw. This is great. Thanks for sharing.
| dreix wrote:
| I started generating PDFs before paged.js and use my own
| Electron-based solution. Search for schild.report on GitHub to
| see how it works. Basically, reports are created via Svelte
| templates and then printed via the Electron print API. Works
| extremely well.
| nojito wrote:
| Yup. I use pagedown all the time.
|
| https://pagedown.rbind.io/
| abhinav22 wrote:
| Thanks for your kind words.
|
| Would you be able to expand on what you mean by "report
| generation libraries"?
|
| For example, I am building (in Common Lisp, but it's trivial
| and can be done in any software) a tool to read content from a
| database and auto generate the HTML markup for producing pdf
| reports. This allows me to reuse content across reports and
| also leverage the full power of databases (text search in
| particular). As another example, I have many monthly financial
| metrics - I will store these in a database then use my lisp
| markup tool to generate the necessary HTML to produce the pdf
| report (via paged.js).
|
| In addition, one can use headless chrome to automate the full
| workflow so that the reports are generated directly from your
| program and not via File > Print in your browser.
|
| Was that what you were thinking of?
|
| You can also add charts via Chart.js.
|
| The beauty of paged.js is that you can leverage many of the
| features of browsers and JavaScript libraries in your report
| generation.
|
| I wasn't able to get syntax highlighting for code blocks to
| work however, need to dig into that a bit more.
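|
| A minimal sketch of the database-to-report workflow described
| above, written in Python rather than Common Lisp. The metrics
| table, the function names, and the "chromium" binary name are all
| assumptions for illustration. The --headless and --print-to-pdf
| flags are real Chrome/Chromium flags, but a page that is laid out
| by JavaScript (as with paged.js) may need extra time to finish
| before printing, which this sketch does not handle.

```python
import sqlite3
from html import escape


def build_report_html(conn, title):
    """Read metrics from the database and emit a simple HTML report."""
    rows = conn.execute(
        "SELECT metric, value FROM metrics ORDER BY metric"
    ).fetchall()
    body = "\n".join(
        f"<tr><td>{escape(metric)}</td><td>{value:,.2f}</td></tr>"
        for metric, value in rows
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n"
        f"<table>\n{body}\n</table>\n</body></html>"
    )


def chrome_pdf_command(html_path, pdf_path, chrome="chromium"):
    """Build (but do not run) a headless Chrome print-to-PDF command."""
    return [chrome, "--headless", f"--print-to-pdf={pdf_path}", html_path]


# Hypothetical sample data standing in for monthly financial metrics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("Revenue", 1250000.0), ("Costs", 830000.5)],
)
html = build_report_html(conn, "Monthly Metrics")
print(html)
print(chrome_pdf_command("report.html", "report.pdf"))
```

| The command list can be handed to subprocess.run once the HTML has
| been written to disk, so no manual File > Print step is needed.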
| hu3 wrote:
| > Was that what you were thinking of?
|
| Yes! This is great. Thanks for the pointers. I'll look into
| that.
| brightball wrote:
| My question on these is always how it handles multi page
| documents. Most DOM and CSS approaches to HTML to PDF require it
| to fully render everything for placement before it can convert.
|
| If you have to render something big, like a 300 page member
| directory for example, the approach will blow up.
| idreyn wrote:
| I used paged.js in production and while it's definitely not
| fast enough to run during an HTTP request, it can render a 300
| page document reasonably quickly, definitely within the 120
| second TTL of our worker tasks. It can be quite finicky though
| and sometimes stalls on things like images that are taller than
| a single page.
| runxel wrote:
| For anybody working with pandoc you should try Weasyprint [0].
|
| [0]: https://weasyprint.org/
| sonthonax wrote:
| Weasyprint is great, but be prepared to start hacking its
| layout engine for more complex generations.
|
| It's nowhere near as mature as PrinceXML.
| gnicholas wrote:
| FYI typo on "foooters"
| exhilaration wrote:
| Does anyone know how this compares to PrinceXML / DocRaptor? We
| pay them big bucks for PDF generation (invoices specifically) and
| so far we haven't found anything comparable.
| abhinav22 wrote:
| I was a huge supporter of PrinceXML / DocRaptor for precisely
| the same reason - all the alternatives (that I knew at the
| time) were not good.
|
| Paged.js was a revelation (thanks to HN for telling me about
| it!). It is based off CSS Print Specifications like PrinceXML
| (as is my understanding - I'm about 95% sure), and to me it's
| even better because it utilizes all the other front end
| technologies directly from your browser - I think there are
| some use cases where PrinceXML won't be able to get the same
| functionality.
|
| For invoices, I think you should be able to easily switch over.
| Based on what I can see.
| jitl wrote:
| Have you tried using Chromium print-to-pdf via an API like
| Puppeteer or Playwright? Combined with something like paged.js
| and a decent print stylesheet, you can get pretty good quality
| output.
| cdolan wrote:
| I also pay a lot per month to DocRaptor and they've been very
| reliable to date, but unless I'm rendering charts with JS I
| feel like I'm really overspending
| Semaphor wrote:
| Not sure about paged.js, but we use Weasyprint for invoices. It
| isn't as advanced as some of the paid options (most annoyingly
| page headers are somewhat hacky) but it works very well.
|
| My go-to for everything CSS Paged Media is [0] which has a nice
| comparison of supported features at [1]. They recently added
| Weasyprint, PagedJS and Typeset.sh
|
| [0]: https://css.paged.media/
|
| [1]: https://css.paged.media/lessons
| zulko wrote:
| A few years ago I started an alternative to PrinceXML called
| ReLaXed.js [1], it's always been sufficient for my reports but
| it may lack some pagination/layout features that Paged.js may
| have as they seem to have given this much more thought (still
| wrapping my head around whether paged.js could be "plugged
| into" Relaxed).
|
| [1] https://github.com/RelaxedJS/ReLaXed
| systemvoltage wrote:
| There is something really pleasing about reading PDFs. It's
| perhaps how static it is and it won't change on me. I can zoom in
| or "operate" on it without some reorganization. It puts the mind
| to ease. There is no reflowing. There is no columns shifting.
| It just is. Like a piece of paper as an analog - the intent of
| the author and the designer is retained and frozen in time. Fonts
| are embedded and chosen by the creator. Haters of PDFs do not
| understand the human aspects of it - they just see it as a
| specification (which is convoluted).
| agumonkey wrote:
| I've been obsessed with dynamicity and having everything
| adaptable.. but yeah static stuff has this weird feeling of
| 'right'.
| tambourine_man wrote:
| Copy/pasting can be infuriating though. The web is catching up
| fast however, but at least I can view source/inspect.
| tyingq wrote:
| _"It's perhaps how static it is and it won't change on me."_
|
| You might be surprised that's not always true.
|
| https://www.pdfscripting.com/public/FreeStuff/PDFSamples/Jav...
| [deleted]
| yawaworht1978 wrote:
| Furthermore, even on mobile devices, everything just works.
| Ctrl+f is non invasive, if I close the document in the middle
| of 100s of pages and reopen the document, I do not need to
| scroll, unlike many web apps it preserves the scroll location.
| It does what it needs to do without any hiccups.
| cryptoz wrote:
| I have the opposite experience. To me, opening a PDF puts me on
| edge. My computer is likely to slow to a crawl. It never
| renders correctly - sometimes it will slowly render each page,
| one at a time, flashing the screen as it goes. Or maybe the PDF
| is using some features that my reader doesn't support so it
| renders incomplete and incorrectly.
|
| Links are often hard to pick out. What is a link and what
| isn't? What happens when I click on something, is it going to
| stay in the PDF or open a browser or something?
|
| Don't get me started on moving around in PDFs. There are always
| 2 sets of page numbers, one for the PDF and one of the
| document. Extremely confusing.
|
| Searching. Ugh, searching a PDF is a nightmare I don't want to
| even think about right now. Ctrl+F is broken 99% of the time.
|
| Or at least, that's my experience over the last 20 years. Sure,
| it's gotten better recently, but, not enough to make my mind
| 'at ease' exactly. Very stressful to open a PDF still, usually.
| chmod775 wrote:
| > To me, opening a PDF puts me on edge. My computer is likely
| to slow to a crawl. It never renders correctly - sometimes it
| will slowly render each page, one at a time, flashing the
| screen as it goes.
|
| That probably isn't the fault of the PDF, but the PDF reader
| you're using.
|
| > Searching. Ugh, searching a PDF is a nightmare I don't want
| to even think about right now. Ctrl+F is broken 99% of the
| time.
|
| Now this is actually the fault of PDF and how it does
| positioning of stuff within it - but about 50% of the blame
| lies with whichever software generated a shit PDF.
| LordDragonfang wrote:
| >Now this is actually the fault of PDF and how it does
| positioning of stuff within it - but about 50% of the blame
| lies with whichever software generated a shit PDF.
|
| Arguably, it's still the fault of the horribly
| overcomplicated pdf spec -- html manages to do it just
| fine, with a plain text format, to boot
| diarrhea wrote:
| > _Don't get me started on moving around in PDFs. There are
| always 2 sets of page numbers, one for the PDF and one of the
| document. Extremely confusing._
|
| It's only a tiny subset of all PDFs in circulation, but the
| LaTeX PDFs I produce using appropriate settings (mainly
| KOMAScript class) always nail this. The current page number
| always corresponds to what is _printed_ in the PDF. This can
| be alphanumerical (e.g. page "a" / 300, where 300 is the
| total number of all pages) or roman, for the frontmatter. The
| PDF viewer will then literally show e.g. "Page XII / 300".
|
| So in that sense, it's in the hands of the party producing
| the PDF to get this right, not an inherent limitation in the
| standard.
|
| But now, new problems arise. If you're on the printed page
| _XII_ but your viewer displays "Page 22 / 300", you know
| where you are in total. "Page XII / 300" is "correcter" but
| can be anything.
|
| > _Searching. Ugh, searching a PDF is a nightmare I don't
| want to even think about right now. Ctrl+F is broken 99% of
| the time._
|
| Don't share this experience. It's at the same level as in
| browsers, where CTRL+F is also quite limited (I'd give a
| kidney to have regex available everywhere---ripgrep-all gets
| close on the desktop). The only different thing in PDFs is if
| hyphenation occurs, which is arguably less common in browsers
| (simply because of poorer typographical standards/people care
| more in proper PDFs). Your search term will indeed be
| invisible to CTRL+F. The only other time it breaks down in
| PDFs is if the PDF is corrupt/poorly produced/bad OCR.
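|
| A minimal sketch of why hyphenation hides a search term, assuming
| the text has already been extracted from the PDF with its line
| breaks intact. The helper name dehyphenate is hypothetical, and
| the rule is naive: it would also join a genuine compound such as
| "well-" / "known" into "wellknown".

```python
import re


def dehyphenate(text):
    """Join words split as 'hyphen-' + newline + 'ated' back together."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


page = "Search fails on hyphen-\nated words."

# A plain substring search misses the word that was split across lines.
print("hyphenated" in page)
# After joining the split word, the same search finds it.
print("hyphenated" in dehyphenate(page))
```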
| necovek wrote:
| PDFs can be produced by mapping completely unrelated
| characters to glyphs that appear as actual characters
| (basically, they'd embed a "compacted" font that only has
| the glyphs required for the document, and then map them to
| ASCII or something). This was quite common with pdftex
| documents not using ASCII characters in the past, thus
| making text unsearchable (and even more so when going the
| ps2pdf route). For example, you'd have a Cyrillic document
| in which, when you selected text, the highlighted text would
| be some jumbled ASCII characters.
|
| The fact is that PDF can display one thing and have
| underlying semantic text be something else entirely
| (frequently used for OCRing: you show actual scanned images
| of text, and put the invisible OCRed text as searchable).
|
| It works in the other direction too: you could solve the
| hyphenation problem in the same way by having PDF include
| invisible non-hyphenated word in place of the hyphenated
| one for searching.
|
| Still, PDF is mostly a laying-out format, and while tools
| have evolved to provide some "meaning" to rendered content,
| it is never going to be semantic in the sense markup
| languages can be (i.e. there is no "emphasis", "quote" or
| "header" command for PDF, instead, it just uses a different
| font). To put things into perspective, TeX files can be
| semantic (if a semantic TeX .fmt like LaTeX is used) like
| HTML/ePub, but PDF is an output format, just like DVI is.
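|
| A toy illustration of the subset-font problem described above. It
| is not how any particular PDF producer works: the functions
| subset_encode and extract_text are hypothetical, and the reverse
| table only plays the role that a ToUnicode map plays in a real
| PDF. Each distinct character gets the next free code starting at
| "A", so without the reverse table, selecting or searching the text
| yields unrelated ASCII.

```python
def subset_encode(text):
    """Assign each distinct character the next free code, starting at 'A'."""
    code_for_char = {}
    codes = []
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = chr(ord("A") + len(code_for_char))
        codes.append(code_for_char[ch])
    # Reverse table: what a viewer needs to recover the real characters.
    to_unicode = {code: ch for ch, code in code_for_char.items()}
    return "".join(codes), to_unicode


def extract_text(codes, to_unicode=None):
    """Return the raw codes, or the real text if a reverse table is given."""
    if to_unicode is None:
        return codes
    return "".join(to_unicode[c] for c in codes)


codes, to_unicode = subset_encode("привет мир")
print(extract_text(codes))              # jumbled ASCII, unsearchable
print(extract_text(codes, to_unicode))  # original Cyrillic text
```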
| bqmjjx0kac wrote:
| I don't have much experience with PDF as a spec, but I guess
| I'm a "PDF hater."
|
| It's the things that didn't need to be PDFs, but inexplicably
| are, that annoy me. Like data dumps from local governments that
| could have been machine-readable, or announcements that are
| distributed in print and emailed as PDFs, rather than lifting
| the content into the message body.
| diarrhea wrote:
| Agreed, but there exist solutions [0] to make even PDF tables
| machine-readable (optionally making use of machine learning
| techniques). It's incredibly backwards and much harder than,
| say, CSV, but it might get the machine-reading job done.
|
| 0: https://camelot-py.readthedocs.io/en/master/
| BlueTemplar wrote:
| Also pdf has very bad support of animation formats.
| tonyedgecombe wrote:
| That's a good thing, I don't want to see animations in a
| PDF.
| BlueTemplar wrote:
| And I would like to have an actual portable document
| format for electronic documents, but since mhtml support
| was for some reason dropped except in e-mail clients (as
| .eml), I'm stuck with this frankenformat that is pdf !
| axiolite wrote:
| > data dumps from local governments that could have been
| machine-readable, or announcements that are distributed in
| print and emailed as PDFs, rather than lifting the content
| into the message body.
|
| You should thank PDF for giving you any useful electronic
| copies at all.
|
| If it's scanned-in papers, sticking them loosely in an e-mail
| or web page would be much more difficult to read through.
|
| If it's text data, then perhaps it was primarily composed to
| be printed, and PDF allows easy creation of readable
| electronic copies with minimum of effort from any input.
| Before PDF you might have gotten nothing at all, because most
| people don't have readers for various obscure proprietary
| input formats.
|
| And PDF is far easier than other formats to convert into
| another format for your own consumption. Do you have a
| command-line tool which will extract the embedded images out
| of a Microsoft Word document? Or one that will convert it to
| plain text, preserving formatting? pdfimages and pdftotext
| -layout are very widely available.
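|
| A minimal sketch of scripting the two tools mentioned above,
| assuming the poppler-utils versions of pdftotext and pdfimages are
| installed and on the PATH. The function names and file names are
| hypothetical.

```python
import subprocess


def poppler_commands(pdf_path, out_prefix):
    """Build the command lines for text and image extraction."""
    return [
        # Plain text, keeping the physical layout of the page.
        ["pdftotext", "-layout", pdf_path, f"{out_prefix}.txt"],
        # Embedded images, written as PNG files named after out_prefix.
        ["pdfimages", "-png", pdf_path, out_prefix],
    ]


def extract(pdf_path, out_prefix):
    """Run both tools; raises CalledProcessError if either one fails."""
    for cmd in poppler_commands(pdf_path, out_prefix):
        subprocess.run(cmd, check=True)


for cmd in poppler_commands("report.pdf", "report"):
    print(" ".join(cmd))
```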
| Aeolun wrote:
| > You should thank PDF for giving you any useful electronic
| copies at all.
|
| I think the point is that data dumps in PDF format are not
| useful at all.
|
| I take objection to your statement that it's easier to
| convert. The only reason there are so many tools to do so
| is because it's so hard/impossible in the first place.
| systemvoltage wrote:
| You're right - it's just the wrong format and it isn't
| intended for that. Gov should be publishing
| text/csv/parsable-formats not PDFs when it comes to data
| dumps.
| BostonEnginerd wrote:
| Our city council posts PDF files which consist of scans of
| documents someone printed out. Sometimes they turn on OCR,
| but not often.
| Lammy wrote:
| Also they can be justified in ways browsers can't (yet) do.
| Every web page has a ragged right.
| jimmygrapes wrote:
| I have been trying to place the justification format I have
| seen in some print books. In most cases it will add spaces
| between words to make up for sentences with words that can't
| be split easily, but it also doesn't fully right-justify the
| last letter of the line. It seems to be something like "add
| inter-word spacing _unless_ spacing exceeds X"
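|
| A minimal sketch of that rule for monospaced text, assuming the
| threshold X is a maximum number of spaces per gap. The function
| name justify_line and the default max_gap are hypothetical, and
| this is a per-line greedy rule, not the TeX algorithm.

```python
def justify_line(words, width, max_gap=3):
    """Pad gaps between words to reach `width`, unless a gap would exceed max_gap."""
    if len(words) < 2:
        return " ".join(words)
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    if spaces < gaps:
        return " ".join(words)  # line is already too long; leave it alone
    base, extra = divmod(spaces, gaps)
    widest = base + (1 if extra else 0)
    if widest > max_gap:
        return " ".join(words)  # stretching would look bad; keep ragged right
    out = []
    for i, word in enumerate(words[:-1]):
        # The first `extra` gaps get one additional space each.
        out.append(word + " " * (base + (1 if i < extra else 0)))
    out.append(words[-1])
    return "".join(out)


print(repr(justify_line(["add", "inter", "word", "spacing"], 24)))
print(repr(justify_line(["a", "b"], 10)))
```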
| notriddle wrote:
| The gold standard for computer text justification is the
| TeX algorithm: https://en.wikipedia.org/wiki/TeX#Hyphenatio
| n_and_justificat...
| chmod775 wrote:
| Like this (the demo at the top is interactive)?
|
| https://developer.mozilla.org/en-US/docs/Web/CSS/text-align
| KMnO4 wrote:
| Interesting, there's absolutely no difference between
| justify and align left on iOS Safari. I wouldn't have
| guessed that's an unsupported feature. I guess it's not
| common enough for me to notice throughout the web.
| abhinav22 wrote:
| Justify works in iOS Safari (I just checked). I like to
| use it in my designs sometimes :)
| URSpider94 wrote:
| Works fine for me.
| Lammy wrote:
| No, like 'text-justify: inter-character':
| https://developer.mozilla.org/en-US/docs/Web/CSS/text-
| justif...
| maga wrote:
| Maybe it's simply a force of habit at this point, but I can
| only read/study/memorize technical stuff from PDF/paginated
| source--I memorize the overall "picture" of the page and its
| location along with the bits I actually need, and it's really
| tough to do with non-paginated sources.
| imposterr wrote:
| Yes! I totally get this. I think it's akin to the "mind
| palace" method popularized in media.
| maga wrote:
| That's an apt observation, actually. I used to practice
| mind palace when I had to memorize lists, and I do feel
| more comfortable when information is "physically" placed
| somewhere like a page, I guess it's all connected.
| hikarudo wrote:
| I wonder if source code could be read and understood more
| effectively if it were paginated.
| jhgb wrote:
| I'm happy that someone came Forth with that idea here.
| Jiocus wrote:
| Source code often does facilitate its own method of
| paginating, into different files. One could argue that is
| the whole point of the practice.
|
| Pagination or not, both pagination and files provide some
| degree of spatial sense just as 'loci' and memory
| palaces.
|
| Edit addendum:
|
| There are theories about which senses are our dominant
| ones, and how they affect our learning processes. Some
| may lean towards visual cues in their mental life, others
| on kinetic or sound. Personally I experience my mental
| models as spatial. Even abstract thoughts become situated
| "somewhere", if not by itself, then by contrast of other
| things on my mind.
|
| "Everything is a Memory palace."
|
| Needless to say, when I'm deep off in a terminal with
| something, I don't think I'd describe it as text-based.
| necovek wrote:
| Source code has a bunch of other properties that make
| pagination less useful.
|
| I.e. we strive for short functions, we use indentation
| heavily, it is commonly rendered in fixed-width fonts
| (this helps with spatial memory/overview too), etc.
| hikarudo wrote:
| Good points. Also, code can change frequently, so the
| "visual memory" reinforced by pagination becomes less
| useful, maybe even a hindrance.
| jwr wrote:
| The great thing about PDFs is that you can open huge documents
| and page through them quickly. I feel this is under-appreciated
| in today's world, where scrolling is being forced on us
| everywhere. Scrolling sucks for reading text. Every time you
| scroll, you have to pay attention to how much you scrolled, and
| then find your place in the text again.
|
| As for paging speed, just try using GoodReader or PDF Expert on
| an iPad. I can flip through thousand-page manuals and
| datasheets as quickly as if it were a paper book. And a 12"
| iPad shows an entire A4 page without the need for zooming and
| panning.
|
| In my experience, people who dislike reading PDFs have only
| tried doing so in Acrobat Reader (which is hot garbage, and
| slow), on a small screen that is wider than it is tall, zoomed
| in so that only half a page is being shown. That is a sub-par
| experience indeed.
| BlueTemplar wrote:
| Be it pdfs or html, I find my place through chapters, rather
| than pages.
| pvorb wrote:
| Haven't you ever been forced to stop reading in the middle
| of a chapter? It happens to me all the time.
| BlueTemplar wrote:
| Depends what you mean.
|
| Temporary interruptions yes. But then the location is
| kept.
|
| Interruptions for a very long time ? I might have to
| reread the whole chapter anyway...
| Thrymr wrote:
| > I feel this is under-appreciated in today's world, where
| scrolling is being forced on us everywhere. Scrolling sucks
| for reading text. Every time you scroll, you have to pay
| attention to how much you scrolled, and then find your place
| in the text again.
|
| This is incredibly important, and something that dedicated
| book readers like Kindles get right, but I've never seen done
| well in long web pages. Discrete "pages" (that correspond to
| "screens") make it much easier to find your place as you go
| to the next page. Note that multipart web pages often have
| you scroll through each "page" separately, and give you the
| worst of both worlds. Sure, PDF isn't always best for reading
| on a computer or phone screen, but infinite scrolling is
| annoying too.
| bobbylarrybobby wrote:
| Isn't zooming in and having text reflow a feature, not a bug,
| of HTML? PDFs are pretty much impossible to read on a phone
| because of the endless amount of zooming in and out and
| horizontal scrolling (unless they were designed for mobile --
| and then they're hard to read on a desktop). Never mind users
| on a desktop who just like their text large for ease of reading
| -- their screen might not be wide enough to fit the text
| without horizontally scrolling.
|
| As an author, my intent is that the content be easily readable
| to all readers. I don't see why I should want or get to dictate
| the layout and aesthetics to my readers.
| lolinder wrote:
| What's a feature in one context can be a bug in another. I
| get where OP is coming from: when I zoom in and the text
| reflows, it's easy to lose my place. PDFs don't have that
| problem.
|
| Also, digital, free-flow media lose basically all sense of
| space. PDFs are much better for finding a piece of content
| again later, because I can remember the location on the page
| and roughly how many pages into the document.
| kwhitefoot wrote:
| I wonder why browser bookmarks don't save the position.
| necovek wrote:
| The answer is likely obvious in that it depends on what
| content is the visible content at the time of the
| bookmark, which will further depend on the content itself
| (it can change since this is a bookmark on an alive web
| page), page styles, zoom level/scaling, window size etc.
|
| Basically, for a bookmark to fully store a position, it
| would have to store all of the above (and probably more),
| and it would only be really usable on the same device as
| long as the underlying content does not change.
| Gehinnn wrote:
| Usually, if you have reflow, you can disable it. However,
| if you don't have reflow, you cannot usually enable it!
| kwhitefoot wrote:
| Opera does better, or at least it used to.
| noxer wrote:
| It would not be hard to read on desktop: you can simply show
| multiple pages, like in a book where you usually see 2 pages.
| The same concept works if you have room for more pages. It's
| just not something people do (create PDFs for mobile). Almost
| all PDFs are meant to be read at approximately the size of DIN
| A4 for one page. At a time when everyone is and should be
| discouraged from printing stuff, this is not really needed.
| systemvoltage wrote:
| I think there is a spectrum of commodity vs. artistic mediums
| in all forms and we often talk past each other when debating
| the finer points. If your goal is to send out a press release
| to the public, perhaps layout/aesthetics isn't as important
| (sometimes it is though so it's not a hard/fast rule). In
| artistic media, especially in magazines and mixed-media
| books, layout and aesthetics are an integral part of print
| media. It is inseparable. Just as in music, you don't want to
| add an equalizer ruining the original intent of the artist,
| books created by artists in 1890 still are with us in print
| format - exactly how they were intended to be published to
| readers. But it is entirely different if the "music" is a
| podcast - I want to use an equalizer to bring up the higher
| frequencies for better audibility. Similarly, if I am reading
| a novel on an epaper display and I want to increase the font-
| size or type, we should allow that as you said.
| BlueTemplar wrote:
| I agree. What annoys me is when not very artistic mediums
| like scientific articles force a fixed page layout. It gets
| even worse when you have to hunt down the relevant figures
| over the following pages because they couldn't be put on
| the same page due to lack of space. Also opening a figure
| in a separate window isn't much of a thing either for pdfs.
| akhilpotla wrote:
| I definitely prefer to read research papers in html. I
| like to zoom in a lot when reading a long piece on my
| computer since it helps me read faster and keeps me from
| getting distracted. I've been thinking about working on a
| side project where I convert pdfs to html for academic
| papers.
| pvorb wrote:
| One benefit of PDFs for research papers is that you can
| easily save them to your own computer, build up a library
| of them, highlight lines with functionality built into
| most PDF readers. I generally prefer HTML for reading,
| but PDF has some benefits, too. Granted, most of these
| features are also available for HTML. But for some reason
| you need to look for browser plugins in order to
| highlight HTML pages, whereas in PDF you can just use the
| feature. And PDF is always about the content whereas HTML
| also typically contains navigation and other distractors.
| BlueTemplar wrote:
| I _really_ don't get why mhtml has been discontinued by
| browsers??
| amelius wrote:
| The necessity of zooming is a shortcoming of the device, not
| of the text format.
| samatman wrote:
| Zooming seems like an inevitable consequence of how screens
| and eyes work, what am I missing?
|
| Forget text for a second, if I want to see fine details in
| an enormous image I'm going to have to zoom in. I normally
| adjust font size rather than zooming text but it's nice to
| have both available.
| michaelgrafl wrote:
| First thing I do when opening a Word document at work is
| convert it to PDF and reopen it in Sumatra Reader.
|
| It just feels a lot better to me. It opens faster, it opens at
| the position I left it at, zooming in and out is fast,
| scrolling is smoother, and even if I wanted to, I couldn't
| modify it on accident.
|
| It just feels a lot more reified than something that is
| responsive or editable.
|
| Not great for mobile, but that's not what I care for at work.
| webwanderings wrote:
| This is cool. Exactly what I was looking for not too long ago
| (sometimes markdown does not fit all the needs for
| documentation).
|
| How about images? How do you handle images; their layout, scaling
| etc?
| abhinav22 wrote:
| Basically you can use CSS to manage their layout - dimensions,
| positioning, scaling, etc. Should work pretty well
|
| If you want floating images (e.g. text on the left, images on
| the right), it may be a bit more difficult and not perfectly
| possible. This guide will help:
| https://www.pagedjs.org/page-floats/
|
| One tricky part is if you want to have text within images and
| have them the same size as your main text (eg in MS Word where
| you can have shapes and text boxes). For that, you can probably
| get close enough with a simple image load, and more precise by
| using svg graphics, but it may result in a reasonable amount of
| complexity to make perfect (if at all).
|
| For charts, use Chart.js in my opinion.
| axlee wrote:
| What is your favourite library for printing HTML to PDF?
| airstrike wrote:
| https://pandoc.org/ + something like pdflatex
| honzajde wrote:
| Saved PDF (in Chrome) does not have a TOC as a side pane in
| Acrobat Reader. Missing a pretty important feature.
| abhinav22 wrote:
| Yes, that would be an advanced feature and I think likely out
| of the scope for paged.js. That said the table of contents page
| is hyperlinked - you can jump to sections, and I put a return
| to table of contents in the footer to aid with navigation.
| Hopefully that helps?
| anonu wrote:
| This is one of the hardest problems we face as a financial
| research platform. We have a lot of financial data in tables
| along with line, bar and pie charts. Coming up with something
| sensible and readable is a bit harder than expected. Ultimately
| we have a "json to latex" converter we built but it's not
| great...
| LaundroMat wrote:
| Maybe Vega and its Figures for Papers can help?
|
| https://vega.github.io/vega-lite/tutorials/figures.html
| TimTheTinker wrote:
| May I recommend Prince? https://www.princexml.com/
|
| I created a PDF exporter for a manual test tracking app using
| this -- render to (pretty simple) HTML, pass to the prince
| executable, and out comes a beautifully typeset PDF.
|
| Prince has its own rendering engine that is purpose-built for
| PDF rendering. It's actually very good - a lot of professional
| books and documents have been typeset using Prince.
| smt88 wrote:
| We have been using PDF2XL[1] for this for years (used to be
| called CogniView).
|
| It's genuinely unbelievable. If the PDF isn't sufficiently
| structured, it has OCR that seems to "just work".
|
| You can also automate the extraction and integrate it into your
| pipeline.
|
| The UI is pretty old and ugly-looking, but it is one of the few
| apps I've used in the last 10 years that made me feel genuine
| delight.
|
| 1. https://pdf2xl.com
| burmanm wrote:
| Anything wrong with XSL-FO? I know it's not the hottest thing
| on earth, but it works. Apache FOP is still developed and it's
| easy to add it to a pipeline.
| abhinav22 wrote:
| I work in financial reporting, doing very similar things to
| what you mentioned here. Mind dropping me an email at
| Ashok.khanna@hotmail.com and we could discuss further? Would
| love to chat to others facing similar dilemmas :)
| mywacaday wrote:
| The section numbers in the table of contents are overwriting the
| text, maybe not so beautiful.
| abhinav22 wrote:
| Thanks for flagging. It's aimed for desktop users and a
| reasonable size screen (eg A4 or so), it's not really a
| responsive design as it's meant to be a tool to generate PDFs.
| So it won't work in some browser dimensions (the paged.js code
| is reasonably complex).
|
| That said it works in iOS as far as I can test. For some reason
| page numbers in the table of contents are not working perfectly
| in Safari but Chrome works pretty well.
|
| I guess the conclusion is that this is aimed somewhat to
| desktop Chrome users as a specific tool for pdf generation.
| Tomte wrote:
| I'm seeing all page numbers as 0 in the TOC on iPadOS.
| abhinav22 wrote:
| Yep - looks like Chrome only gets those right. I will speak
| to the paged.js guys and see where the bug is in the code.
|
| Rest seems to work on Safari - let me know if any other
| issues and I'll fix / update accordingly
| sonthonax wrote:
| By using a W3C standard and devolving the layout engine to the
| browser, this solves a difficult problem the right way.
|
| I would have loved to have something like this for a project
| years ago.
| abhinav22 wrote:
| It really does. The team at paged.js are simply amazing and
| deserve so much credit.
|
| It's such a big deficiency in the modern web, we really need
| Chrome / Safari / etc to implement the W3C standard or
| something better
| sunjester wrote:
| mobile friendly?
| abhinav22 wrote:
| No, it's more for pdf production from desktop. The print
| preview on the example web page works on iPhone and browsers
| but the aim is really for pdf and not for consumption via
| mobile etc
| frongpik wrote:
| Add to this a pdf to html converter, with a focus on official
| forms (e.g. irs tax forms), ability to easily edit fields and add
| signatures (similar to how the free android adobe app does it),
| and you can charge money for it.
| abhinav22 wrote:
| Thanks - indeed there is likely a market for it. One of the
| issues is that to get a commercial app, you have to solve for
| most of the edge cases and make sure it has a good enough UI
| for non-sophisticated users.
|
| I was thinking about doing it, but it would be a lot of work to
| do right.
|
| By right, I would want it to be the quality of sublime or emacs
| / vim :) :)
| frongpik wrote:
| Actually, there's one more use case for an HTML to PDF
| converter: making a book-style copy of a multipage website.
| I'm looking right now at a scientific site with content
| spread over multiple pages and it's tiring to find and click
| all the links to make sure I don't miss anything.
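|
| As a rough sketch of the first step of such a book-style copy,
| the snippet below lists every same-host page linked from an
| index page, in document order and without duplicates, using
| only the Python standard library. The names LinkCollector and
| same_site_pages are hypothetical, and it assumes the index page
| HTML has already been downloaded; fetching each page and
| stitching them into one document is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_pages(index_html: str, base_url: str) -> list[str]:
    """Return unique same-host page URLs linked from index_html, in order."""
    parser = LinkCollector()
    parser.feed(index_html)
    host = urlparse(base_url).netloc
    pages = []
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments so each page appears once.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).netloc == host and url not in pages:
            pages.append(url)
    return pages
```

| The resulting list could then be fetched page by page and
| concatenated into a single HTML file before printing to PDF.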
| frongpik wrote:
| You'd essentially have a PDF editor that can import a PDF,
| edit it and export the HTML back to PDF. Working with
| official forms is one use case. Another is an iframe that can
| preview PDFs without resorting to plugins.
| abhinav22 wrote:
| Indeed - that makes a lot of sense
| buovjaga wrote:
| I tried out paged.js recently for a genealogical report exported
| from Gramps, but I had to use PrinceXML because counter-reset to
| start at a page does not work:
| https://gitlab.pagedmedia.org/tools/pagedjs/issues/91
|
| Apart from this feature everything worked fine.
| abhinav22 wrote:
| Do you have a repo I could play around with? I had the same
| issue, but for my use case I figured out a workaround -
| primarily by using classes. I could spend a few minutes and see
| if I can get it working for you.
|
| But yes, it's a deficiency in the system currently
| majkinetor wrote:
| Awesome project.
|
| Consider adding automatic anchors on headers in the demo so one
| can quickly copy them for sharing. Currently they can only be
| obtained from the ToC, but you need to scroll to it. On anything
| larger than a few pages, this is a must. One problem there is
| that the current automatic ids are generated sequentially and
| are not really user-friendly for link sharing.
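|
| One possible way to get readable, shareable ids is to derive
| them from the heading text instead of a counter. This is only
| a sketch in Python, not how the demo generates its ids; the
| function names slugify and assign_ids are made up for
| illustration, and it assumes headings are available as plain
| strings.

```python
import re
import unicodedata


def slugify(heading: str) -> str:
    """Turn heading text into a lowercase, hyphen-separated id."""
    # Decompose accented characters, then drop anything non-ASCII.
    text = unicodedata.normalize("NFKD", heading)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    # Fall back to a generic id if nothing usable is left.
    return text or "section"


def assign_ids(headings: list[str]) -> list[str]:
    """Return one unique id per heading, suffixing repeats with -2, -3, ..."""
    used = set()
    ids = []
    for heading in headings:
        base = slugify(heading)
        candidate, n = base, 1
        # Keep bumping the suffix until the id is not already taken.
        while candidate in used:
            n += 1
            candidate = f"{base}-{n}"
        used.add(candidate)
        ids.append(candidate)
    return ids
```

| Ids built this way stay stable when sections are reordered,
| which sequential ids do not.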
| dukeofdoom wrote:
| Adobe makes it harder and harder to use their PDF reader. I live
| in Canada, and somehow I'm given forms I legally have to use,
| that I can only print out in Adobe Reader v10. I need to go
| through the hassle of installing and uninstalling their terrible
| product a couple of times a year.
| doersino wrote:
| There's also Bindery, a JavaScript library for book creation
| which also leverages the print-to-PDF feature built into modern
| browsers: https://evanbrooks.info/bindery/
|
| On top of that and the in-browser Markdown renderer Markdeep,
| I've built a tool for typesetting undergraduate theses:
| https://github.com/doersino/markdeep-thesis/
|
| And, coincidentally, just a few days ago I wrote a blog post
| about controlling the settings in Chrome's "Print" dialogue with
| CSS (other browsers don't support many of the relevant features):
| https://excessivelyadequate.com/posts/print.html
| wyck wrote:
| The Achilles heel of PDFs is that they don't have responsive
| layouts. It's so bad the Adobe team created an AI to resize
| layouts, yes an AI in the cloud, available only on the Adobe app.
| How insanely bad is your file format that you need an AI to
| resize layouts in 2021? If anyone has had to handle layouts
| programmatically I'd think they would agree that PDFs are the
| most outdated ass backwards file format in existence.
| necovek wrote:
| PDF is an attempt at a non-Turing-complete, simpler PostScript
| (PS). It comes from a time of paged media de facto ruling the
| world. Changing layouts was never the goal, because PDF was the
| output format.
|
| In the case of academic research papers typeset with LaTeX, the
| source file is something you'd likely want to consider the
| semantic equivalent of HTML. TeX should be able to render the
| same document with different output constraints ("responsive
| layout"), but because of the architecture (TeX itself is fully
| Turing complete), it is pretty slow at re-rendering an entire
| document.
|
| Part of the allure of a static document format like PDF is that
| you can, in theory, fetch just page 454 of a 6000-page document
| and render that: with HTML, just like with TeX, you'd have to
| get and render the entire document to be certain that the
| layout won't change after you've processed the whole file.
| TedDoesntTalk wrote:
| I've been looking for a way to convert Wordpress blogs to PDF.
| There are Wordpress plugins for this but I have not found any
| that work well.
|
| Can this be integrated with Wordpress?
| abhinav22 wrote:
| Should be able to - somebody with intermediate Wordpress
| knowledge (unfortunately I don't know php) should be able to
| integrate within a day in my opinion, based on my understanding
| of web development
| janandonly wrote:
| I love "printing" webpages to PDF files. I've been doing this for
| more than 15 years. I delete most of the images first so that I
| end up with files of 50-500KB. I then store said file in
| appropriately labeled directories.
|
| Now 15 years later I have a private stash of websites and
| wikipedia articles that I can consult by simply pressing
| command+spacebar (the files are indexed in MacOS search).
|
| To make a PDF file out of a website I currently use
| Printfriendly.com, but it wasn't always this way.
|
| Back in the day I loved to use Arc90's Readability (a Firefox
| extension). I don't know what happened to that extension, but
| there are plenty of old HN articles about that wonderful
| plug-in:
|
| Post from 2010, probably I started using it right after finding
| this post... https://news.ycombinator.com/item?id=1153343
|
| https://news.ycombinator.com/item?id=3246081
|
| https://news.ycombinator.com/item?id=3243097
|
| My joy knows no bounds !!
|
| I actually ducked for "What happend to arc90.com ?" and found as
| the 7th item in the list this website:
| https://ejucovy.github.io/readability/
|
| It still hosts a working version !!!
|
| Okay kids, use these settings and thank me later:
| * Style: Athelas
| * Size: small
| * Margin: narrow
| * Convert hyperlinks to footnotes
|
| Whenever a page is worthy of saving, press the button for
| Readability, then press ctrl+P and save to PDF... that's it.
| john-doe wrote:
| Great paged.js tutorial, thank you for publishing it.
| abhinav22 wrote:
| Thank you! Paged.js is really such an awesome tool.
|
| I was searching for many weeks for something like this, so I
| really think the word needs to get out there more. It could
| significantly improve the workflows of many people who are
| self-writing / self-publishing, as it opens up the power of CSS
| and HTML (which let you define formatting templates nicely and
| use code to automate content generation) to PDF reports (which
| I think have their place).
|
| I haven't used pandoc, but I think a HTML/CSS/Paged.js workflow
| could challenge it.
|
| At work I'm already converting many processes to it - I have a
| database of content and then use SQL queries to extract data
| and then generate beautiful PDFs through paged.js.
|
| It also works well with mathematical typesetting (via MathJax).
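|
| A database-to-PDF pipeline of this kind might look roughly
| like the Python sketch below. Everything specific in it is an
| assumption for illustration: the `sections` table and its
| columns, the build_report name, the page CSS, and the unpkg
| URL for the paged.js polyfill (pin or self-host whichever
| build you actually use). It only produces the HTML; the PDF
| comes from opening that file in Chrome and printing.

```python
import sqlite3
from html import escape

# Assumed polyfill location; pin or self-host the build you actually use.
PAGED_JS = "https://unpkg.com/pagedjs/dist/paged.polyfill.js"


def build_report(conn: sqlite3.Connection) -> str:
    """Render rows of a hypothetical `sections` table as one HTML document."""
    rows = conn.execute(
        "SELECT title, body FROM sections ORDER BY position"
    ).fetchall()
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f'<script src="{PAGED_JS}"></script>',
        # Page size and margins are set with CSS paged-media rules.
        "<style>@page { size: A4; margin: 20mm; } "
        "section { break-before: page; }</style>",
        "</head><body>",
    ]
    for title, body in rows:
        # Escape database text so it cannot inject markup into the report.
        parts.append(
            f"<section><h1>{escape(title)}</h1><p>{escape(body)}</p></section>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```

| Writing the returned string to a file such as report.html and
| opening it in Chrome lets paged.js paginate it for printing.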
| ChuckMcM wrote:
| This is very nice. I keep wondering when we'll see a resurgence
| of magazines which use a system like this.
___________________________________________________________________
(page generated 2021-04-04 23:00 UTC)