[HN Gopher] PDF/A-3, PDF for Long-Term Preservation, Use of ISO ...
___________________________________________________________________
PDF/A-3, PDF for Long-Term Preservation, Use of ISO 32000-1...
(2020)
Author : gabrielsroka
Score : 40 points
Date : 2023-03-21 19:00 UTC (4 hours ago)
(HTM) web link (www.loc.gov)
(TXT) w3m dump (www.loc.gov)
| jimjimjim wrote:
| background info if useful:
|
| PDF/A is a specification that limits what features of PDF are
| allowed. The purpose is to not allow features that may be
| problematic for archiving.
|
| Initially PDF/A was really strict and prohibited things like
| transparency (since it affected reproducibility when printing),
| embedded files, etc.
|
| Then people requested less restricted versions to allow more
| archiving use cases.
|
| But even the newer, less restrictive versions have a better-defined
| and more verifiable specification than the main PDF
| specification.
| chasil wrote:
| The big restriction is that the classic PostScript typefaces
| cannot be assumed to be available (no Times, Helvetica, or Zapf
| Dingbats), so the PDF file must bundle any fonts it uses.
|
| The pdfsizeopt package will make any PDF smaller, and I think
| it deletes letters/characters from the included font that are
| not used.
|
| https://github.com/pts/pdfsizeopt
| brookst wrote:
| Preserving PDFs for future generations is like preserving
| radioactive waste for them. It's inevitable they'll end up with
| lots, and they won't thank us for it, but we should at least try
| to contain the mess.
| cm2187 wrote:
| I love the idea of the martians having invaded earth, wiped out
| humanity, but when they opened their first pdf file, got all
| their files crypto-locked. A modern version of War of the
| Worlds.
| tannhaeuser wrote:
| I'm not really getting it. Aren't RFCs written in a
| straightforward wiki syntax? Then why would they be preserved
| as PDF, and how can XML be the source format, or be
| considered useful as the canonical or authoring format, when the
| existence of thousands of RFCs in plain text/light wiki syntax
| clearly says otherwise?
| gabrielsroka wrote:
| I think the FAQ I linked to below addresses some of these
| https://www.rfc-editor.org/rse/format-faq/
| shfiuewgieug wrote:
| [flagged]
| shfiuewgieug wrote:
| [flagged]
| gabrielsroka wrote:
| See also https://www.rfc-editor.org/rse/format-faq/
| makkesk8 wrote:
| if only there were an open source and easy to use pdf library
| with pdf/a support :/
| gabrielsroka wrote:
| It looks like the IETF has some tools for this. A quick search
| revealed https://github.com/ietf-tools/ietf-at
| zokier wrote:
| The mentioned RFC PDFs are generated with WeasyPrint, which
| apparently gained PDF/A support last year.
| https://www.courtbouillon.org/blog/00028-weasyprint-56
| ggm wrote:
| Embedding the input formatting directives is neat!
| zokier wrote:
| Reading the descriptions of A-3 and A-4, to me it sounds like
| PDF/A jumped the shark and for archival purposes the old A-2
| might still be the best variant.
|
| In general, embedding files in PDF is a kinda neat capability, like
| the example of having a (CSV) dataset embedded in a report or
| something like that. But at the same time I get the feeling that
| it's an indication of a general shortcoming of our file handling
| that it makes sense to use PDF as a container format. ZIP files
| and such are pretty crude formats for higher-level file bundles
| and the UX falls short too.
| tpmx wrote:
| [flagged]
| giantrobot wrote:
| The PDF/A versions are subsets of PDF specs that are
| specifically aimed at archiving. They forbid features like
| encryption and font linking which would affect access years or
| decades from now.
| maxerickson wrote:
| Oh wah.
|
| Anyway, they do actually have a "what for" at the link:
| https://www.loc.gov/preservation/digital/formats/fdd/fdd0003...
| cookiengineer wrote:
| > very portable C implementations
|
| Did you mean "very portable exploitable implementations"?
|
| Sorry, but claiming PDF is stable is absurd, to say the least.
| Before pdf.js got embedded in web browsers, mobile devices,
| smartphones, and gaming consoles were usually exploited through
| their PDF parsers.
|
| Windows' biggest attack surface is still Outlook and PDF files.
|
| So I'd argue that PDF has too large an attack surface, which
| must be reduced for better archiving purposes without side
| effects.
| gabrielsroka wrote:
| Full title is "PDF/A-3, PDF for Long-term Preservation, Use of
| ISO 32000-1, With Embedded Files"
|
| > the new publishing framework, known as "V3", for RFCs from the
| IETF (Internet Engineering Task Force). V3 uses an XML document
| as the master format from which plain text, HTML, and PDF
| versions are derived. The PDF is a PDF/A-3u document with the XML
| master embedded. The first RFC published in the new format was
| RFC 8650 [0], published in November 2019. For more background on
| this choice, see RFC 7995: PDF Format for RFCs (December 2016)
| [1] and additional Useful References below.
|
| [NOTE, I changed the links below slightly to point to the actual
| new HTML format]
|
| [0] https://www.rfc-editor.org/rfc/rfc8650.html
|
| [1] https://www.rfc-editor.org/rfc/rfc7995.html
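Several comments above describe what PDF/A allows and forbids (no encryption, and embedded files from A-3 onward), and the quoted text says the RFC PDFs are PDF/A-3u with the XML master embedded. A rough way to see what a given file claims is to look for the `pdfaid:part` and `pdfaid:conformance` entries in its XMP metadata, plus the `/Encrypt` and `/EmbeddedFiles` keys. The sketch below is only a byte-level heuristic under stated assumptions: the helper name `pdfa_hints` is made up for illustration, and it only sees metadata and keys that are stored uncompressed. It is not a validator; a real check would need a dedicated tool such as veraPDF.

```python
import re

# Heuristic scan only, NOT a PDF/A validator. Assumes the XMP packet and the
# relevant dictionary keys appear uncompressed in the file bytes.
# Matches both XMP forms: <pdfaid:part>3</pdfaid:part> and pdfaid:part="3".
_PDFAID = re.compile(
    rb"pdfaid:(part|conformance)\s*"
    rb"(?:=\s*[\"']([^\"']+)[\"']|>\s*([^<\s]+)\s*<)"
)


def pdfa_hints(data: bytes) -> dict:
    """Return rough hints about what a PDF claims (hypothetical helper)."""
    claims = {}
    for key, attr_value, elem_value in _PDFAID.findall(data):
        value = (attr_value or elem_value).decode("ascii", "replace")
        # Keep the first occurrence of each key.
        claims.setdefault(key.decode("ascii"), value)
    return {
        "part": claims.get("part"),                # e.g. "3" for PDF/A-3
        "conformance": claims.get("conformance"),  # e.g. "U" for level u
        "has_encrypt_key": b"/Encrypt" in data,    # encryption is forbidden
        "has_embedded_files": b"/EmbeddedFiles" in data,
    }
```

To try it on a real file, something like `pdfa_hints(open("some.pdf", "rb").read())` should work, where `some.pdf` is a placeholder path. A `None` for `part` only means the claim was not found in plain bytes, not that the file is non-conforming.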
___________________________________________________________________
(page generated 2023-03-21 23:01 UTC)