[HN Gopher] Interval parsing grammars for file format parsing (2...
___________________________________________________________________
Interval parsing grammars for file format parsing (2023)
Author : fanf2
Score : 82 points
Date : 2024-08-10 17:42 UTC (5 hours ago)
(HTM) web link (dl.acm.org)
(TXT) w3m dump (dl.acm.org)
| andrybak wrote:
| > We have used IPGs to specify a number of file formats including
| ZIP, ELF, GIF, PE, and part of PDF
|
| For PDF, that's fair. Video "Types of PDF - Computerphile" covers
| this: https://www.youtube.com/watch?v=K7oxZCgO1dY
| jgalt212 wrote:
| I'll watch any and all Professor Brailsford videos.
| quotemstr wrote:
| > ZIP files that are prefixed by random garbage can still be
| extracted by unzip but fail to be recognized by a parser that
| conforms to the format specification
|
| To be fair, the ability to stick a ZIP file at the end of any
| other kind of file enables all sorts of neat tricks (like the old
| self-extracting zips).
| userbinator wrote:
| That's because zip files are read from the end.
| FreakLegion wrote:
| And this is in fact what the spec lays out, contrary to the
| quote from the paper. The PK header is a convention.
| Conforming parsers don't require it, but lazy implementations
| often do. This has led to more than one security incident
| over the years.
| bloatfish wrote:
| Yeah and PK is the signature _per record_ - it's not a
| file header. Did these guys read the format specification
| at all?
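The "read from the end" behaviour described above can be sketched in a few lines. This is a minimal illustration, not how unzip or any particular library is implemented: it assumes a single-disk, non-ZIP64 archive whose central directory sits directly before the end-of-central-directory (EOCD) record. The names `find_eocd` and `describe` are hypothetical helpers introduced only for this example.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature
EOCD_MIN = 22             # size of the fixed part of the EOCD record
MAX_COMMENT = 0xFFFF      # the trailing comment length is a 16-bit field


def find_eocd(data: bytes) -> int:
    """Return the offset of the EOCD record, searching backward from the end."""
    if len(data) < EOCD_MIN:
        raise ValueError("too short to hold an EOCD record")
    lowest = max(0, len(data) - EOCD_MIN - MAX_COMMENT)
    # Only accept candidates that leave room for the 22 fixed bytes.
    pos = data.rfind(EOCD_SIG, lowest, len(data) - EOCD_MIN + 4)
    while pos != -1:
        (comment_len,) = struct.unpack_from("<H", data, pos + 20)
        # A genuine EOCD record plus its comment ends exactly at end of file.
        if pos + EOCD_MIN + comment_len == len(data):
            return pos
        pos = data.rfind(EOCD_SIG, lowest, pos + 3)
    raise ValueError("no end-of-central-directory record found")


def describe(data: bytes) -> tuple[int, int]:
    """Return (entry count, number of bytes prepended before the archive)."""
    pos = find_eocd(data)
    total, cd_size, cd_offset = struct.unpack_from("<HII", data, pos + 10)
    # Assumption: the central directory ends where the EOCD record begins,
    # so any gap between its recorded and actual offset is prepended data.
    prefix = pos - cd_size - cd_offset
    return total, prefix


# Build a tiny archive in memory, then prepend unrelated bytes to it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")
archive = buf.getvalue()
junk = b"#!/bin/sh\nexit 0\n"

print(describe(archive))
print(describe(junk + archive))
```

Nothing in this sketch looks at the first bytes of the file, so prepended data does not stop the archive from being located; it only shifts the offsets, and the shift can be recovered from the EOCD record.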
| pointlessone wrote:
| They definitely did not implement PDF parsing, even a subset of
| it. They make some assumptions that will definitely result in
| incorrect parsing. For instance, they assume objects are tightly
| packed. They're not required to be. They should be to save space but
| are not required to. Moreover, it is possible to place objects
| inside other objects. It's not advised but not prohibited. As far
| as I can tell this is where their PDF parsing ends. They don't
| parse the objects themselves (neither regular objects nor stream
| objects). So they've chosen PDF "because it is the most
| complicated format to our knowledge" but ended up just
| (incorrectly) chunking the stream by offset table.
| jahewson wrote:
| Yeah this work is far away from what a real PDF parser
| requires. It's not uncommon for the lengths at the beginning of
| streams to be wrong or the data to be encoded in a format
| different from the one claimed. The offset table can also be
| wrong or missing.
| pointlessone wrote:
| Malformed files are a whole other can of worms that a good
| parser should know how to deal with, but here it isn't even
| format compliant.
|
| I think they wanted to demonstrate that their work can slice
| a stream by offset table, in a declarative fashion. It is a
| useful property. I think they would've done better to pick
| OTF/TTF to demonstrate this particular feature.
| airstrike wrote:
| Sounds to me like that's more of an issue with the PDF
| specification than with the work presented in the paper, in
| which case that's hardly the metric by which we should measure
| its merit.
| pointlessone wrote:
| I'm not saying PDF is a good format. I'm pointing out that
| they've made a poor choice going for PDF. There are other
| formats they could've used to demonstrate this specific
| technique. Like OTF/TTF which is a more traditional binary
| format with a whole range of approaches, including offset
| tables.
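To make the OTF/TTF suggestion concrete, here is a rough sketch of slicing a font by its table directory. It is not taken from the paper. It assumes a single font (not a collection) with the usual layout: a 12-byte header followed by 16-byte table records, all big-endian. The function name `read_table_directory` and the toy font bytes are made up for illustration, and the checksums are left at zero because the sketch does not verify them.

```python
import struct


def read_table_directory(font: bytes) -> dict[str, tuple[int, int]]:
    """Map each table tag to the (offset, length) interval it occupies."""
    # Header: sfntVersion (u32), numTables (u16), then three u16 search hints.
    _sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
    tables = {}
    for i in range(num_tables):
        # Record: tag (4 bytes), checksum (u32), offset (u32), length (u32).
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i
        )
        if offset + length > len(font):
            raise ValueError(f"table {tag!r} runs past the end of the file")
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


# A toy two-table "font": real tables would hold structured data, not text.
header = struct.pack(">IHHHH", 0x00010000, 2, 32, 1, 0)
records = struct.pack(">4sIII", b"head", 0, 44, 4)
records += struct.pack(">4sIII", b"name", 0, 48, 8)
font = header + records + b"HEAD" + b"NAMEDATA"

tables = read_table_directory(font)
for tag, (offset, length) in tables.items():
    print(tag, font[offset:offset + length])
```

Each table is reached only through its recorded offset and length, which is the "slice a stream by offset table" property discussed above, without PDF's extra freedom in where objects may sit.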
| revskill wrote:
| How about MS Office documents?
| tithe wrote:
| DOCX, PPTX, and XLSX Microsoft Office files are actually ZIP
| archives (which the paper addresses). You can append a ".zip"
| extension onto the end of them and explore.
|
| https://en.wikipedia.org/wiki/Office_Open_XML
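The point that these files are ZIP archives can be checked with nothing but the Python standard library. The sketch below is a heuristic, not a validator: it only tests that the bytes form a ZIP archive containing a `[Content_Types].xml` entry, which Office Open XML packages carry. `looks_like_ooxml` is a hypothetical helper, and the in-memory package is a stand-in rather than a document Word would open.

```python
import io
import zipfile


def looks_like_ooxml(data: bytes) -> bool:
    """Heuristic: a ZIP archive that contains a [Content_Types].xml entry."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return "[Content_Types].xml" in zf.namelist()


def make_zip(entries: dict[str, str]) -> bytes:
    """Build a small ZIP archive in memory from a name -> text mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return buf.getvalue()


# Stand-in package: the entry names mimic a DOCX, the contents do not.
fake_docx = make_zip({
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document/>",
})
plain_zip = make_zip({"notes.txt": "just a regular archive"})

print(looks_like_ooxml(fake_docx))
print(looks_like_ooxml(plain_zip))
```

On a real document the same check would be run on the file's bytes, for example `looks_like_ooxml(open(path, "rb").read())`, where `path` is whatever file you want to inspect.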
| jahewson wrote:
| The old office binary formats are basically a FAT file system
| containing streams of unremarkable records. Knowing what those
| records do is the hard part!
| aappleby wrote:
| Is this really a new thing? It feels like they've just crammed a
| sliver of the same bog-standard parsing we've been doing for
| decades back into the CFG.
|
| I guess that's good for preventing off-by-one-based parsing
| errors, but surely there's prior art from long ago.
___________________________________________________________________
(page generated 2024-08-10 23:00 UTC)