[HN Gopher] Interval parsing grammars for file format parsing (2...
       ___________________________________________________________________
        
       Interval parsing grammars for file format parsing (2023)
        
       Author : fanf2
       Score  : 82 points
       Date   : 2024-08-10 17:42 UTC (5 hours ago)
        
 (HTM) web link (dl.acm.org)
 (TXT) w3m dump (dl.acm.org)
        
       | andrybak wrote:
       | > We have used IPGs to specify a number of file formats including
       | ZIP, ELF, GIF, PE, and part of PDF
       | 
       | For PDF, that's fair. Video "Types of PDF - Computerphile" covers
       | this: https://www.youtube.com/watch?v=K7oxZCgO1dY
        
         | jgalt212 wrote:
         | I'll watch any and all Professor Brailsford videos.
        
       | quotemstr wrote:
       | > ZIP files that are prefixed by random garbage can still be
       | extracted by unzip but fail to be recognized by a parser that
       | conforms to the format specification
       | 
       | To be fair, the ability to stick a ZIP file at the end of any
       | other kind of file enables all sorts of neat tricks (like the old
       | self-extracting zips).
        
         | userbinator wrote:
         | That's because zip files are read from the end.
        
           | FreakLegion wrote:
           | And this is in fact what the spec lays out, contrary to the
           | quote from the paper. The PK header is a convention.
           | Conforming parsers don't require it, but lazy implementations
           | often do. This has led to more than one security incident
           | over the years.
        
             | bloatfish wrote:
             | Yeah and PK is the signature _per record_ - it's not a
             | file header. Did these guys read the format specification
             | at all?
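
A minimal sketch of the "read from the end" behaviour discussed above, assuming a plain (non-ZIP64) archive with no trailing comment tricks. The helper names `find_eocd` and `leading_junk` are hypothetical; only the standard `zipfile`, `struct`, and `io` modules are used. It locates the end-of-central-directory (EOCD) record by searching backward, then compares where the central directory actually sits with where the EOCD says it should be. The difference is whatever was stuck in front of the archive.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
EOCD_FIXED = 22            # size of the EOCD record without its comment
MAX_COMMENT = 0xFFFF       # the comment length field is 16 bits


def find_eocd(data: bytes) -> int:
    """Return the offset of the last EOCD signature near the end of data."""
    window_start = max(0, len(data) - (EOCD_FIXED + MAX_COMMENT))
    pos = data.rfind(EOCD_SIG, window_start)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    return pos


def leading_junk(data: bytes) -> int:
    """Estimate how many bytes were prepended in front of the archive."""
    pos = find_eocd(data)
    # Fixed EOCD layout: signature, 4 x uint16, 2 x uint32, comment length.
    _sig, _disk, _cd_disk, _n_here, _n_total, cd_size, cd_offset, _clen = (
        struct.unpack_from("<4s4H2LH", data, pos)
    )
    # Without a prefix the central directory ends exactly where the EOCD
    # starts, so any surplus is data sitting in front of the archive.
    return pos - cd_size - cd_offset


# Build a tiny archive in memory, then prepend some unrelated bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")
archive = buf.getvalue()
prefixed = b"random garbage" + archive

print(leading_junk(archive), leading_junk(prefixed))
```

A reader that starts at byte 0 and insists on a `PK` signature there would reject `prefixed`; one that starts from the EOCD can still find every entry by adding the computed prefix length to the recorded offsets.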
        
       | pointlessone wrote:
       | They definitely did not implement PDF parsing, even a subset of
       | it. They make some assumptions that will definitely result in
       | incorrect parsing. For instance, they assume objects are tightly
       | packed. They're not required to be. They should be to save space but
       | are not required to. Moreover, it is possible to place objects
       | inside other objects. It's not advised but not prohibited. As far
       | as I can tell this is where their PDF parsing ends. They don't
       | parse the objects themselves (neither regular objects nor stream
       | objects). So they've chosen PDF "because it is the most
       | complicated format to our knowledge" but ended up just
       | (incorrectly) chunking the stream by offset table.
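
To illustrate the tight-packing point, here is a rough sketch on a made-up byte string (not a valid PDF). The function names and the sample data are hypothetical. Slicing from one offset to the next only works if nothing sits between objects; stopping at each object's own `endobj` keyword copes with gaps, though a real parser would have to tokenize instead, since stream data can itself contain the bytes `endobj`.

```python
def slice_by_offsets(data: bytes, offsets: list[int]) -> list[bytes]:
    """Naive chunking: each object runs until the next offset (or EOF)."""
    bounds = sorted(offsets) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


def slice_by_endobj(data: bytes, offsets: list[int]) -> list[bytes]:
    """Each object runs from its offset to the first 'endobj' after it."""
    keyword = b"endobj"
    chunks = []
    for off in sorted(offsets):
        end = data.find(keyword, off)
        if end < 0:
            raise ValueError(f"object at offset {off} has no endobj")
        chunks.append(data[off:end + len(keyword)])
    return chunks


# Two objects separated by filler that no offset table entry points at.
obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj"
gap = b"\n% unreferenced filler\n"
obj2 = b"2 0 obj\n(hello)\nendobj"
data = obj1 + gap + obj2
offsets = [0, len(obj1) + len(gap)]

naive = slice_by_offsets(data, offsets)
careful = slice_by_endobj(data, offsets)
print(naive[0] == obj1, careful[0] == obj1)
```

The naive slice of the first object drags the filler along with it, which is the kind of incorrect chunk the comment above describes.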
        
         | jahewson wrote:
         | Yeah this work is far away from what a real PDF parser
         | requires. It's not uncommon for the lengths at the beginning of
         | streams to be wrong or the data to be encoded in a format
         | different from the one claimed. The offset table can also be
         | wrong or missing.
        
           | pointlessone wrote:
           | Malformed files are a whole other can of worms that a good
           | parser should know how to deal with, but here it isn't even
           | format compliant.
           | 
           | I think they wanted to demonstrate that their work can slice
           | a stream by offset table, in a declarative fashion. It is a
           | useful property. I think they would've done better to pick
           | OTF/TTF to demonstrate this particular feature.
        
         | airstrike wrote:
         | Sounds to me like that's more of an issue with the PDF
         | specification than with the work presented in the paper, in
         | which case that's hardly the metric by which we should measure
         | its merit.
        
           | pointlessone wrote:
           | I'm not saying PDF is a good format. I'm pointing out that
           | they've made a poor choice going for PDF. There are other
           | formats they could've used to demonstrate this specific
           | technique. Like OTF/TTF which is a more traditional binary
           | format with a whole range of approaches, including offset
           | tables.
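
For comparison, a sketch of what slicing by offset table looks like for OTF/TTF, assuming the usual sfnt layout: a 12-byte header followed by 16-byte table records (tag, checksum, offset, length), all big-endian. The function name and the synthetic "font" are hypothetical, and checksums are not verified here.

```python
import struct


def parse_table_directory(font: bytes) -> dict[str, tuple[int, int]]:
    """Map each table tag to its (offset, length) within the font bytes."""
    _sfnt_version, num_tables = struct.unpack_from(">IH", font, 0)
    tables = {}
    for i in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i
        )
        # Refuse records whose interval falls outside the file.
        if offset + length > len(font):
            raise ValueError(f"table {tag!r} extends past end of file")
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


# Synthetic two-table file: header (12) + 2 records (32) + table bodies.
header = struct.pack(">IHHHH", 0x00010000, 2, 0, 0, 0)
records = (
    struct.pack(">4sIII", b"head", 0, 44, 4)
    + struct.pack(">4sIII", b"glyf", 0, 48, 6)
)
font = header + records + b"HEAD" + b"GLYPHS"

tables = parse_table_directory(font)
off, length = tables["glyf"]
print(font[off:off + length])
```

Each table is addressed by an explicit (offset, length) pair, so there is no need to guess where one chunk ends and the next begins.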
        
       | revskill wrote:
       | How about MS Office documents?
        
         | tithe wrote:
         | DOCX, PPTX, and XLSX Microsoft Office files are actually ZIP
         | archives (which the paper addresses). You can append a ".zip"
         | extension onto the end of them and explore.
         | 
         | https://en.wikipedia.org/wiki/Office_Open_XML
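
The same exploration can be done without renaming anything. A small sketch using Python's standard `zipfile` module; the in-memory archive below only imitates the part names of a .docx and is not a valid Word document, and `list_parts` is a hypothetical helper.

```python
import io
import zipfile


def list_parts(blob: bytes) -> list[str]:
    """Return the member names of a ZIP-based container such as .docx."""
    if not zipfile.is_zipfile(io.BytesIO(blob)):
        raise ValueError("not a ZIP container")
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


# Stand-in for a real file: two placeholder parts with typical names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
fake_docx = buf.getvalue()

print(list_parts(fake_docx))
```

For a file on disk, passing the bytes of the .docx, .pptx, or .xlsx to `list_parts` should list its XML parts in the same way.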
        
         | jahewson wrote:
         | The old office binary formats are basically a FAT file system
         | containing streams of unremarkable records. Knowing what those
         | records do is the hard part!
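
A hedged sketch of the first step into such a container, assuming the header layout I recall from the Compound File Binary format: an 8-byte signature, a 16-byte CLSID, then little-endian 16-bit fields for minor version, major version, byte order, sector shift, and mini sector shift. The function name and the synthetic header are hypothetical; check the field offsets against the specification before relying on them.

```python
import struct

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(header: bytes) -> dict[str, int]:
    """Read a few fixed fields from a 512-byte compound file header."""
    if len(header) < 512 or header[:8] != CFB_MAGIC:
        raise ValueError("not a compound file header")
    # Assumed layout: five uint16 fields starting at byte 24.
    _minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }


# Synthetic version-3 header, zero-padded to 512 bytes.
sample = (
    CFB_MAGIC + bytes(16) + struct.pack("<5H", 0x003E, 3, 0xFFFE, 9, 6)
).ljust(512, b"\x00")

print(read_cfb_header(sample))
```

This only gets as far as the sector sizes. Walking the allocation table to reach the streams, and then making sense of the records inside them, is the part the comment above calls hard.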
        
       | aappleby wrote:
       | Is this really a new thing? It feels like they've just crammed a
       | sliver of the same bog-standard parsing we've been doing for
       | decades back into the CFG.
       | 
       | I guess that's good for preventing off-by-one-based parsing
       | errors, but surely there's prior art from long ago.
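
One way to picture the off-by-one protection mentioned above is a hand-rolled sketch in which every sub-parser receives an explicit interval and cannot read outside it. This is not the paper's IPG notation; the `Interval` class and `parse_record` are hypothetical and only mimic the general idea of attaching intervals to what each rule may consume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    data: bytes
    lo: int  # inclusive
    hi: int  # exclusive

    def __len__(self) -> int:
        return self.hi - self.lo

    def u8(self, at: int) -> int:
        """Read one byte at a position relative to this interval."""
        if not 0 <= at < len(self):
            raise ValueError(f"read at {at} outside interval of length {len(self)}")
        return self.data[self.lo + at]

    def sub(self, lo: int, hi: int) -> "Interval":
        """Narrow to [lo, hi), relative to this interval."""
        if not 0 <= lo <= hi <= len(self):
            raise ValueError(f"sub-interval [{lo}, {hi}) escapes its parent")
        return Interval(self.data, self.lo + lo, self.lo + hi)

    def raw(self) -> bytes:
        return self.data[self.lo:self.hi]


def parse_record(iv: Interval) -> tuple[bytes, Interval]:
    """Parse one length-prefixed record; return (payload, remaining interval)."""
    n = iv.u8(0)
    payload = iv.sub(1, 1 + n)      # fails if the length overruns the interval
    rest = iv.sub(1 + n, len(iv))
    return payload.raw(), rest


payload, rest = parse_record(Interval(b"\x03abcXY", 0, 6))
print(payload, rest.raw())
```

A length byte that claims more data than the enclosing interval holds raises an error at the `sub` call, instead of silently reading into a neighbouring record.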
        
       ___________________________________________________________________
       (page generated 2024-08-10 23:00 UTC)