[HN Gopher] Kaitai Struct: A new way to develop parsers for bina...
___________________________________________________________________
Kaitai Struct: A new way to develop parsers for binary structures
Author : marcodiego
Score : 74 points
Date : 2022-03-17 20:13 UTC (2 hours ago)
(HTM) web link (kaitai.io)
| kangalioo wrote:
| There's also Wuffs, a safe and fast programming language made by
| Google specifically for decoding and encoding file formats:
| https://github.com/google/wuffs
|
| Paired with the C FFI available in most languages, this seems
| like the nicer solution. It's simpler than generating code for a
| bunch of high-level languages, and more performant.
| layer8 wrote:
| Not for managed environments like client-side JS, JVM, .NET,
| ...
| jmgao wrote:
| This appears to just allow you to parse binary formats into the
| represented fields. (Not that that's not extremely useful; doing
| this in managed languages is generally a giant pain in the ass!)
|
| Wuffs is much more powerful: it's essentially a safe C dialect
| that compiles to C and lets you write an entire codec while
| knowing that there aren't any overflows.
| eesmith wrote:
| How much of a future should I expect for Wuffs?
|
| The linked-to page says: "Version 0.2. The API and ABI aren't
| stabilized yet. The compiler undoubtedly has bugs."
|
| There are not many recent commits, and mostly by one developer.
| secondcoming wrote:
| Interesting, but now you have to add in the possibility of having
| bugs in your YAML file. The YAML is probably less readable than
| the spec for the binary format itself.
|
| Looking at the code-gen for utf8_string [0], it's a case of
| 'thanks, but no thanks':
|
| > std::unique_ptr<std::vector<std::unique_ptr<utf8_codepoint_t>>>
| m_codepoints;
|
| This is a solution looking for a problem, but I bet it was fun to
| write.
|
| [0] https://formats.kaitai.io/utf8_string/cpp_stl_11.html
| asadawadia wrote:
| Great library. Too bad it only allows reading.
| ctoth wrote:
| If you're working in Python and need to write as well as read,
| check out Construct [0], which is also a declarative parser
| builder.
|
| [0]: https://construct.readthedocs.io/en/latest/intro.html
| CGamesPlay wrote:
| As a code generator, I guess this may be nice. It seems like a
| DSL like the Nom [0] API is more natural and expressive, though.
| I imagine you can hit the limits of expressiveness in YAML pretty
| quickly.
|
| [0] https://github.com/Geal/nom
| mturk wrote:
| Kaitai is a really great system, with an awesome WebIDE. At work
| we have just started a project to use it for astrophysics
| simulations and data from dark matter detectors, and one of my
| hobby projects is to use it to explore retro game data formats.
| jll29 wrote:
| Kudos, this is neat. I especially love the library of
| pre-existing descriptions, which helps me learn about the tool
| as well as about an abundance of file formats without wasting
| time re-engineering them.
|
| This is somewhat akin to ASN.1.
|
| My personal feature wish list:
|
| - support writing as well as reading;
|
| - support generating Rust, Julia and Swift code;
|
| - an upload button to let users add to a contrib/ folder of
| existing format descriptions.
| dhx wrote:
| I contributed a number of file formats a few years ago (and
| attempted numerous others) but ran into a number of problems with
| certain file formats:
|
| 1. It's not possible to read from the file until a multi-byte
| termination sequence is detected. [1]
|
| 2. You can't read sections of a file where the termination
| condition is the presence of a sequence of bytes denoting the
| next unrelated section of the file (and you don't want to
| consume/read these bytes). [2]
|
| 3. The WebIDE at the time couldn't handle very large file format
| specifications such as Photoshop (PSD) [3]
|
| 4. Files containing compressed or encrypted sections require a
| compression/encryption algorithm to be hardcoded into Kaitai
| struct libraries for each programming language it can output to.
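
Points 1 and 2 above describe reading up to a multi-byte terminator, with or without consuming it. As a rough illustration of that logic written by hand in plain Python (this is not Kaitai Struct code; the `read_until` helper and the sample bytes are hypothetical), a minimal sketch might look like this:

```python
import io

def read_until(stream, terminator, consume=True):
    """Return the bytes that precede a multi-byte terminator.

    With consume=False the stream is rewound to the start of the
    terminator, so the next section can still read it. This needs
    a seekable stream.
    """
    if not terminator:
        raise ValueError("terminator must be non-empty")
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("terminator not found before end of stream")
        buf += byte
        if buf.endswith(terminator):
            if not consume:
                stream.seek(-len(terminator), io.SEEK_CUR)
            return bytes(buf[:-len(terminator)])

# Hypothetical input: a header ended by CRLF CRLF, then a body that
# runs until the marker of the next section ("NEXT").
data = io.BytesIO(b"hello\r\n\r\nbodyNEXTpayload")
header = read_until(data, b"\r\n\r\n")              # terminator consumed
section = read_until(data, b"NEXT", consume=False)  # marker left in place
```

The byte-at-a-time loop keeps the sketch short; a real parser would likely read in larger chunks.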
|
| I particularly liked the WebIDE, as it makes it easy to get
| started and share results. I also liked how Kaitai Struct allows
| easy definition of constraints (simple ones at least) in the
| file format specification, so that you can say "this section of
| the file shall have a size not exceeding header.length * 2
| bytes".
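
To make the quoted constraint concrete, here is a hand-written Python sketch of the same check. The record layout (two little-endian 16-bit integers followed by the section bytes) and the `parse_record` name are assumptions for illustration only, not taken from any real format or from Kaitai Struct's generated code:

```python
import struct

def parse_record(blob):
    # Assumed layout: u2le header length, u2le section size, section bytes.
    header_length, section_size = struct.unpack_from("<HH", blob, 0)

    # The constraint from the comment: size must not exceed header.length * 2.
    limit = header_length * 2
    if section_size > limit:
        raise ValueError(f"section size {section_size} exceeds limit {limit}")

    section = blob[4:4 + section_size]
    if len(section) != section_size:
        raise ValueError("truncated section")
    return header_length, section
```

In a declarative specification the same rule would sit next to the field definition instead of being buried in parsing code, which is the appeal described above.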
|
| Some alternative binary file format specification tools, each
| with its own set of pros and cons:
|
| 1. 010 Editor [4]
|
| 2. Synalysis [5]
|
| 3. hachoir [6]
|
| 4. DFDL [7]
|
| [1] https://github.com/kaitai-io/kaitai_struct/issues/158
|
| [2] https://github.com/kaitai-io/kaitai_struct/issues/156
|
| [3]
| https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...
|
| [4] https://www.sweetscape.com/010editor/repository/templates/
|
| [5] https://github.com/synalysis/Grammars
|
| [6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser
|
| [7] https://github.com/DFDLSchemas/
| gigel82 wrote:
| Ugh, I wish I'd found this a couple of years ago, before
| hand-writing a Unity asset parser in Node.js for a hobby project
| (big/little-endian mixes, byte alignment, versioned header
| format, different compression algos, etc.).
| sidpatil wrote:
| This looks really cool! This would have been really useful to me
| a couple years ago.
| lpapez wrote:
| It was available a few years ago, and I found it very useful.
| neonsunset wrote:
| As far as the .NET implementation goes, it is _really bad_:
|
| - Very old and currently obsolete project target
|
| - As a result, does not use modern data types such as Span<T>
|
| - No utilisation of ArrayPool<T> which is important for things
| like serialisers where you expect to deal with buffers a lot
|
| - Appears to be a blind Java port, judging by the code style
|
| This is not acceptable when working with low-level and binary
| structures which this standard is focused on. Yes, I know, this
| is an OSS project and therefore instead of complaining here I
| should have been working on contributing a PR to fix those
| issues. However, my main concern is that this standard and set of
| libraries in the current form work against the performance-
| sensitive nature of working with binary data.
| imglorp wrote:
| Erlang got this right: for the narrow case of packets
| in/mangle/out, described like an RFC bit-field diagram, it was
| very clean and simple.
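
For readers who have not seen the "packets in/mangle/out" pattern, the following Python sketch decodes the first 32-bit word of an IPv4 header into its bit fields, changes one, and re-encodes it. In Erlang this would be a single bit-syntax pattern along the lines of `<<Version:4, IHL:4, ...>>`; the Python below is only an approximation of the idea, and the `decode` and `encode` helpers are hypothetical names:

```python
import struct

def decode(word):
    # First 32 bits of an IPv4 header, network byte order:
    # version(4) | IHL(4) | DSCP(6) | ECN(2) | total length(16)
    b0, b1, total_length = struct.unpack("!BBH", word)
    return {
        "version": b0 >> 4,
        "ihl": b0 & 0x0F,
        "dscp": b1 >> 2,
        "ecn": b1 & 0x03,
        "total_length": total_length,
    }

def encode(fields):
    # Inverse of decode: pack the bit fields back into four bytes.
    b0 = (fields["version"] << 4) | fields["ihl"]
    b1 = (fields["dscp"] << 2) | fields["ecn"]
    return struct.pack("!BBH", b0, b1, fields["total_length"])

packet = bytes([0x45, 0x00, 0x00, 0x54])  # in
fields = decode(packet)
fields["dscp"] = 46                       # mangle
out = encode(fields)                      # out
```

The manual shifting and masking here is the part that a bit-level pattern syntax removes.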
| renewiltord wrote:
| Seems rather well designed actually. Appears that you can even
| use length-delimited lists and stuff. I like it. I have a project
| where we have a compact binary encoding and I have to write
| documentation _and_ serde for it. This works for docs and
| deserialization so that's good. I understand why serialization
| isn't supported but I feel like there's probably a clever API
| that allows inserting your own ser in. We'll see. I might switch
| our internal thing this weekend to it.
|
| Would be cool if you could generate a protocol diagram from this.
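
As an illustration of the length-delimited lists mentioned in the comment above, and of what a hand-written reader and writer pair for them looks like, here is a small Python sketch. The layout (a one-byte item count, then each item as a big-endian 16-bit length followed by its bytes) and the function names are assumptions made for this example, not Kaitai Struct output:

```python
import struct

def read_list(blob, offset=0):
    # Assumed layout: u1 count, then per item: u2be length + payload.
    (count,) = struct.unpack_from("!B", blob, offset)
    offset += 1
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("!H", blob, offset)
        offset += 2
        item = blob[offset:offset + length]
        if len(item) != length:
            raise ValueError("truncated item")
        items.append(item)
        offset += length
    return items, offset

def write_list(items):
    # Mirror image of read_list, written by hand.
    out = bytearray(struct.pack("!B", len(items)))
    for item in items:
        out += struct.pack("!H", len(item)) + item
    return bytes(out)

encoded = write_list([b"ab", b"", b"xyz"])
decoded, end = read_list(encoded)
```

Keeping the reader and writer in sync by hand like this is the duplication that a single declarative description would aim to avoid.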
___________________________________________________________________
(page generated 2022-03-17 23:00 UTC)