hngopher.com

       [HN Gopher] On File Formats
       ___________________________________________________________________
        
       On File Formats
        
       Author : ibobev
       Score  : 104 points
       Date   : 2025-05-21 07:48 UTC (4 days ago)
        
 (HTM) web link (solhsa.com)
 (TXT) w3m dump (solhsa.com)
        
       | adelpozo wrote:
       | I would add: make it streamable, or at least allow it to be read
       | remotely efficiently.
        
         | flowerthoughts wrote:
         | Agreed on that one. With a nice file format, streamable is
         | hopefully just a matter of ordering things appropriately once
         | you know the sizes of the individual chunks. You want to write
         | the index last, but you want to read it first. Perhaps you want
         | the most influential values first if you're building something
         | progressive (level-of-detail split.)
         | 
         | Similar is the discussion of delimited fields vs. length
         | prefix. Delimited fields are nicer to write, but length
         | prefixed fields are nicer to read. I think most new formats use
         | length prefixes, so I'd start there. I wrote a blog post about
         | combining the value and length into a VLI that also handles
         | floating point and bit/byte strings:
         | https://tommie.github.io/a/2024/06/small-encoding
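
The length-prefix idea above can be sketched in a few lines. This is a minimal illustration, assuming a LEB128-style varint (7 payload bits per byte) as the length prefix; it is not the encoding from the linked blog post, and the helper names are made up for the example.

```python
import io

def write_varint(out, n):
    """Write a non-negative integer as a LEB128-style varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))  # high bit set: more bytes follow
        else:
            out.write(bytes([byte]))         # high bit clear: last byte
            return

def read_varint(inp):
    """Read one varint written by write_varint."""
    result = 0
    shift = 0
    while True:
        b = inp.read(1)
        if not b:
            raise EOFError("truncated varint")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7

def write_field(out, payload):
    """Length-prefixed field: varint length, then the raw bytes."""
    write_varint(out, len(payload))
    out.write(payload)

def read_field(inp):
    """The reader knows the size up front, so no delimiter scanning is needed."""
    size = read_varint(inp)
    payload = inp.read(size)
    if len(payload) != size:
        raise EOFError("truncated field")
    return payload

buf = io.BytesIO()
for item in (b"hello", b"", b"x" * 300):
    write_field(buf, item)
buf.seek(0)
fields = [read_field(buf) for _ in range(3)]
```

The writer has to know each length before emitting the field, which is the "nicer to read than to write" trade-off mentioned above.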
        
           | lifthrasiir wrote:
           | I don't think a single encoding is generally useful. A good
           | encoding for given application would depend on the value
           | distribution and neighboring data. For example any variable-
           | length scalar encoding would make vectorization much harder.
        
       | mjevans wrote:
       | Most of that's pretty good.
       | 
       | Compression: For anything that ends up large it's probably
       | desired. Though consider both algorithm and 'strength' based on
       | the use case carefully. Even a simple algorithm might make things
       | faster when it comes time to transfer or write to permanent
       | storage. A high cost search to squeeze out yet more redundancy is
       | probably worth it if something will be copied and/or decompressed
       | many times, but might not be worth it for that locally compiled
       | kernel you'll boot at most 10 times before replacing it with
       | another.
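
A rough way to see the algorithm/strength trade-off is to compress the same data at two zlib levels and compare size and time. This is only a sketch: the payload is a made-up repetitive sample, and real numbers depend heavily on the data and the machine.

```python
import time
import zlib

# Hypothetical payload: repetitive text, the kind that compresses well.
data = b"kernel build output line\n" * 4000

def compress_timed(payload, level):
    """Return (compressed bytes, seconds spent) for one zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(payload, level)
    return packed, time.perf_counter() - start

fast, fast_seconds = compress_timed(data, 1)    # cheap search
small, small_seconds = compress_timed(data, 9)  # more expensive search

print(len(data), len(fast), len(small))
print(f"level 1: {fast_seconds:.6f}s, level 9: {small_seconds:.6f}s")
```

Whether the extra time at the higher level pays off depends on how often the result is copied or decompressed, as the comment above argues.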
        
       | thasso wrote:
       | For archive formats, or anything that has a table of contents or
       | an index, consider putting the index at the end of the file so
       | that you can append to it without moving a lot of data around.
       | This also allows for easy concatenation.
        
         | charcircuit wrote:
         | Why not put it at the beginning, so that it is available at the
         | start of the file stream? That way it is easier to fetch first,
         | so you know what other ranges of the file you may need.
         | 
         | >This also allows for easy concatenation.
         | 
         | How would it be easier than putting it at the front?
        
           | lifthrasiir wrote:
           | If the archive is being updated in place, turning ABC# into
           | ABCD#' (where # and #' are indices) is easier than turning
           | #ABC into #'ABCD. The actual position of indices doesn't
           | matter much if the stream is seekable. I don't think the
           | concatenation is a good argument though.
        
           | shakna wrote:
           | Files are... Flat streams. Sort of.
           | 
           | So if you rewrite an index at the head of the file, you may
           | end up having to rewrite everything that comes afterwards, to
           | push it further down in the file, if it overflows any padding
           | offset. Which makes appending an extremely slow operation.
           | 
           | Whereas seeking to end, and then rewinding, is not nearly as
           | costly.
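
The ABC# to ABCD#' update can be sketched with a toy container. Everything here is an assumption made for illustration: the layout (payloads, then a JSON index, then an 8-byte footer holding the index offset) and the function names are hypothetical, not an existing format.

```python
import io
import json
import struct

FOOTER = struct.Struct("<Q")  # last 8 bytes: offset where the index starts

def read_index(f):
    """Seek to the end, rewind to the footer, then load the index."""
    f.seek(-FOOTER.size, io.SEEK_END)
    (index_offset,) = FOOTER.unpack(f.read(FOOTER.size))
    index_end = f.tell() - FOOTER.size
    f.seek(index_offset)
    index = json.loads(f.read(index_end - index_offset))
    return index, index_offset

def write_tail(f, index):
    """Write index + footer at the current position and cut off anything after."""
    index_offset = f.tell()
    f.write(json.dumps(index).encode("utf-8"))
    f.write(FOOTER.pack(index_offset))
    f.truncate()

def create(f):
    """Start an empty container: just an empty index and its footer."""
    f.seek(0)
    write_tail(f, {})

def append(f, name, payload):
    """Overwrite the old index with the new payload, then rewrite the index."""
    index, index_offset = read_index(f)
    f.seek(index_offset)           # existing payloads before this are untouched
    f.write(payload)
    index[name] = [index_offset, len(payload)]
    write_tail(f, index)

def read(f, name):
    """Random access to one payload via the trailing index."""
    index, _ = read_index(f)
    offset, size = index[name]
    f.seek(offset)
    return f.read(size)
```

Appending only rewrites the index and footer; with the index at the front, the same operation would have to shift every payload once the index outgrows its reserved space.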
        
             | charcircuit wrote:
             | Most workflows do not modify files in place but rather
             | create new files as it's safer and allows you to go back to
             | the original if you made a mistake.
        
               | shakna wrote:
               | If you're writing twice, you don't care about the
               | performance to begin with. Or the size of the files being
               | produced.
               | 
               | But if you're writing indices, there's a good chance that
               | you do care about performance.
        
               | charcircuit wrote:
               | Files are often authored once and read / used many times.
               | When authoring a file, performance is less important and
               | there is plenty of file space available. Indices exist to
               | make using the file fast, which is more important than
               | making authoring fast.
        
               | shakna wrote:
               | If storage and performance aren't a concern when writing,
               | then you probably shouldn't be doing workarounds to
               | include the index in the file itself. Follow the dbm
               | approach and separate both into two different files.
               | 
               | Which is what dbm, bdb, Windows search indexes, IBM
               | datasets, and so many, many other standards will do.
        
               | charcircuit wrote:
               | Separate files isn't always the answer. It can be more
               | awkward to need to download both and always keep them
               | together compared to when it's a single file.
        
             | PhilipRoman wrote:
             | You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and
             | FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot
             | of limitations and are not portable. Based on discussions
             | I've read, it seems there is no real motivation for
             | implementing support for it, since anyone who cares about
             | the performance of doing this will use some DB format
             | anyway.
             | 
             | In theory, files should be just unrolled linked lists (or
             | trees) of bytes, but I guess a lot of internal code still
             | assumes full, aligned blocks.
        
           | McGlockenshire wrote:
           | > How would it be easier than putting it at the front?
           | 
           | Have you ever wondered why `tar` is the Tape Archive? Tape.
           | Magnetic recording tape. You stream data to it, and rewinding
           | is Hard, so you put the list of files you just dealt with at
           | the very end. This now-obsolete hardware expectation touches
           | us decades later.
        
             | jclulow wrote:
             | tar streams don't have an index at all, actually, they're
             | just a series of header blocks and data blocks. Some backup
             | software built on top may include a catalog of some kind
             | inside the tar stream itself, of course, and may choose to
             | do so as the last entry.
        
               | ahoka wrote:
               | IIRC, the original TAR format was just writing the
               | 'struct stat' from sys/stat.h, followed by the file
               | contents for each file.
        
             | charcircuit wrote:
             | But new file formats being developed are most likely not
             | going to be designed to be used with tapes. If you want to
             | avoid rewinds you can write a new concatenated version of
             | the files. This also allows you to keep the original in
             | case you need it.
        
           | MattPalmer1086 wrote:
           | Imagine you have a 12 GB zip file, and you want to add one
           | more file to it. Very easy and quick if the index is at the
           | end, very slow if it's at the start (assuming your index now
           | needs more space than is available currently).
           | 
           | Reading the index from the end of the file is also quick;
           | where you read next depends on what you are trying to find in
           | it, which may not be the start.
        
             | orphea wrote:
             | Some formats are meant to be streamable. And if the stream
             | is not seekable, then you have to read all 12 GB before you
             | get to the index.
             | 
             | The point is, not all is black and white. Where to put the
             | index is just another trade off.
        
               | strogonoff wrote:
               | Different trade-offs are why it might make sense to
               | embrace the Unix way for file formats: do one thing well,
               | and document it so that others can do a different thing
               | well with the same data and no loss.
               | 
               | For example, if it is an archival/recording oriented use
               | case, then you make it cheap/easy to add data and
               | possibly add some resiliency for when the recording process
               | crashes. If you want efficient random access, streaming,
               | storage efficiency, the same dataset can be stored in a
               | different layout without loss of quality--and conversion
               | between them doesn't have to be extremely optimal, it
               | just should be possible to implement from spec.
               | 
               | Like, say, you record raw video. You want "all of the
               | quality" and you know all in all it's going to take
               | terabytes, so bringing excess capacity is basically a
               | given when shooting. Therefore, if some camera maker, in
               | its infinite wisdom, creates a proprietary undocumented
               | format to sliiightly improve on file size but
               | "accidentally" makes it unusable in most software without
               | first converting it using their own proprietary tool, you
               | may justifiably not appreciate it. (Canon Cinema Raw
               | Light HQ--I kid you not, that's what it's called--I'm
               | looking at you.)
               | 
               | On this note, what are the best/accepted approaches out
               | there when it comes to documenting/speccing out file
               | formats? Ideally something generalized enough that it can
               | also handle cases where the "file" is in fact a
               | particularly structured directory (a la macOS app
               | bundle).
        
               | trinix912 wrote:
               | Adding to the recording _raw_ video point, for such
               | purposes, try to design the format so that losing a
               | portion of the file doesn't render it entirely unusable.
               | Kinda like how you can recover DV video from spliced
               | tapes because the data for the current frame (+/- the
               | bordering frame) is enough to start a valid new file
               | stream.
        
               | MattPalmer1086 wrote:
               | Yes, a good point. Each file format must try to optimise
               | for the use cases it supports of course.
        
               | vrighter wrote:
               | make the index a linked data structure. You can then
               | extend it whenever, wherever
        
               | jonstewart wrote:
               | That's true, but streamable formats often don't need an
               | index.
               | 
               | A team member just created a new tool that uses the tar
               | format (streamable), but then puts the index as the
               | penultimate entry, with the last entry just being a fixed
               | size entry with the offset of the beginning of the index.
               | 
               | In this way normal tar tools just work but it's possible
               | to retrieve a listing and access a file randomly. It's
               | also still possible to append to it in the future, modulo
               | futzing with the index a bit.
               | 
               | (The intended purpose is archiving files that were stored
               | as S3 objects back into S3.)
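
A guess at how such a layout could work, using Python's tarfile module. The entry names, the JSON index, and the 512-byte trailer payload are all assumptions for illustration; the tool described above may do this quite differently. USTAR format is forced so that every header is exactly one 512-byte block.

```python
import io
import json
import tarfile

BLOCK = 512
INDEX_NAME = ".index.json"      # hypothetical name for the index entry
TRAILER_NAME = ".index-offset"  # hypothetical name for the fixed-size last entry

def _add(tar, name, payload):
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

def write_archive(buf, files):
    """Write members, then the index, then a fixed-size trailer entry."""
    index = {}
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in files.items():
            header_offset = buf.tell()
            _add(tar, name, payload)
            index[name] = [header_offset + BLOCK, len(payload)]  # data offset, size
        index_offset = buf.tell()
        _add(tar, INDEX_NAME, json.dumps(index).encode("utf-8"))
        # Trailer payload: the index offset as ASCII, space-padded to one block.
        _add(tar, TRAILER_NAME, str(index_offset).encode("ascii").ljust(BLOCK))

def read_index(buf):
    """Find the trailer from the end of the file without scanning the members."""
    pos = buf.seek(0, io.SEEK_END) - BLOCK
    while pos >= 0:
        buf.seek(pos)
        block = buf.read(BLOCK)
        if block.strip(b"\0"):   # first non-zero block from the end: trailer data
            break
        pos -= BLOCK             # skip the zero padding that ends a tar archive
    else:
        raise ValueError("no trailer entry found")
    index_offset = int(block.decode("ascii").strip())
    buf.seek(index_offset)
    header = tarfile.TarInfo.frombuf(buf.read(BLOCK), "utf-8", "surrogateescape")
    return json.loads(buf.read(header.size))

def read_member(buf, index, name):
    """Random access to one member using the offsets from the index."""
    offset, size = index[name]
    buf.seek(offset)
    return buf.read(size)
```

Ordinary tar readers see two extra members and otherwise work as usual, while an index-aware reader needs only a few reads near the end of the file.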
        
         | zzo38computer wrote:
         | What would probably allow for even easier concatenation is
         | to store the header of each file immediately preceding the
         | data of that file. You can build an index in memory when reading
         | the file if that is helpful for your use.
        
       | leiserfg wrote:
       | If binary, consider just using SQLite.
        
         | lifthrasiir wrote:
         | Using SQLite as a container format is only beneficial when the
         | file format itself is a composite, like word processor files
         | which will include both the textual data and any attachments.
         | SQLite is just a hindrance otherwise, like image file formats
         | or archival/compressed file formats [1].
         | 
         | [1] SQLite's own sqlar format is a bad idea for this reason.
        
           | frainfreeze wrote:
           | sqlar proved a great solution in the past for me. Where does
           | it fall short in your experience?
        
             | lifthrasiir wrote:
             | Unless you are using the container file as a database too,
             | sqlar is strictly inferior to ZIP in terms of pretty much
             | everything [1]. I'm actually more interested in the context in
             | which sqlar did prove useful for you.
             | 
             | [1] https://news.ycombinator.com/item?id=28670418
        
               | frainfreeze wrote:
               | I remember seeing the comment you linked a few years back,
               | and back then comments were already locked so I couldn't
               | reply, and this time I sadly don't have the time to get
               | deeper into this, however - I recommend you to research
               | more about sqlar/using sqlite db as _file format_ in
               | general, or at minimum looking at SQLite Encryption
               | Extension (SEE)
               | (https://www.sqlite.org/see/doc/trunk/www/readme.wiki).
               | You can get a lot out of the box with very little
               | investment. IMHO sqlar is not competing with ZIP (can zip
               | do metadata and transactions?)
        
           | SyrupThinker wrote:
           | From my own experience SQLite works just fine as the
           | container for an archive format.
           | 
           | It ends up having some overhead compared to established ones,
           | but the ability to query over the attributes of 10000s of
           | files is pretty nice, and definitely faster than the worst
           | case of tar.
           | 
           | My archiver could even keep up with 7z in some cases (for
           | size and access speed).
           | 
           | Implementing it is also not particularly tricky, and SQLite
           | even allows streaming the blobs.
           | 
           | Making readers for such a format seems more accessible to me.
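
As a rough illustration of querying over file attributes, here is a small sketch with Python's sqlite3 module. The table layout is an assumption for the example; it is not the sqlar schema and not the commenter's archiver.

```python
import sqlite3

def create_archive(path=":memory:"):
    """Open (or create) a container with one hypothetical 'files' table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY, mtime INTEGER, size INTEGER, data BLOB)"
    )
    return db

def add_file(db, name, mtime, payload):
    """Insert or replace one member inside a transaction."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
            (name, mtime, len(payload), payload),
        )

def larger_than(db, min_size):
    """Attribute query: answered from metadata, without reading any payload."""
    rows = db.execute(
        "SELECT name FROM files WHERE size > ? ORDER BY name", (min_size,)
    )
    return [name for (name,) in rows]

def read_file(db, name):
    """Fetch one member's bytes, or None if it does not exist."""
    row = db.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return None if row is None else row[0]
```

The same metadata query against a plain tar stream would mean walking every header in the file.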
        
             | lifthrasiir wrote:
             | SQLite format itself is not very simple, because it is a
             | database file format in its heart. By using SQLite you are
             | unknowingly constraining your use case; for example you can
             | indeed stream BLOBs, but you can't randomly access BLOBs
             | because the SQLite format puts a large BLOB into pages in a
             | linked list, at least when I checked last. And BLOBs are
             | limited in size anyway (2GB at most, AFAIK) so streaming itself
             | might not be that useful. The use of SQLite also means that
             | you have to bring SQLite into your code base, and SQLite is
             | not very small if you are just using it as a container.
             | 
             | > My archiver could even keep up with 7z in some cases (for
             | size and access speed).
             | 
             | 7z might feel slow because it enables solid compression by
             | default, which trades decompression speed for compression
             | ratio. I can't imagine 7z having a similar compression
             | ratio with correct options though, was your input
             | incompressible?
        
           | sureglymop wrote:
           | I think it's fine as an image format. I've used the mbtiles
           | format which is basically just a table filled with map tiles.
           | Sqlite makes it super easy to deal with it, e.g. to dump
           | individual blobs and save them as image files.
           | 
           | It just may not always be the most performant option. For
           | example, for map tiles there is alternatively the pmtiles
           | binary format which is optimized for http range requests.
        
           | InsideOutSanta wrote:
           | The Mac image editor Acorn uses SQLite as its file format.
           | It's described here:
           | 
           | https://shapeof.com/archives/2025/4/acorn_file_format.html
           | 
           | The author notes that an advantage is that other programs can
           | easily read the file format and extract information from it.
        
             | lifthrasiir wrote:
             | It is clearly a composite file format [1]:
             | 
             | > Acorn's native file format is used to losslessly store
             | layer data, editable text, layer filters, an optional
             | composite of the image, and various metadata. Its advantage
             | over other common formats such as PNG or JPEG is that it
             | preserves all this native information without flattening
             | the layer data or vector graphics.
             | 
             | As I've mentioned, this is a good use case for SQLite as a
             | container. But ZIP would work equally well here.
             | 
             | [1]
             | https://flyingmeat.com/acorn/docs/technotes/ACTN002.html
        
           | aidenn0 wrote:
           | Except image formats and archival formats _are_ composites
           | (data+metadata). We have Exif for images, and you might be
           | surprised by how much metadata the USTar format has.
        
         | paulddraper wrote:
         | Did you read the article?
         | 
         | That wouldn't support partial parsing.
        
       | shakna wrote:
       | Spent the weekend with an untagged chunked format, and... I
       | rather hate it.
       | 
       | A friend wanted a newer save viewer/editor for Dragonball
       | Xenoverse 2, because there's about a total of two, and they're
       | slow to update.
       | 
       | I thought it'd be fairly easy to spin up something to read it,
       | because I've spun up a bunch of save editors before, and they're
       | usually trivial.
       | 
       | XV2 save files change over versions. They're also just arrays of
       | structs [0], that don't properly identify themselves, so some
       | parts of them you're just guessing. Each chunk can also contain
       | chunks - some of which are actually a network request to get more
       | chunks from elsewhere in the codebase!
       | 
       | [0] Also encrypted before dumping to disk, but the keys have been
       | known since about the second release, and they've never switched
       | them.
        
       | lifthrasiir wrote:
       | Generally good points. Unfortunately existing file formats
       | rarely follow these rules. In fact these rules should form
       | naturally when you are dealing with many different file formats
       | anyway. Specific points follow:
       | 
       | - Agreed that human-readable formats have to be dead simple,
       | otherwise binary formats should be used. Note that textual
       | numbers are surprisingly complex to handle, so any formats with
       | significant number uses should just use binary.
       | 
       | - Chunking is generally good for structuring and incremental
       | parsing, but do not expect it to provide reorderability or
       | back/forward compatibility somehow. Unless explicitly designed,
       | they do not exist. Consider PNG for example; PNG chunks were
       | designed to be quite robust, but nowadays some exceptions [1] do
       | exist. Versioning is much more crucial for that.
       | 
       | [1] https://www.w3.org/TR/png/#animation-information
       | 
       | - Making a new file format from scratch is always difficult.
       | Already mentioned, but you should really consider using existing
       | file formats as a container first. Some formats are even
       | explicitly designed for this purpose, like sBOX [2] or RFC 9277
       | CBOR-labeled data tags [3].
       | 
       | [2] https://nothings.org/computer/sbox/sbox.html
       | 
       | [3] https://www.rfc-editor.org/rfc/rfc9277.html
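
For the chunking point, here is a minimal sketch of walking PNG-style chunks (4-byte big-endian length, 4-byte type, payload, CRC-32 over type and payload). The sample stream built at the bottom is only a chunk sequence for demonstration, not a decodable image, and the helper names are made up.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(kind, payload):
    """Build one chunk: length, type, payload, CRC-32 of type + payload."""
    crc = zlib.crc32(kind + payload)
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

def iter_chunks(raw):
    """Yield (type, payload, is_ancillary) for each chunk after the signature."""
    if raw[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    pos = 8
    while pos < len(raw):
        if pos + 8 > len(raw):
            raise ValueError("truncated chunk header")
        (length,) = struct.unpack_from(">I", raw, pos)
        kind = raw[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(raw):
            raise ValueError("truncated chunk")
        payload = raw[pos + 8:end]
        (crc,) = struct.unpack_from(">I", raw, end)
        if crc != zlib.crc32(kind + payload):
            raise ValueError("bad CRC in chunk %r" % kind)
        # Bit 5 of the first type byte (lowercase letter) marks an ancillary
        # chunk, which a reader that does not understand it may skip.
        yield kind, payload, bool(kind[0] & 0x20)
        pos = end + 4

# Demonstration stream: header, a text chunk, end marker (no image data).
raw = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"tEXt", b"Comment\0hi")
    + make_chunk(b"IEND", b"")
)
chunks = list(iter_chunks(raw))
```

The length prefix lets a reader skip unknown chunks, but nothing in this loop says whether chunks may be reordered; that has to come from the format's rules.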
        
         | mort96 wrote:
         | > Note that textual numbers are surprisingly complex to handle,
         | so any formats with significant number uses should just use
         | binary.
         | 
         | Especially true of floats!
         | 
         | With binary formats, it's _usually_ enough to only support
         | machines whose floating point representation conforms to IEEE
         | 754, which means you can just memcpy a float variable to or
         | from the file (maybe with some endianness conversion). But
         | writing a floating point parser and serializer which correctly
         | round-trips all floats and where the parser guarantees that it
         | parses to the nearest possible float... That's incredibly
         | tricky.
         | 
         | What I've sometimes done when I'm writing a parser for textual
         | floats is, I parse the input into separate parts (so the
         | integer part, the fractional part, the exponent part), then
         | serialize those parts into some other format which I already
         | have a parser for. So I may serialize them into a JSON-style
         | number and use a JSON library to parse it if I have that handy,
         | or if I don't, I serialize it into a format that's guaranteed
         | to work with strtod regardless of locale. (The C standard does,
         | surprisingly, quite significantly constrain how locales can
         | affect strtod's number parsing.)
        
           | hyperbolablabla wrote:
           | Couldn't you just write the hex bytes? That would be
           | unambiguous, and it wouldn't lose precision.
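
Both options from this subthread can be sketched with the Python standard library, assuming IEEE 754 binary64 and a little-endian on-disk order (the byte order is an arbitrary choice for the example).

```python
import struct

def pack_double(x):
    """8 bytes, little-endian IEEE 754 binary64: the 'memcpy' style encoding."""
    return struct.pack("<d", x)

def unpack_double(raw):
    return struct.unpack("<d", raw)[0]

value = 0.1 + 0.2

as_bytes = pack_double(value)      # binary format: exact by construction
as_hex_bytes = as_bytes.hex()      # the same 8 bytes spelled as text
as_hex_float = value.hex()         # hexadecimal float literal, also exact

from_bytes = unpack_double(as_bytes)
from_hex_bytes = unpack_double(bytes.fromhex(as_hex_bytes))
from_hex_float = float.fromhex(as_hex_float)
```

All three round-trip without rounding, at the cost of the text forms no longer being meaningful to a human reader.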
        
       | constantcrying wrote:
       | Also you should consider the context in which you are developing.
       | Often there are "standard" tools and methods to deal with the
       | kind of data you want to store.
       | 
       | E.g. if you are interested in storing significant amounts of
       | structured floating point data, choosing something like HDF5 will
       | not only make your life easier it will also make it easy to
       | communicate what you have done to others.
        
       | InsideOutSanta wrote:
       | > _Most extensions have three characters, which means the search
       | space is pretty crowded. You may want to consider using four
       | letters._
       | 
       | Is there a reason not to use a lot more characters? If your
       | application's name is MustacheMingle, call the file
       | foo.mustachemingle instead of foo.mumi?
       | 
       | This will decrease the probability of collision to almost zero. I
       | am unaware of any operating systems that don't allow it, and it
       | will be 100% clear to the user which application the file belongs
       | to.
       | 
       | It will be less aesthetically pleasing than a shorter extension,
       | but that's probably mainly a matter of habit. We're just not used
       | to longer file name extensions.
       | 
       | Any reason why this is a bad idea?
        
         | delusional wrote:
         | > it will be 100% clear to the user which application the file
         | belongs to.
         | 
         | The most popular operating system hides it from the user, so
         | clarity would not improve in that case. At least one other
         | (Linux) doesn't really use "extensions" and instead relies on
         | magic headers inside the files to determine the format.
         | 
         | Otherwise I think the decision is largely aesthetic. If you
         | value absolute clarity, then I don't see any reason it won't
         | work, it'll just be a little "ugly"
        
           | hiAndrewQuinn wrote:
           | I don't even think it's ugly. I'm incredibly thankful every
           | time I see someone make e.g. `db.sqlite`, it immediately sets
           | me at ease to know I'm not accidentally dealing with a DuckDB
           | file or something.
        
             | wvbdmp wrote:
             | Yes, oh my god. Stop using .db for Sqlite files!!! It's too
             | generic and it's already used by Windows for those
             | thumbnail system files.
        
           | dist-epoch wrote:
           | > At least one other (Linux) doesn't really use "extensions"
           | and instead relies on magic headers inside the files to
           | determine the format.
           | 
           | mostly for executable files.
           | 
           | I doubt many Linux apps look inside a .py file to see if it's
           | actually a JPEG they should build a thumbnail for.
        
             | scrollaway wrote:
             | Your doubts are incorrect. There's a fairly standard way of
             | extracting the file type out of files on linux, which
             | relies on a mix of extensions and magic bytes. Here's where
             | you can start to read about this:
             | 
             | https://wiki.archlinux.org/title/XDG_MIME_Applications
             | 
             | A lot of apps implement this (including most file managers)
        
               | delusional wrote:
               | I'm a little surprised that that link doesn't go to
               | libmagic[1]. No doubt XDG_MIME is an important spec for
               | desktop file detection, but I think libmagic and the
               | magic database that underpins it are more fundamental to
               | filetype detection in general.
               | 
               | It's also one of my favorite oddities on Linux. If you're
               | a Windows user the idea of a database of signatures for
               | filetypes that exists outside the application that "owns"
               | a file type is novel and weird.
               | 
               | [1]: https://man7.org/linux/man-pages/man3/libmagic.3.html
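
The idea of signature-based detection can be sketched as a tiny lookup on the first bytes of a file. This is not libmagic's database or API, only a handful of well-known signatures in a hypothetical table.

```python
# A few well-known leading signatures; real magic databases hold far more
# entries and can also test bytes at other offsets.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"SQLite format 3\x00", "application/vnd.sqlite3"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]

def sniff(head):
    """Guess a MIME type from the first bytes of a file, ignoring its name."""
    for signature, mime in MAGIC:
        if head.startswith(signature):
            return mime
    return None

def sniff_file(path):
    """Read just enough of the file to compare against the longest signature."""
    longest = max(len(signature) for signature, _ in MAGIC)
    with open(path, "rb") as f:
        return sniff(f.read(longest))
```

With this approach a JPEG saved as `photo.py` is still reported as `image/jpeg`, because the file name never enters the decision.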
        
           | whyoh wrote:
           | >The most popular operating system hides it from the user, so
           | clarity would not improve in that case.
           | 
           | If you mean Windows, that's not entirely correct. It defaults
           | to hiding only "known" file extensions, like txt, jpg and
           | such. (Which IMO is even worse than hiding all of them; that
           | would at least be consistent.)
           | 
           | EDIT: Actually, I just checked and apparently an extension,
           | even an exotic one, becomes "known" when it's associated with
           | a program, so your point still stands.
        
         | Hackbraten wrote:
         | A 14-character extension might cause UX issues in desktop
         | environments and file managers, where screen real estate per
         | directory entry is usually very limited.
         | 
         | When under pixel pressure, a graphical file manager might
         | choose to prioritize displaying the file extension and truncate
         | only the base filename. This would help the user identify file
         | formats. However, the longer the extension, the less space
         | remains for the base name. So a low-entropy file extension with
         | too many characters can contribute to poor UX.
        
         | dist-epoch wrote:
         | > call the file foo.mustachemingle
         | 
         | You could go the whole java way then
         | foo.com.apache.mustachemingle
         | 
         | > Any reason why this is a bad idea
         | 
         | the focus should be on the name, not on the extension.
        
         | layer8 wrote:
         | It's tedious to type when you want to do `ls *.mustachemingle`
         | or similar.
         | 
         | It's prone to get cut off in UIs with dedicated columns for
         | file extensions.
         | 
         | As you say, it's unconventional and therefore risks not being
         | immediately recognized as a file extension.
         | 
         | On the other hand, Java uses _.properties_ as a file extension,
         | so there is some precedent.
        
       | strogonoff wrote:
       | Thinking about a file format is a good way to clarify your
       | vision. Even if you don't want to facilitate interop, you'd get
       | some benefits for free--if you can encapsulate the state of a
       | particular _thing_ that the user is working on, you could, for
       | example, easily restore their work when they return, etc.
       | 
       | Some cop-out (not necessarily in a bad way) file formats:
       | 
       | 1. Don't have a file format, just specify a directory layout
       | instead. Example: CinemaDNG. Throw a bunch of particularly named
       | DNGs (a file for each frame of the footage) in a directory, maybe
       | add some metadata file or a marker, and you're good. Compared to
       | the likes of CRAW or BRAW, you lose in compression, but gain in
       | interop.
       | 
       | 2. Just dump runtime data. Example: Mnemosyne's old format. Do
       | you use Python? Just dump your state as a Python pickle. (Con:
       | dependency on a particular runtime, good luck rewriting it in
       | Rust.)
       | 
       | 3. Almost dump runtime data. Example: Anki, newer Mnemosyne with
       | their SQLite dumps. (Something suggests to me that they might be
       | using SQLite at runtime.) A step up from a pickle in terms of
       | interop, somewhat opens yourself (but also others) to alternative
       | implementations, at least in any runtime that has the means to
       | read SQLite. I hope if you use this you don't think that the
       | presence of SQL schema makes the format self-documenting.
       | 
       | 4. One or more of the above, except also zip or tar it up.
       | Example: VCV, Anki.
        
         | 3036e4 wrote:
         | About 1, directory of files, many formats these days are just a
         | bunch of files in a ZIP. One thing most applications lack
         | unfortunately is a way to instead just read and write the part
         | files from/to a directory. For one thing it makes it much
         | better for version control, but also just easier to access in
         | general when experimenting. I don't understand why this is not
         | more common, since as a developer it is much more fun to debug
         | things when each thing is its own file rather than an entry in
         | an archive. Most times it is also trivial to support both,
         | since any API for accessing directory entries will be close to
         | 1:1 to an API for accessing ZIP entries anyway.
         | 
         | When editing a file locally I would prefer to just have it
         | split up in a directory 99% of the time, only exporting to a
         | ZIP to publish it.
         | 
         | Of course it is trivial to write wrapper scripts to keep
         | zipping and unzipping files, and I have done that, but it does
         | feel a bit hacky and should be an unnecessary extra step.
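
A minimal sketch of the "close to 1:1" point above, assuming a flat set of named parts. The class and function names (`DirStore`, `ZipStore`, `open_store`) are hypothetical and not taken from any application mentioned in the thread; only Python's standard `os` and `zipfile` modules are used.

```python
import os
import tempfile
import zipfile


class DirStore:
    """Reads and writes named parts as plain files in a directory."""

    def __init__(self, root):
        self.root = root

    def names(self):
        return sorted(os.listdir(self.root))

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)


class ZipStore:
    """Same interface, backed by entries in a ZIP archive."""

    def __init__(self, path):
        self.path = path

    def names(self):
        with zipfile.ZipFile(self.path) as z:
            return sorted(z.namelist())

    def read(self, name):
        with zipfile.ZipFile(self.path) as z:
            return z.read(name)

    def write(self, name, data):
        # Append mode creates the archive if it does not exist yet.
        # Note: rewriting an existing name adds a duplicate entry.
        with zipfile.ZipFile(self.path, "a") as z:
            z.writestr(name, data)


def open_store(path):
    """Pick the backend from what is on disk (hypothetical helper)."""
    return DirStore(path) if os.path.isdir(path) else ZipStore(path)
```

With this shape, the rest of the application only sees `names`, `read`, and `write`, so working from a directory while editing and exporting a ZIP for publishing becomes a choice of backend rather than a wrapper script.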
        
           | strogonoff wrote:
           | Yes, the zipped version is number four. It's not great for
           | the reason you noted. Some people come up with smudge/clean
           | filters that handle the (de)compression, letting Git store
           | the more structured version of the data even though your
           | working directory contains the compressed files your software
           | can read and write--but I don't know how portable these
           | things are. I agree with you in general, and it is also why
           | my number one example is that you might not need a single-
           | file format at all. macOS app bundles are a great example of
           | this approach in the wild.
           | 
           | One question I was hoping to ask anyone who thought about
           | these matters: what accepted approaches exist out there
           | when it comes to documenting/speccing out file formats?
           | Ideally, including the cases where the "file" is in fact a
           | directory with a specific layout.
        
         | strogonoff wrote:
         | (Correction: instead of "CRAW", I should have written "Canon
         | Cinema Raw Light". Apparently, those are different things.)
        
         | trinix912 wrote:
         | > 2. Just dump runtime data. Example: Mnemosyne's old format.
         | Do you use Python? Just dump your state as a Python pickle.
         | (Con: dependency on a particular runtime, good luck rewriting
         | it in Rust.)
         | 
         | Be particularly careful with this one as it can potentially
         | vastly expand the attack surface of your program. Not that you
         | shouldn't ever do it, just make sure the deserializer doesn't
         | accept objects/values outside of your spec.
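
One way to keep a pickle-based format within a spec, sketched under assumptions: Python's `pickle.Unpickler` lets a subclass override `find_class` to restrict which globals may be loaded. The allow-list contents and the names `ALLOWED`, `RestrictedUnpickler`, and `restricted_loads` are illustrative choices, and this narrows the attack surface rather than making untrusted pickles safe.

```python
import io
import pickle

# Hypothetical allow-list: the only (module, name) globals the format permits.
ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global
        # (a class or function); reject everything not in the spec.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def restricted_loads(data):
    """Like pickle.loads, but only allow-listed globals can be resolved."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dicts, lists, strings, and numbers never go through `find_class`, so they load as usual, while a stream that references any other callable is rejected.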
        
           | strogonoff wrote:
           | I certainly hope no one takes my list as an endorsement...
           | It's just some formats seen in the wild.
           | 
           | It should be noted (the article does not) that parsing and
           | deserialisation are generally a known weak area and a common
           | source of CVEs, even when pickling is not used. Being more
           | disciplined about it helps, of course.
        
       | ahoka wrote:
       | Also, "Don't try to be clever to save a few bits.", like using
       | the lower and upper 4 bits of a byte for different things (I'm
       | looking at you, ONFI).
        
         | vrighter wrote:
         | I've had to do just that to retrofit features I wasn't allowed
         | to think about up front (we must get the product out the
         | door.... we'll cross that bridge when we get to it)
        
       | teddyh wrote:
       | Designing your file (and data) formats well is important.
       | 
       | "Show me your flowcharts and conceal your tables, and I shall
       | continue to be mystified. Show me your tables, and I won't
       | usually need your flowcharts; they'll be obvious."
       | 
       | -- Fred Brooks
        
       | zzo38computer wrote:
       | Consider DER format. Partial parsing is possible; you can easily
       | ignore any part of the file that you do not care about, since the
       | framing is consistent. Additionally, it works like the "chunked"
       | formats mentioned in the article, and one of the bits of the
       | header indicates whether it includes other chunks or includes
       | data. (Furthermore, I made up a text-based format called TER
       | which is intended to be converted to DER. TER is not intended to
       | be used directly; it is only intended to be converted to DER for
       | then use in other programs. I had also made up some additional
       | data types, and one of these (called ASN1_IDENTIFIED_DATA) can be
       | used for identifying the format of a file (which might conform to
       | multiple formats, and it allows this too).)
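
A minimal sketch of the partial-parsing property described above, assuming well-formed DER input. Each element is an identifier octet, a length, and the content; bit 0x20 of the identifier marks a constructed element (one that contains other elements). The function names `read_tlv` and `walk` are hypothetical, and TER and ASN1_IDENTIFIED_DATA are not modelled here.

```python
def read_tlv(buf, pos=0):
    """Return (identifier, constructed, content_start, content_end)."""
    ident = buf[pos]
    pos += 1
    if ident & 0x1F == 0x1F:
        # High tag number form: skip the continuation octets.
        while buf[pos] & 0x80:
            pos += 1
        pos += 1
    first = buf[pos]
    pos += 1
    if first < 0x80:
        length = first  # short form: the octet is the length
    else:
        n = first & 0x7F  # long form: next n octets hold the length
        if n == 0:
            raise ValueError("indefinite length is not valid DER")
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated element")
    return ident, bool(ident & 0x20), pos, end


def walk(buf, start=0, end=None, depth=0):
    """Yield (depth, identifier, content) for every element, depth-first."""
    end = len(buf) if end is None else end
    pos = start
    while pos < end:
        ident, constructed, cstart, cend = read_tlv(buf, pos)
        yield depth, ident, buf[cstart:cend]
        if constructed:
            yield from walk(buf, cstart, cend, depth + 1)
        pos = cend  # skipping an element is just a jump to its end
```

A reader that does not care about some element can ignore its content entirely and continue at `content_end`, which is what makes the consistent framing convenient.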
       | 
       | I dislike JSON and some other modern formats (even binary
       | formats); they often are just not as good in my opinion. One
       | problem is they tend to insist on using Unicode, and/or on other
       | things (e.g. 32-bit integers where you might need 64 bits). When
       | using a text-based format where binary would do better, it can
       | also be inefficient, particularly if binary data is included
       | within the text, and even more so if the format does not indicate
       | that it is meant to represent binary data.
       | 
       | However, even if you use an existing format, you should avoid
       | using the existing format badly; using existing formats badly
       | seems to be common. There is also the issue of whether the existing
       | format is actually good or not; many formats are not good, for
       | various reasons (some of which I mentioned above, but there are
       | others, depending on the application).
       | 
       | About target hardware, not all software is intended for a
       | specific target hardware, although some is.
       | 
       | For compression, another consideration is that, besides general
       | compression schemes, you can make up a compression scheme that is
       | specific to the kind of data being compressed.
       | 
       | They also mention file names. However, this can also depend on
       | the target system; e.g. for DOS files you will need to be limited
       | to three characters after the dot. Also, some programs would not
       | need to care about file names in some or all cases (many programs
       | I write don't care about file names).
        
         | aidenn0 wrote:
         | Maybe it's just because I've never needed the complexity, but
         | ASN.1 seems a bit much for any of the formats I've created.
        
           | zzo38computer wrote:
           | For me too, although you only need to use (and implement) the
           | parts which are relevant for your application and not all of
           | them, so it is not really the problem. (I also never needed
           | to write ASN.1 schemas, and a full implementation of ASN.1 is
           | not necessary for my purpose.) (This is also a reason I use
           | DER instead of BER, even if canonical form is not required;
           | DER is simpler to handle than all of the possibilities of
           | BER.)
        
       | fake-name wrote:
       | I have a rather idiosyncratic opinion here:
       | 
       | For Open-Source projects, human readable file formats are
       | actively harmful.
       | 
       | This mostly is motivated by my experience with KiCad.
       | Principally, there are multiple things that the UI does not
       | expose at all (slots in PCB footprint files) where the _only_ way
       | to add them is to manually edit the footprint file in a text
       | editor.
       | 
       | There are some other similar annoyances in the same vein.
       | 
       | Basically, human readable (and therefore editable) file formats
       | wind up being a way for some things to never be exposed thru the
       | UI. This actively leads to the software being less capable.
        
         | zzo38computer wrote:
         | Not exposing things in the UI is not necessarily a problem (it
         | depends on the program and on other stuff), although it can be
         | (especially if it is not documented). I have not used the
         | program you mention, but it does seem to be a problem in the
         | way you describe; someone who wants to add it to the UI could
         | hopefully do so if it is FOSS. However, one potential problem
         | with a text-based format is that writing it (in a way which
         | still remains readable rather than messy) can sometimes be
         | more complicated than reading it.
         | 
         | (The TEMPLATE.DER lump (which is a binary file format and not
         | plain text) in Super ZZ Zero is not exposed anywhere in the UI;
         | you must use an external program to create this lump if you
         | want it. Fortunately that lump is not actually mandatory, and
         | only affects the automatic initial modifications of a new world
         | file based on an existing template.)
         | 
         | However, I think that human readable file formats are harmful
         | for other reasons.
        
       | layer8 wrote:
       | > However, it's cleaner to have a field in your header that
       | states where the first sub-chunk starts; that way you can expand
       | your header as much as you like in future versions, with old code
       | being able to ignore those fields and jump to the good stuff.
       | 
       | That's assuming that parsers will honor this, and not just use
       | the fixed offset that worked for the past ten years. This has
       | happened often enough in the past.
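
A small sketch of the quoted idea and of the failure mode described above, under assumptions: the layout (a 4-byte magic, a 16-bit version, and a 32-bit offset to the first chunk), the `DEMO` magic, and the function names are all made up for illustration.

```python
import struct

MAGIC = b"DEMO"  # hypothetical magic number
# Fixed part of the header: magic, version, offset of the first chunk.
HEADER_V1 = struct.Struct("<4sHI")


def write_file(version, extra_header, body):
    """Build a file whose header may carry extra, version-specific fields."""
    offset = HEADER_V1.size + len(extra_header)
    return HEADER_V1.pack(MAGIC, version, offset) + extra_header + body


def read_body(blob):
    """Honor the stored offset instead of assuming the v1 header size."""
    magic, version, offset = HEADER_V1.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if not HEADER_V1.size <= offset <= len(blob):
        raise ValueError("bad data offset")
    return blob[offset:]
```

A reader that hard-codes `blob[HEADER_V1.size:]` gives the same result on version 1 files, which is why the shortcut can survive for years, and then silently returns header bytes as data once a later version grows the header.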
        
       ___________________________________________________________________
       (page generated 2025-05-25 23:01 UTC)