[HN Gopher] On File Formats
___________________________________________________________________
On File Formats
Author : ibobev
Score : 104 points
Date : 2025-05-21 07:48 UTC (4 days ago)
(HTM) web link (solhsa.com)
(TXT) w3m dump (solhsa.com)
| adelpozo wrote:
| I would add: make it streamable, or at least allow it to be read
| remotely efficiently.
| flowerthoughts wrote:
| Agreed on that one. With a nice file format, streamable is
| hopefully just a matter of ordering things appropriately once
| you know the sizes of the individual chunks. You want to write
| the index last, but you want to read it first. Perhaps you want
| the most influential values first if you're building something
| progressive (level-of-detail split.)
|
| Similar is the discussion of delimited fields vs. length
| prefix. Delimited fields are nicer to write, but length
| prefixed fields are nicer to read. I think most new formats use
| length prefixes, so I'd start there. I wrote a blog post about
| combining the value and length into a VLI that also handles
| floating point and bit/byte strings:
| https://tommie.github.io/a/2024/06/small-encoding
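A minimal sketch of the length-prefix idea discussed above, assuming a plain LEB128-style variable-length integer for the length. This is not the encoding from the linked blog post, and the helper names are made up for illustration.

```python
import io

def write_varint(out, n):
    """Write a non-negative int, 7 bits per byte; a set high bit means more bytes follow."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return

def read_varint(src):
    """Inverse of write_varint; raises ValueError on a truncated stream."""
    shift = result = 0
    while True:
        (byte,) = src.read(1)
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

def write_field(out, payload):
    # Length first, then the bytes: the reader never has to scan for a delimiter.
    write_varint(out, len(payload))
    out.write(payload)

def read_field(src):
    return src.read(read_varint(src))
```

The payload may contain any bytes, including ones a delimited format would have to escape. The cost is on the writer, which must know the length before it writes the field.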
| lifthrasiir wrote:
| I don't think a single encoding is generally useful. A good
| encoding for given application would depend on the value
| distribution and neighboring data. For example any variable-
| length scalar encoding would make vectorization much harder.
| mjevans wrote:
| Most of that's pretty good.
|
| Compression: For anything that ends up large it's probably
| desired. Though consider both algorithm and 'strength' based on
| the use case carefully. Even a simple algorithm might make things
| faster when it comes time to transfer or write to permanent
| storage. A high cost search to squeeze out yet more redundancy is
| probably worth it if something will be copied and/or decompressed
| many times, but might not be worth it for that locally compiled
| kernel you'll boot at most 10 times before replacing it with
| another.
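One way to check the trade-off described above on your own data is to measure it. The sketch below uses zlib only because it ships with Python; the function name and the chosen levels are illustrative assumptions.

```python
import time
import zlib

def compare_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed size, seconds spent compressing)}."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: every level must round-trip to the same bytes.
        assert zlib.decompress(packed) == data
        results[level] = (len(packed), elapsed)
    return results
```

Timings depend on the machine and the input, so treat the numbers as a guide for one workload rather than a general ranking.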
| thasso wrote:
| For archive formats, or anything that has a table of contents or
| an index, consider putting the index at the end of the file so
| that you can append to it without moving a lot of data around.
| This also allows for easy concatenation.
| charcircuit wrote:
| Why not put it at the beginning, so that it is available at the
| start of the file stream? That way it is easier to fetch first, so
| you know which other ranges of the file you may need.
|
| >This also allows for easy concatenation.
|
| How would it be easier than putting it at the front?
| lifthrasiir wrote:
| If the archive is being updated in place, turning ABC# into
| ABCD#' (where # and #' are indices) is easier than turning
| #ABC into #'ABCD. The actual position of indices doesn't
| matter much if the stream is seekable. I don't think the
| concatenation is a good argument though.
| shakna wrote:
| Files are... Flat streams. Sort of.
|
| So if you rewrite an index at the head of the file, you may
| end up having to rewrite everything that comes afterwards, to
| push it further down in the file, if it overflows any padding
| offset. Which makes appending an extremely slow operation.
|
| Whereas seeking to end, and then rewinding, is not nearly as
| costly.
| charcircuit wrote:
| Most workflows do not modify files in place but rather
| create new files as it's safer and allows you to go back to
| the original if you made a mistake.
| shakna wrote:
| If you're writing twice, you don't care about the
| performance to begin with. Or the size of the files being
| produced.
|
| But if you're writing indices, there's a good chance that
| you do care about performance.
| charcircuit wrote:
| Files are often authored once and read / used many times.
| When authoring a file performance is less important and
| there is plenty of file space available. Indices are for
| the performance for using the file which is more
| important than the performance for authoring it.
| shakna wrote:
| If storage and performance aren't a concern when writing,
| then you probably shouldn't be doing workarounds to
| include the index in the file itself. Follow the dbm
| approach and separate both into two different files.
|
| Which is what dbm, bdb, Windows search indexes, IBM
| datasets, and so many, many other standards will do.
| charcircuit wrote:
| Separate files aren't always the answer. It can be more
| awkward to need to download both and always keep them
| together compared to when it's a single file.
| PhilipRoman wrote:
| You can do it via fallocate(2) FALLOC_FL_INSERT_RANGE and
| FALLOC_FL_COLLAPSE_RANGE but sadly these still have a lot
| of limitations and are not portable. Based on discussions
| I've read, it seems there is no real motivation for
| implementing support for it, since anyone who cares about
| the performance of doing this will use some DB format
| anyway.
|
| In theory, files should be just unrolled linked lists (or
| trees) of bytes, but I guess a lot of internal code still
| assumes full, aligned blocks.
| McGlockenshire wrote:
| > How would it be easier than putting it at the front?
|
| Have you ever wondered why `tar` is the Tape Archive? Tape.
| Magnetic recording tape. You stream data to it, and rewinding
| is Hard, so you put the list of files you just dealt with at
| the very end. This now-obsolete hardware expectation touches
| us decades later.
| jclulow wrote:
| tar streams don't have an index at all, actually, they're
| just a series of header blocks and data blocks. Some backup
| software built on top may include a catalog of some kind
| inside the tar stream itself, of course, and may choose to
| do so as the last entry.
| ahoka wrote:
| IIRC, the original TAR format was just writing the
| 'struct stat' from sys/stat.h, followed by the file
| contents for each file.
| charcircuit wrote:
| But new file formats being developed are most likely not
| going to be designed to be used with tapes. If you want to
| avoid rewinds you can write a new concatenated version of
| the files. This also allows you to keep the original in
| case you need it.
| MattPalmer1086 wrote:
| Imagine you have a 12 GB zip file, and you want to add one
| more file to it. Very easy and quick if the index is at the
| end, very slow if it's at the start (assuming your index now
| needs more space than is available currently).
|
| Reading the index from the end of the file is also quick;
| where you read next depends on what you are trying to find in
| it, which may not be the start.
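The append case above can be sketched with a toy layout: entry data first, then a JSON index, then a fixed-size footer that says where the index is. The layout and all names here are assumptions made for illustration; this is not how ZIP encodes its central directory.

```python
import io
import json
import struct

FOOTER = struct.Struct("<QQ")  # index offset, index length

def _load_index(f):
    f.seek(0, io.SEEK_END)
    if f.tell() == 0:
        return {}, 0  # empty archive: no entries, data ends at offset 0
    f.seek(-FOOTER.size, io.SEEK_END)
    offset, length = FOOTER.unpack(f.read(FOOTER.size))
    f.seek(offset)
    return json.loads(f.read(length)), offset

def append_entry(f, name, data):
    index, data_end = _load_index(f)
    f.seek(data_end)  # overwrite only the old index; earlier entries stay put
    f.write(data)
    index[name] = [data_end, len(data)]
    raw = json.dumps(index).encode()
    f.write(raw)
    f.write(FOOTER.pack(data_end + len(data), len(raw)))
    f.truncate()

def read_entry(f, name):
    index, _ = _load_index(f)
    offset, size = index[name]
    f.seek(offset)
    return f.read(size)
```

Appending rewrites only the index and footer, however large the existing entries are. Reading needs a seekable file, which is the limitation raised in the replies.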
| orphea wrote:
| Some formats are meant to be streamable. And if the stream
| is not seekable, then you have to read all 12 GB before you
| get to the index.
|
| The point is, not all is black and white. Where to put the
| index is just another trade off.
| strogonoff wrote:
| Different trade-offs is why it might make sense to
| embrace the Unix way for file formats: do one thing well,
| and document it so that others can do a different thing
| well with the same data and no loss.
|
| For example, if it is an archival/recording oriented use
| case, then you make it cheap/easy to add data and
| possibly add some resiliency for when recording process
| crashes. If you want efficient random access, streaming,
| storage efficiency, the same dataset can be stored in a
| different layout without loss of quality--and conversion
| between them doesn't have to be extremely optimal, it
| just should be possible to implement from spec.
|
| Like, say, you record raw video. You want "all of the
| quality" and you know all in all it's going to take
| terabytes, so bringing excess capacity is basically a
| given when shooting. Therefore, if some camera maker, in
| its infinite wisdom, creates a proprietary undocumented
| format to sliiightly improve on file size but
| "accidentally" makes it unusable in most software without
| first converting it using their own proprietary tool, you
| may justifiably not appreciate it. (Canon Cinema Raw
| Light HQ--I kid you not, that's what it's called--I'm
| looking at you.)
|
| On this note, what are the best/accepted approaches out
| there when it comes to documenting/speccing out file
| formats? Ideally something generalized enough that it can
| also handle cases where the "file" is in fact a
| particularly structured directory (a la macOS app
| bundle).
| trinix912 wrote:
| Adding to the recording _raw_ video point, for such
| purposes, try to design the format so that losing a
| portion of the file doesn't render it entirely unusable.
| Kinda like how you can recover DV video from spliced
| tapes because the data for the current frame (+/- the
| bordering frame) is enough to start a valid new file
| stream.
| MattPalmer1086 wrote:
| Yes, a good point. Each file format must try to optimise
| for the use cases it supports of course.
| vrighter wrote:
| make the index a linked data structure. You can then
| extend it whenever, wherever
| jonstewart wrote:
| That's true, but streamable formats often don't need an
| index.
|
| A team member just created a new tool that uses the tar
| format (streamable), but then puts the index as the
| penultimate entry, with the last entry just being a fixed
| size entry with the offset of the beginning of the index.
|
| In this way normal tar tools just work but it's possible
| to retrieve a listing and access a file randomly. It's
| also still possible to append to it in the future, modulo
| futzing with the index a bit.
|
| (The intended purpose is archiving files that were stored
| as S3 objects back into S3.)
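A rough reconstruction of the layout described above, using Python's tarfile module. The member names INDEX.json and INDEX.ptr, the JSON index, and the 40-digit pointer are assumptions made for this sketch; the actual tool may differ in every detail.

```python
import io
import json
import tarfile

BLOCK = 512  # tar block size

def add_bytes(tar, raw, name, data):
    """Append one member; return (offset of its data in the archive, size)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
    padded = -(-len(data) // BLOCK) * BLOCK  # data is padded to whole blocks
    return raw.tell() - padded, len(data)

def build(files):
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        index = {name: add_bytes(tar, raw, name, data) for name, data in files.items()}
        # Penultimate entry: the index itself.
        idx_off, idx_len = add_bytes(tar, raw, "INDEX.json", json.dumps(index).encode())
        # Last entry: fixed-size pointer to the index.
        add_bytes(tar, raw, "INDEX.ptr", f"{idx_off:020d}{idx_len:020d}".encode())
    return raw.getvalue()

def read_member(blob, name):
    # Skip the zero-filled end-of-archive padding to reach the pointer's data block.
    pos = len(blob) - BLOCK
    while pos >= 0 and not any(blob[pos:pos + BLOCK]):
        pos -= BLOCK
    idx_off, idx_len = int(blob[pos:pos + 20]), int(blob[pos + 20:pos + 40])
    index = json.loads(blob[idx_off:idx_off + idx_len])
    off, size = index[name]
    return blob[off:off + size]
```

Ordinary tar tools see two extra members and otherwise read the archive as usual. An index-aware reader only needs the tail of the file plus the byte ranges it picks from the index.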
| zzo38computer wrote:
| What would probably make concatenation even easier is to
| store the header of each file immediately preceding the
| data of that file. You can make an index in memory when reading
| the file if that is helpful for your use.
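A small sketch of that header-before-data layout, assuming a made-up record format (2-byte name length, 4-byte data length, name, data). Concatenating two such files gives a valid file, and the index is rebuilt by one scan.

```python
import struct

HEADER = struct.Struct("<HI")  # name length, data length

def pack_record(name, data):
    encoded = name.encode("utf-8")
    return HEADER.pack(len(encoded), len(data)) + encoded + data

def scan(blob):
    """Walk the records once and build an in-memory index: name -> (offset, size)."""
    index, pos = {}, 0
    while pos < len(blob):
        name_len, data_len = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        index[name] = (pos, data_len)
        pos += data_len  # jump over the data without reading it
    return index
```

If a name appears twice, the later record wins in this sketch, so appending a new version of a file acts as an update.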
| leiserfg wrote:
| If binary, consider just using SQLite.
| lifthrasiir wrote:
| Using SQLite as a container format is only beneficial when the
| file format itself is a composite, like word processor files
| which will include both the textual data and any attachments.
| SQLite is just a hindrance otherwise, like image file formats
| or archival/compressed file formats [1].
|
| [1] SQLite's own sqlar format is a bad idea for this reason.
| frainfreeze wrote:
| sqlar proved a great solution in the past for me. Where does
| it fall short in your experience?
| lifthrasiir wrote:
| Unless you are using the container file as a database too,
| sqlar is strictly inferior to ZIP in terms of pretty much
| everything [1]. I'm actually more interested in the context in
| which sqlar proved useful for you.
|
| [1] https://news.ycombinator.com/item?id=28670418
| frainfreeze wrote:
| I remember seeing the comment you linked a few years back,
| and back then comments were already locked so I couldn't
| reply, and this time I sadly don't have the time to get
| deeper into this. However, I recommend researching
| more about sqlar/using sqlite db as _file format_ in
| general, or at minimum looking at SQLite Encryption
| Extension (SEE)
| (https://www.sqlite.org/see/doc/trunk/www/readme.wiki).
| You can get a lot out of the box with very little
| investment. IMHO sqlar is not competing with ZIP (can zip
| do metadata and transactions?)
| SyrupThinker wrote:
| From my own experience SQLite works just fine as the
| container for an archive format.
|
| It ends up having some overhead compared to established ones,
| but the ability to query over the attributes of 10000s of
| files is pretty nice, and definitely faster than the worst
| case of tar.
|
| My archiver could even keep up with 7z in some cases (for
| size and access speed).
|
| Implementing it is also not particularly tricky, and SQLite
| even allows streaming the blobs.
|
| Making readers for such a format seems more accessible to me.
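For readers who have not tried it, here is a minimal sketch of an SQLite-backed archive using Python's built-in sqlite3 module. The table follows the sqlar schema (name, mode, mtime, sz, data); the helper functions are illustrative and unrelated to the archiver mentioned above.

```python
import sqlite3
import zlib

def create_archive(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sqlar("
        "name TEXT PRIMARY KEY, mode INT, mtime INT, sz INT, data BLOB)"
    )
    return db

def add_file(db, name, content, mtime=0):
    packed = zlib.compress(content)
    # As in sqlar, keep the raw bytes when compression does not make them smaller.
    data = packed if len(packed) < len(content) else content
    with db:  # one transaction per file
        db.execute(
            "INSERT OR REPLACE INTO sqlar VALUES (?, ?, ?, ?, ?)",
            (name, 0o644, mtime, len(content), data),
        )

def read_file(db, name):
    sz, data = db.execute(
        "SELECT sz, data FROM sqlar WHERE name = ?", (name,)
    ).fetchone()
    # A stored size equal to the blob size means the entry is uncompressed.
    return data if len(data) == sz else zlib.decompress(data)
```

Queries over attributes then come for free, for example `SELECT name FROM sqlar WHERE sz > 100 ORDER BY name`.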
| lifthrasiir wrote:
| SQLite format itself is not very simple, because it is a
| database file format in its heart. By using SQLite you are
| unknowingly constraining your use case; for example you can
| indeed stream BLOBs, but you can't randomly access BLOBs
| because the SQLite format puts a large BLOB into pages in a
| linked list, at least when I checked last. And BLOBs are
| limited in size anyway (4GB AFAIK) so streaming itself
| might not be that useful. The use of SQLite also means that
| you have to bring SQLite into your code base, and SQLite is
| not very small if you are just using it as a container.
|
| > My archiver could even keep up with 7z in some cases (for
| size and access speed).
|
| 7z might feel slow because it enables solid compression by
| default, which trades decompression speed for compression
| ratio. I can't imagine 7z having a similar compression
| ratio with correct options though, was your input
| incompressible?
| sureglymop wrote:
| I think it's fine as an image format. I've used the mbtiles
| format which is basically just a table filled with map tiles.
| SQLite makes it super easy to deal with it, e.g. to dump
| individual blobs and save them as image files.
|
| It just may not always be the most performant option. For
| example, for map tiles there is alternatively the pmtiles
| binary format which is optimized for http range requests.
| InsideOutSanta wrote:
| The Mac image editor Acorn uses SQLite as its file format.
| It's described here:
|
| https://shapeof.com/archives/2025/4/acorn_file_format.html
|
| The author notes that an advantage is that other programs can
| easily read the file format and extract information from it.
| lifthrasiir wrote:
| It is clearly a composite file format [1]:
|
| > Acorn's native file format is used to losslessly store
| layer data, editable text, layer filters, an optional
| composite of the image, and various metadata. Its advantage
| over other common formats such as PNG or JPEG is that it
| preserves all this native information without flattening
| the layer data or vector graphics.
|
| As I've mentioned, this is a good use case for SQLite as a
| container. But ZIP would work equally well here.
|
| [1]
| https://flyingmeat.com/acorn/docs/technotes/ACTN002.html
| aidenn0 wrote:
| Except image formats and archival formats _are_ composites
| (data+metadata). We have Exif for images, and you might be
| surprised by how much metadata the USTar format has.
| paulddraper wrote:
| Did you read the article?
|
| That wouldn't support partial parsing.
| shakna wrote:
| Spent the weekend with an untagged chunked format, and... I
| rather hate it.
|
| A friend wanted a newer save viewer/editor for Dragonball
| Xenoverse 2, because there's about a total of two, and they're
| slow to update.
|
| I thought it'd be fairly easy to spin up something to read it,
| because I've spun up a bunch of save editors before, and they're
| usually trivial.
|
| XV2 save files change over versions. They're also just arrays of
| structs [0], that don't properly identify themselves, so some
| parts of them you're just guessing. Each chunk can also contain
| chunks - some of which are actually a network request to get more
| chunks from elsewhere in the codebase!
|
| [0] Also encrypted before dumping to disk, but the keys have been
| known since about the second release, and they've never switched
| them.
| lifthrasiir wrote:
| Generally good points. Unfortunately existing file formats are
| rarely following these rules. In fact these rules should form
| naturally when you are dealing with many different file formats
| anyway. Specific points follow:
|
| - Agreed that human-readable formats have to be dead simple,
| otherwise binary formats should be used. Note that textual
| numbers are surprisingly complex to handle, so any formats with
| significant number uses should just use binary.
|
| - Chunking is generally good for structuring and incremental
| parsing, but do not expect it to provide reorderability or
| back/forward compatibility somehow. Unless explicitly designed,
| they do not exist. Consider PNG for example; PNG chunks were
| designed to be quite robust, but nowadays some exceptions [1] do
| exist. Versioning is much more crucial for that.
|
| [1] https://www.w3.org/TR/png/#animation-information
|
| - Making a new file format from scratch is always difficult.
| Already mentioned, but you should really consider using existing
| file formats as a container first. Some formats are even
| explicitly designed for this purpose, like sBOX [2] or RFC 9277
| CBOR-labeled data tags [3].
|
| [2] https://nothings.org/computer/sbox/sbox.html
|
| [3] https://www.rfc-editor.org/rfc/rfc9277.html
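To make the chunking point concrete, here is a sketch of a PNG-style chunk stream: a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over type plus data. It omits the PNG signature, and the `parse` policy (skip unknown ancillary chunks, reject unknown critical ones) is a simplified assumption about how a reader might behave.

```python
import struct
import zlib

def make_chunk(kind, data):
    body = kind + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

def iter_chunks(blob):
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        kind = blob[pos + 4:pos + 8]
        data = blob[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from(">I", blob, pos + 8 + length)
        if crc != zlib.crc32(kind + data):
            raise ValueError(f"bad CRC in chunk {kind!r}")
        yield kind, data
        pos += 12 + length  # 4 length + 4 type + 4 CRC, plus the data

def is_ancillary(kind):
    # PNG convention: a lowercase first letter (bit 5 set) marks a chunk
    # that a decoder may skip if it does not understand it.
    return bool(kind[0] & 0x20)

def parse(blob, known):
    seen = []
    for kind, _data in iter_chunks(blob):
        if kind in known:
            seen.append(kind)
        elif not is_ancillary(kind):
            raise ValueError(f"unknown critical chunk {kind!r}")
    return seen
```

The skip rule exists only because the format defines the ancillary bit. Nothing about chunking alone tells a reader which unknown chunks are safe to ignore.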
| mort96 wrote:
| > Note that textual numbers are surprisingly complex to handle,
| so any formats with significant number uses should just use
| binary.
|
| Especially true of floats!
|
| With binary formats, it's _usually_ enough to only support
| machines whose floating point representation conforms to IEEE
| 754, which means you can just memcpy a float variable to or
| from the file (maybe with some endianness conversion). But
| writing a floating point parser and serializer which correctly
| round-trips all floats and where the parser guarantees that it
| parses to the nearest possible float... That's incredibly
| tricky.
|
| What I've sometimes done when I'm writing a parser for textual
| floats is, I parse the input into separate parts (so the
| integer part, the fractional part, the exponent part), then
| serialize those parts into some other format which I already
| have a parser for. So I may serialize them into a JSON-style
| number and use a JSON library to parse it if I have that handy,
| or if I don't, I serialize it into a format that's guaranteed
| to work with strtod regardless of locale. (The C standard does,
| surprisingly, quite significantly constrain how locales can
| affect strtod's number parsing.)
| hyperbolablabla wrote:
| Couldn't you just write the hex bytes? That would be
| unambiguous, and it wouldn't lose precision.
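Both ideas from this subthread are short to sketch in Python, assuming IEEE 754 doubles: writing the raw 8 bytes (optionally as hex text), and the C99-style hexadecimal float notation that Python exposes as `float.hex` and `float.fromhex`. The function names are illustrative.

```python
import struct

def float_to_hex_bytes(x):
    """Hex text of the big-endian IEEE 754 double, e.g. for a text format."""
    return struct.pack(">d", x).hex()

def float_from_hex_bytes(text):
    return struct.unpack(">d", bytes.fromhex(text))[0]

def float_to_hex_literal(x):
    """C99-style hexadecimal float; exact, and still somewhat human-readable."""
    return x.hex()

def float_from_hex_literal(text):
    return float.fromhex(text)
```

Either form round-trips every finite double exactly, because no decimal conversion is involved. The price is that the text is no longer something a person can edit comfortably.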
| constantcrying wrote:
| Also you should consider the context in which you are developing.
| Often there are "standard" tools and methods to deal with the
| kind of data you want to store.
|
| E.g. if you are interested in storing significant amounts of
| structured floating point data, choosing something like HDF5 will
| not only make your life easier it will also make it easy to
| communicate what you have done to others.
| InsideOutSanta wrote:
| > _Most extensions have three characters, which means the search
| space is pretty crowded. You may want to consider using four
| letters._
|
| Is there a reason not to use a lot more characters? If your
| application's name is MustacheMingle, call the file
| foo.mustachemingle instead of foo.mumi?
|
| This will decrease the probability of collision to almost zero. I
| am unaware of any operating systems that don't allow it, and it
| will be 100% clear to the user which application the file belongs
| to.
|
| It will be less aesthetically pleasing than a shorter extension,
| but that's probably mainly a matter of habit. We're just not used
| to longer file name extensions.
|
| Any reason why this is a bad idea?
| delusional wrote:
| > it will be 100% clear to the user which application the file
| belongs to.
|
| The most popular operating system hides it from the user, so
| clarity would not improve in that case. At least one other
| (Linux) doesn't really use "extensions" and instead relies on
| magic headers inside the files to determine the format.
|
| Otherwise I think the decision is largely aesthetic. If you
| value absolute clarity, then I don't see any reason it won't
| work, it'll just be a little "ugly"
| hiAndrewQuinn wrote:
| I don't even think it's ugly. I'm incredibly thankful every
| time I see someone make e.g. `db.sqlite`, it immediately sets
| me at ease to know I'm not accidentally dealing with a DuckDB
| file or something.
| wvbdmp wrote:
| Yes, oh my god. Stop using .db for SQLite files!!! It's too
| generic and it's already used by Windows for those
| thumbnail system files.
| dist-epoch wrote:
| > At least one other (Linux) doesn't really use "extensions"
| and instead relies on magic headers inside the files to
| determine the format.
|
| mostly for executable files.
|
| I doubt many Linux apps look inside a .py file to see if it's
| actually a JPEG they should build a thumbnail for.
| scrollaway wrote:
| Your doubts are incorrect. There's a fairly standard way of
| extracting the file type out of files on linux, which
| relies on a mix of extensions and magic bytes. Here's where
| you can start to read about this:
|
| https://wiki.archlinux.org/title/XDG_MIME_Applications
|
| A lot of apps implement this (including most file managers)
| delusional wrote:
| I'm a little surprised that that link doesn't go to
| libmagic[1]. No doubt XDG_MIME is an important spec for
| desktop file detection, but I think libmagic and the
| magic database that underpins it are more fundamental to
| filetype detection in general.
|
| It's also one of my favorite oddities on Linux. If you're
| a Windows user the idea of a database of signatures for
| filetypes that exists outside the application that "owns"
| a file type is novel and weird.
|
| [1]: https://man7.org/linux/man-pages/man3/libmagic.3.html
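The core of signature-based detection can be sketched in a few lines. This is a toy table of well-known leading bytes and in no way a replacement for libmagic or the shared MIME database; the labels and function name are made up.

```python
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"SQLite format 3\x00", "SQLite database"),
    (b"\x1f\x8b", "gzip stream"),
]

def sniff(head):
    """Guess a format from the first bytes of a file; None if nothing matches."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None
```

A file named `holiday.py` that starts with the JPEG bytes is reported as a JPEG here, which is the behaviour debated above. It also shows why a new format benefits from a distinctive magic number at offset 0.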
| whyoh wrote:
| >The most popular operating system hides it from the user, so
| clarity would not improve in that case.
|
| If you mean Windows, that's not entirely correct. It defaults
| to hiding only "known" file extensions, like txt, jpg and
| such. (Which IMO is even worse than hiding all of them; that
| would at least be consistent.)
|
| EDIT: Actually, I just checked and apparently an extension,
| even an exotic one, becomes "known" when it's associated with
| a program, so your point still stands.
| Hackbraten wrote:
| A 14-character extension might cause UX issues in desktop
| environments and file managers, where screen real estate per
| directory entry is usually very limited.
|
| When under pixel pressure, a graphical file manager might
| choose to prioritize displaying the file extension and truncate
| only the base filename. This would help the user identify file
| formats. However, the longer the extension, the less space
| remains for the base name. So a low-entropy file extension with
| too many characters can contribute to poor UX.
| dist-epoch wrote:
| > call the file foo.mustachemingle
|
| You could go the whole java way then
| foo.com.apache.mustachemingle
|
| > Any reason why this is a bad idea
|
| the focus should be on the name, not on the extension.
| layer8 wrote:
| It's tedious to type when you want to do `ls *.mustachemingle`
| or similar.
|
| It's prone to get cut off in UIs with dedicated columns for
| file extensions.
|
| As you say, it's unconventional and therefore risks not being
| immediately recognized as a file extension.
|
| On the other hand, Java uses _.properties_ as a file extension,
| so there is some precedent.
| strogonoff wrote:
| Thinking about a file format is a good way to clarify your
| vision. Even if you don't want to facilitate interop, you'd get
| some benefits for free--if you can encapsulate the state of a
| particular _thing_ that the user is working on, you could, for
| example, easily restore their work when they return, etc.
|
| Some cop-out (not necessarily in a bad way) file formats:
|
| 1. Don't have a file format, just specify a directory layout
| instead. Example: CinemaDNG. Throw a bunch of particularly named
| DNGs (a file for each frame of the footage) in a directory, maybe
| add some metadata file or a marker, and you're good. Compared to
| the likes of CRAW or BRAW, you lose in compression, but gain in
| interop.
|
| 2. Just dump runtime data. Example: Mnemosyne's old format. Do
| you use Python? Just dump your state as a Python pickle. (Con:
| dependency on a particular runtime, good luck rewriting it in
| Rust.)
|
| 3. Almost dump runtime data. Example: Anki, newer Mnemosyne with
| their SQLite dumps. (Something suggests to me that they might be
| using SQLite at runtime.) A step up from a pickle in terms of
| interop, somewhat opens yourself (but also others) to alternative
| implementations, at least in any runtime that has the means to
| read SQLite. I hope if you use this you don't think that the
| presence of SQL schema makes the format self-documenting.
|
| 4. One or more of the above, except also zip or tar it up.
| Example: VCV, Anki.
| 3036e4 wrote:
| About 1, directory of files, many formats these days are just a
| bunch of files in a ZIP. One thing most applications lack
| unfortunately is a way to instead just read and write the part
| files from/to a directory. For one thing it makes it much
| better for version control, but also just easier to access in
| general when experimenting. I don't understand why this is not
| more common, since as a developer it is much more fun to debug
| things when each thing is its own file rather than an entry in
| an archive. Most times it is also trivial to support both,
| since any API for accessing directory entries will be close to
| 1:1 to an API for accessing ZIP entries anyway.
|
| When editing a file locally I would prefer to just have it
| split up in a directory 99% of the time, only exporting to a
| ZIP to publish it.
|
| Of course it is trivial to write wrapper scripts to keep
| zipping and unzipping files, and I have done that, but it does
| feel a bit hacky and should be an unnecessary extra step.
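In Python the "close to 1:1" observation holds fairly literally: `pathlib.Path` and `zipfile.Path` share enough of an interface that a reader can treat a directory and a ZIP alike. A minimal sketch, with hypothetical function names:

```python
import pathlib
import zipfile

def open_container(path):
    """Return a path-like root for either an unpacked directory or a ZIP file."""
    path = pathlib.Path(path)
    return path if path.is_dir() else zipfile.Path(path)

def read_part(root, name):
    # Both pathlib.Path and zipfile.Path support "/" and read_bytes().
    return (root / name).read_bytes()
```

Writing is less symmetric, since a ZIP entry cannot be rewritten in place, so a tool built this way would probably edit the directory form and zip it only on export.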
| strogonoff wrote:
| Yes, the zipped version is number four. It's not great for
| the reason you noted. Some people come up with smudge/clean
| filters that handle the (de)compression, letting Git store
| the more structured version of the data even though your
| working directory contains the compressed files your software
| can read and write--but I don't know how portable these
| things are. I agree with you in general, and it is also why
| my number one example is that you might not need a single-
| file format at all. macOS app bundles are a great example of
| this approach in the wild.
|
| One question I was hoping to ask anyone who thought about
| these matters: what accepted approaches do exist out there
| when it comes to documenting/speccing out file formats?
| Ideally, including the cases where the "file" is in fact a
| directory with a specific layout.
| strogonoff wrote:
| (Correction: instead of "CRAW", I should have written "Canon
| Cinema Raw Light". Apparently, those are different things.)
| trinix912 wrote:
| > 2. Just dump runtime data. Example: Mnemosyne's old format.
| Do you use Python? Just dump your state as a Python pickle.
| (Con: dependency on a particular runtime, good luck rewriting
| it in Rust.)
|
| Be particularly careful with this one as it can potentially
| vastly expand the attack surface of your program. Not that you
| shouldn't ever do it, just make sure the deserializer doesn't
| accept objects/values outside of your spec.
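For pickle specifically, the standard library documents a way to restrict what a load may reference: subclass `pickle.Unpickler` and override `find_class`. The sketch below forbids every global, so only plain containers, numbers, strings and bytes survive. The class and function names are illustrative, and this narrows the attack surface without making untrusted pickles safe in general.

```python
import io
import pickle

class PlainDataUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle refers to a class or function by name.
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def safe_loads(data):
    """Load a pickle that may contain only built-in plain data types."""
    return PlainDataUnpickler(io.BytesIO(data)).load()
```

A save file that legitimately contains your own classes would need an explicit allow-list in `find_class` rather than a blanket refusal.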
| strogonoff wrote:
| I certainly hope no one takes my list as an endorsement...
| It's just some formats seen in the wild.
|
| It should be noted (the article does not) that parsing and
| deserialisation is generally a known weak area and a common
| source of CVEs, even when pickling is not used. Being more
| disciplined about it helps, of course.
| ahoka wrote:
| Also, "Don't try to be clever to save a few bits.", like using
| the lower and upper 4 bits of a byte for different things (I'm
| looking at you, ONFI).
| vrighter wrote:
| I've had to do just that to retrofit features I wasn't allowed
| to think about up front (we must get the product out the
| door.... we'll cross that bridge when we get to it)
| teddyh wrote:
| Designing your file (and data) formats well is important.
|
| "Show me your flowcharts and conceal your tables, and I shall
| continue to be mystified. Show me your tables, and I won't
| usually need your flowcharts; they'll be obvious."
|
| -- Fred Brooks
| zzo38computer wrote:
| Consider DER format. Partial parsing is possible; you can easily
| ignore any part of the file that you do not care about, since the
| framing is consistent. Additionally, it works like the "chunked"
| formats mentioned in the article, and one of the bits of the
| header indicates whether it includes other chunks or includes
| data. (Furthermore, I made up a text-based format called TER
| which is intended to be converted to DER. TER is not intended to
| be used directly; it is only intended to be converted to DER for
| then use in other programs. I had also made up some additional
| data types, and one of these (called ASN1_IDENTIFIED_DATA) can be
| used for identifying the format of a file (which might conform to
| multiple formats, and it allows this too).)
|
| I dislike JSON and some other modern formats (even binary
| formats); they often are just not as good in my opinion. One
| problem is they tend to insist on using Unicode, and/or on other
| things (e.g. 32-bit integers where you might need 64-bits). When
| using a text-based format where binary would do better, it can
| also be inefficient especially if binary data is included within
| the text as well, especially if the format does not indicate that
| it is meant to represent binary data.
|
| However, even if you use an existing format, you should avoid
| using the existing format badly; using existing formats badly
| seems to be common. There is also the issue of if the existing
| format is actually good or not; many formats are not good, for
| various reasons (some of which I mentioned above, but there are
| others, depending on the application).
|
| About target hardware, not all software is intended for a
| specific target hardware, although some is.
|
| For compression, another consideration is: there are general
| compression schemes as well as being able to make up a
| compression scheme that is specific for the kind of data that is
| being compressed.
|
| They also mention file names. However, this can also depend on
| the target system; e.g. for DOS files you will need to be limited
| to three characters after the dot. Also, some programs would not
| need to care about file names in some or all cases (many programs
| I write don't care about file names).
| aidenn0 wrote:
| Maybe it's just because I've never needed the complexity, but
| ASN.1 seems a bit much for any of the formats I've created.
| zzo38computer wrote:
| For me too, although you only need to use (and implement) the
| parts which are relevant for your application and not all of
| them, so it is not really the problem. (I also never needed
| to write ASN.1 schemas, and a full implementation of ASN.1 is
| not necessary for my purpose.) (This is also a reason I use
| DER instead of BER, even if canonical form is not required;
| DER is simpler to handle than all of the possibilities of
| BER.)
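The consistent framing mentioned above is what makes partial parsing of DER cheap. A minimal walker, assuming well-formed input and single-byte tags (tag numbers below 31), looks roughly like this:

```python
def read_tlv(buf, pos):
    """Return (tag, start of value, end of value) for the element at pos."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:  # long form: the low 7 bits count the length bytes
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, pos, pos + length

def walk(buf, pos=0, end=None, depth=0):
    """Yield (depth, tag, value); value is None for constructed elements."""
    end = len(buf) if end is None else end
    while pos < end:
        tag, start, stop = read_tlv(buf, pos)
        if tag & 0x20:  # the "constructed" bit: the value is more TLVs
            yield depth, tag, None
            yield from walk(buf, start, stop, depth + 1)
        else:
            yield depth, tag, buf[start:stop]
        pos = stop  # skipping an element never requires understanding it
```

For example, the bytes `30 07 02 01 05 04 02 68 69` decode as a SEQUENCE holding an INTEGER with value byte 05 and an OCTET STRING "hi".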
| fake-name wrote:
| I have a rather idiosyncratic opinion here:
|
| For Open-Source projects, human readable file formats are
| actively harmful.
|
| This mostly is motivated by my experience with KiCad.
| Principally, there are multiple things that the UI does not
| expose at all (slots in PCB footprint files) where the _only_ way
| to add them is to manually edit the footprint file in a text
| editor.
|
| There are some other similar annoyances in the same vein.
|
| Basically, human readable (and therefore editable) file formats
| wind up being a way for some things to never be exposed thru the
| UI. This actively leads to the software being less capable.
| zzo38computer wrote:
| Not exposing things in the UI is not necessarily a problem (it
| depends on the program and on other stuff), although it can be
| (especially if it is not documented). I had not used the
| program you mention, although it does seem a problem in the way
| you mention, although someone who wants to add it into the UI
| could hopefully do so if it is FOSS. However, one potential
| problem is sometimes if it is a text-based format, writing such
| a format (in a way which still remains readable rather than
| messy) can sometimes be more complicated than reading it.
|
| (The TEMPLATE.DER lump (which is a binary file format and not
| plain text) in Super ZZ Zero is not exposed anywhere in the UI;
| you must use an external program to create this lump if you
| want it. Fortunately that lump is not actually mandatory, and
| only affects the automatic initial modifications of a new world
| file based on an existing template.)
|
| However, I think that human readable file formats are harmful
| for other reasons.
| layer8 wrote:
| > However, it's cleaner to have a field in your header that
| states where the first sub-chunk starts; that way you can expand
| your header as much as you like in future versions, with old code
| being able to ignore those fields and jump to the good stuff.
|
| That's assuming that parsers will honor this, and not just use
| the fixed offset that worked for the past ten years. This has
| happened often enough in the past.
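A sketch of the header-offset idea and of the failure mode described above, using a made-up header (4-byte magic, 2-byte version, 4-byte offset of the first chunk). All names and the layout are assumptions for illustration.

```python
import struct

MAGIC = b"DEMO"
BASE = struct.Struct("<4sHI")  # magic, version, offset of the first chunk

def write_file(payload, extra_header=b"", version=1):
    first_chunk = BASE.size + len(extra_header)
    return BASE.pack(MAGIC, version, first_chunk) + extra_header + payload

def read_payload(blob):
    magic, _version, first_chunk = BASE.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO file")
    if not BASE.size <= first_chunk <= len(blob):
        raise ValueError("first-chunk offset out of range")
    # Honor the stored offset instead of assuming the header size we know.
    return blob[first_chunk:]

def read_payload_lazy(blob):
    # The shortcut warned about above: works until the header grows.
    return blob[BASE.size:]
```

One hedge against lazy readers is to ship test files with a padded header from day one, so that a reader hard-coding the offset fails immediately rather than ten years later.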
___________________________________________________________________
(page generated 2025-05-25 23:01 UTC)