[HN Gopher] There are many ways to fail to read a file in a C pr...
       ___________________________________________________________________
        
       There are many ways to fail to read a file in a C program
        
       Author : rcarmo
       Score  : 68 points
       Date   : 2022-08-15 09:38 UTC (13 hours ago)
        
 (HTM) web link (colinpaice.blog)
 (TXT) w3m dump (colinpaice.blog)
        
       | carapace wrote:
       | I've come around to the view that filesystems are an idea whose time
       | has passed, a relic or holdover or atavism from the era of small,
       | slow machines.
       | 
       | What you would like, I think, is something like git (or IPFS)
       | where data is stored as content-addressed blobs and metadata
       | (including filenames and directory structures) are also just
       | blobs in the object store.
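A minimal sketch of the idea above, assuming a toy in-memory store: blobs are keyed by the hash of their content, and a "directory" is itself a blob that maps names to keys. The `BlobStore` class and its method names are hypothetical, chosen for illustration; this is not how git or IPFS lay out their objects.

```python
import hashlib
import json


class BlobStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # hex digest -> raw bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so identical data is stored once.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def put_tree(self, entries: dict) -> str:
        # A "directory" is just another blob: a sorted name -> key mapping.
        encoded = json.dumps(entries, sort_keys=True).encode("utf-8")
        return self.put(encoded)

    def get_tree(self, key: str) -> dict:
        return json.loads(self.get(key).decode("utf-8"))


store = BlobStore()
readme = store.put(b"hello\n")
copy = store.put(b"hello\n")  # same content, same key
root = store.put_tree({"README": readme, "README.bak": copy})

print(readme == copy)
print(store.get(store.get_tree(root)["README"]))
```

Because names live only inside tree blobs, the same content can appear under any number of names or hierarchies without being stored twice.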
        
         | salawat wrote:
         | Filesystems are literally what it says on the tin. It is a
         | _filing system_. Look in library and secretarial annals for the
         | earliest foundational thinking from which computing's idea of
         | filesystems was born. A systemization of behaviors and
         | abstractions that facilitate the organization, addressing, and
         | access of data. Go to any library, or talk to any long time/old
         | school secretary or warden of archived paperwork, and I assure
         | you, they will be happy to extol the virtues of simple or
         | reckonable information storage.
         | 
         | A hierarchical data store comes baked in with an opportunity of
         | implementing topical locality for the end user, which allows
         | you to utilize pathfinding logic baked into your brain to
         | navigate the corpus of information in question. Content
         | addressable stores require praying that the layers of
         | cryptography work, or you have enough understanding of the
         | implementation details and tooling around the store to find
         | what you need.
         | 
         | In short, find | grep being strictly necessary, rather than a
         | fallback, means you've failed at organizing things so your user
         | can understand where the hell something even is, and why it is
         | there.
         | 
         | I assure you, more harm is done by forgetting the fundamental
         | human way of life that computing tries to plaster over, as we
         | inflict impedance mismatch on Users by forcing them to search
         | in a way that makes sense only to the machine, rather than to
         | them.
         | 
         | Sometimes a little less ideal computational performance pays
         | dividends in ease of picking up.
        
           | carapace wrote:
           | I'm old enough to be familiar with non-computerized filing
           | systems myself. But I don't think there's a close match
           | between computer files and directories and old-school
           | hardcopy filing systems.
           | 
           | But that's not really what I'm getting at. I'm more thinking
           | like POSIX API vs. Git plumbing API.
           | 
           | > A hierarchical data store comes baked in with an
           | opportunity of implementing topical locality for the end
           | user, which allows you to utilize pathfinding logic baked
           | into your brain to navigate the corpus of information in
           | question.
           | 
           | Most documents naturally fall into more than one hierarchy,
           | and some "flat" patterns as well (e.g. alphabetical by
           | author).
           | 
           | One of the downsides of computer FS is that they encourage a
           | single name-based hierarchy (although using symlinks or
           | hardlinks you can reference files from several directories.)
           | 
           | The hierarchy can and should be separate from the object
           | store. Then you can also use e.g. Jef Raskin's Zooming UI to
           | organize topical locality, in addition to more traditional
           | UIs like directory trees.
        
             | Joker_vD wrote:
             | Oh for the... we had non-hierarchical file systems already,
             | thank you very much. It's what OS/360 used. It's what Apple
             | Macintosh did (and yes, Macintosh Finder faked the
             | hierarchy on top of it, just as you propose). And they're
             | not gone, Amazon S3 essentially is one.
             | 
             | And I remember that I half-jokingly proposed in some other
             | discussion about file paths to either remove the filenames
             | entirely or at least lift the uniqueness restriction: after
             | all, if you have a GUI, files with the same name don't
             | cause that much trouble.
        
       | edflsafoiewq wrote:
       | By far the biggest problem with reading a file in C is
       | Microsoft's ill-conceived wide-char functions, _wfopen, etc. that
       | produced decades of "ensure the path has no unicode characters"
       | problems. Basically every C/C++ project has a wrapper to fix
       | this. The good news is the bad days may be over soon, thanks to
       | MS moving towards the Unix solution of using UTF8, as well as
       | modern languages like Rust moving this stuff into the stdlib
       | where you can't mess it up.
        
         | mananaysiempre wrote:
         | This problem is as Microsoft-specific as the one in the head
         | article is IBM-specific, C as a language has very little to do
         | with it (especially given that Microsoft very quickly pivoted
         | from supporting portable C to exclusively doing a Windows-
         | specific dialect and then a tacit deprecation in favour of
         | C++).
         | 
         | There are also limits to how far the UTF-8 illusion can go on
         | Windows: while on Unix and friends a path is fundamentally a
         | 0x00-terminated, 0x2F-separated sequence of 8-bit
         | quantities[1], on NT a path is fundamentally a(n unterminated)
         | 0x005C-separated sequence of 16-bit quantities, and Win32 puts
         | a varying number of layers of makeup[2] on that. Thus on Unix
         | you must be prepared to handle invalid UTF-8 in a filename, but
         | can expect to roundtrip any byte sequence (sans 0x00 and 0x2F),
         | and on UTF-8 Win32 you must be prepared to handle arbitrary
         | WTF-8[3] _and_ cannot expect to roundtrip any byte sequence
         | (isolated surrogates can merge, though I don't know if UTF-8
         | Win32 is willing to accept such invalid WTF-8).
         | 
         | Note that Rust does _not_ use the UTF-8 interfaces on Windows
         | (neither does it use the fundamental UNICODE_STRING APIs,
         | however).
         | 
         | [1]
         | https://yarchive.net/comp/linux/case_insensitive_filenames.h...
         | (of course, Linux filesystems have since developed case-
         | insensitive mount options)
         | 
         | [2] https://googleprojectzero.blogspot.com/2016/02/the-definitiv...
         | 
         | [3] https://simonsapin.github.io/wtf-8/
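The "isolated surrogates can merge" remark can be illustrated with Python's codecs. This is a sketch under one loud assumption: Python's `surrogatepass` error handler is used here as a stand-in for WTF-8 and for NT's 16-bit units. It says nothing about what the Win32 *A functions actually accept.

```python
# Two unpaired surrogates, as a sequence of 16-bit units could hold them.
high = "\ud83d"
low = "\ude00"

# 'surrogatepass' lets the UTF-8 codec emit generalized (WTF-8-style) bytes.
wtf8_high = high.encode("utf-8", "surrogatepass")
wtf8_low = low.encode("utf-8", "surrogatepass")

# Going through 16-bit code units, the adjacent halves form one valid pair.
units = (high + low).encode("utf-16-le", "surrogatepass")
merged = units.decode("utf-16-le")

print(wtf8_high, wtf8_low)
print(len(high + low), len(merged))
print(merged == "\U0001F600")
```

Two code points went in and one came out, so a byte-for-byte round trip through the 16-bit representation is lost for this input.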
        
           | nuc1e0n wrote:
           | AFAIK There are no UTF-8 specific interfaces on windows, only
           | the system codepage (*A) and wide character (*W)
           | interfaces. Configuring the system codepage to be utf-8 is
           | possible, but doesn't solve all encoding problems in my
           | experience. To get the commandline without mangling you have
           | to use the wide character function.
           | 
           | Plus, you don't need to be prepared to handle invalid utf-8
           | in filenames on unix, the fopen call can just be made to fail
           | if needed.
        
             | mananaysiempre wrote:
             | Right, by "UTF-8 Win32" I mean the *A Win32 functions (as
             | used by the non-wide functions in the Microsoft C runtime)
             | when a UTF-8 code page is active. Rust uses the *W ones
             | instead.
             | 
             | As for being prepared for invalid UTF-8 on Unix, well, it
             | depends. If you refuse to run with non-UTF-8 LC_CTYPE or to
             | accept invalid UTF-8 in user-provided file names, I suppose
             | that's on you. (Though I sure hope you are not writing an
             | implementation of rm or tar!) If you're trying to erase or
             | move everything in a directory, though, you'll have to
             | either deal with whatever's there or at least recognize
             | that the action may fail.
        
         | zokier wrote:
         | > The good news is the bad days may be over soon, thanks to MS
         | moving towards the Unix solution of using UTF8, as well as
         | modern languages like Rust moving this stuff into the stdlib
         | where you can't mess it up.
         | 
         | It's not all roses on unix or rust side either. In unix
         | filenames are _not_ utf-8, which leads to rust having fun things
         | like OsString.
        
           | jcranmer wrote:
           | On Unix, filenames are _probably_ UTF-8. The kernel doesn't
           | require them to be UTF-8, but if you're stuck trying to
           | display a filename, you have to figure out what charset the
           | filename is in, and UTF-8 is almost certainly the answer.
        
             | jefftk wrote:
             | APIs don't do well with "probably". For example, say you're
             | working with a language (Python, Rust, etc) that
             | distinguishes between utf-8 strings and byte strings.
             | You're making an API for listing a directory: it gives you
             | an array of strings but what kind should they be?
             | 
             | (Python's approach is that os.listdir("/") gives you a
             | List[str] (silently omitting undecodable entries), while
             | os.listdir(b"/") gives you a List[bytes]. That is, if you
             | give the path as a utf-8 string it returns utf-8 strings,
             | otherwise it returns bytes.)
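The str-in/str-out, bytes-in/bytes-out part of this can be checked directly. A small sketch, assuming a temporary directory and a plain ASCII filename (`notes.txt` is made up for the example), so it deliberately does not exercise the undecodable-name case:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Plain ASCII name so the sketch behaves the same on every platform.
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("hi")

    as_text = os.listdir(tmp)                # str in, list of str out
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes in, list of bytes out

print(as_text, as_bytes)
```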
        
               | tialaramex wrote:
               | If it's an API for "listing a directory" the things in it
               | aren't strings they're paths, and Rust indeed gives you
               | Paths here (actually PathBufs in case you want to do
               | stuff to them)
               | 
               | Paths _might_ just be strings, but they aren't
               | necessarily, and since Rust actually cares about types if
               | you want strings you need to write the code to decide
               | what to do about this, even if your "It's not UTF-8" case
               | is just "Give up I can't be bothered".
        
               | jefftk wrote:
               | Yes, Rust in keeping with its aesthetic resolves the
               | "probably" in the most pessimistic direction, and
               | requires you to explicitly say how you want to handle the
               | potential for non-utf8.
        
               | nerdponx wrote:
               | Nitpick: Python strings are not "UTF-8"; they are
               | abstract sequences of Unicode codepoints (and internally
               | CPython stores UTF-32). However, UTF-8 _is_ the default
               | encoding for processing raw bytes received from the
               | outside world and turning them into strings.
               | 
               | That said, I actually didn't realize that os.listdir()
               | silently omits un-decodable entries, which I find mildly
               | alarming. This behavior isn't mentioned in the docs
               | (https://docs.python.org/3/library/os.html#os.listdir)
               | and seems out-of-character for Python, which usually
               | raises an exception by default if data cannot be decoded
               | to text.
               | 
               | Are you sure that this is actually what happens with non-
               | decodable filenames? Reading here
               | (https://docs.python.org/3/glossary.html#term-filesystem-enco...)
               | and here
               | (https://docs.python.org/3/c-api/init_config.html#c.PyConfig....),
               | it suggests that encoding
               | errors should be handled by surrogate escapes by default
               | on non-Windows systems.
        
               | fabioz wrote:
               | Just as a note, the way that Python stores the items
               | isn't always 4 bytes per char, it depends on the actual
               | string contents.
               | 
               | I think that https://rushter.com/blog/python-strings-and-memory/
               | is a nice reference on that.
        
               | jefftk wrote:
               | I'm wrong: https://news.ycombinator.com/item?id=32471944
               | 
               | If it ever did that, it doesn't anymore.
        
               | jefftk wrote:
               | Found some history:
               | https://vstinner.github.io/python30-listdir-undecodable-file...
               | 
               | Includes "Modify os.listdir(str) to ignore silently
               | undecodable filenames, instead of returning them as
               | bytes", but not later work where apparently this was
               | changed to use surrogates.
        
               | jcranmer wrote:
               | Honestly? Convince operating systems to have a switch
               | that enforces UTF-8 path names, and then convince distros
               | to flip that switch by default for new installs. That is
               | to say, we need to move the world from a _probably_ to a
               | _definitely_ state.
               | 
               | The reality is that file names are "stringy" in nature--
               | people expect to be able to display them--and that
               | means you need to have some up-front agreement on how to
               | interpret those strings. In practice, on Unix systems,
               | everyone has generally agreed that this is UTF-8, to the
               | point that trying to not be UTF-8 generally causes
               | interesting breaks in the system. It would be great if we
               | could actually get the operating system to help enforce
               | these rules, rather than placing the blame on other
               | software for not correctly handling situations where the
               | correct solution is itself incredibly ambiguous.
        
               | jefftk wrote:
               | On a system with that switch flipped, what should happen
               | if you plug in a USB drive or untar an archive that has
               | non-utf8 filenames?
        
               | deathanatos wrote:
               | For a legacy/broken FS that permits non-utf8 filenames,
               | and has them: have the FS driver map them into UTF-8 as
               | best it can by making some sort of compromise. E.g., use
               | the PUA to map malformed sequences in/out.
               | 
               | For untar'ing a tar archive: error out by default, but
               | provide a flag or option to permit untarring using some
               | sort of escaping to map the malformed names back into
               | Unicode. I think here I'd map to something printable,
               | though, like "\xnn" or something.
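Both policies for the tar case can be sketched with Python's standard codec error handlers. The function names `printable_name` and `strict_name` are hypothetical; the sketch only shows the decoding step and is not a patch for any real tar implementation.

```python
def printable_name(raw: bytes) -> str:
    """Decode a filename, rendering malformed bytes as a visible \\xnn escape."""
    # 'backslashreplace' is a standard error handler; valid UTF-8 passes through.
    return raw.decode("utf-8", "backslashreplace")


def strict_name(raw: bytes) -> str:
    """The 'error out by default' policy: raise on anything that is not UTF-8."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on malformed input


print(printable_name(b"caf\xe9.txt"))                   # Latin-1 bytes, not UTF-8
print(printable_name("caf\u00e9.txt".encode("utf-8")))  # valid UTF-8 is untouched
```

One design caveat: this escaping is for display and is not reversible on its own, since a file whose real name contains a literal backslash sequence would collide with an escaped one.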
        
               | Joker_vD wrote:
               | While I am all for an OS monoculture, it's still not
               | there yet... and besides, _my_ idea of what OS precisely
               | must be the sole survivor is different from yours, and
               | yours is different from somebody else's, etc.
               | 
               | So I am afraid we can either a) indulge ourselves in
               | wishful thinking, b) actively try to extinguish platforms
               | that don't match our ideals, c) make an effort to be
               | actually cross-platform, and not of the "let's just build a
               | tiny Linux model in a bottle for us to use and pretend
               | the rest of the environment is not there" kind.
        
               | tpolzer wrote:
               | You can use ZFS as a root file system, and it actually
               | has such a switch (called "utf8only").
        
               | naniwaduni wrote:
               | They should be byte (or 16-bit code unit, or whatever)
               | strings. There is no ambiguity here, only incompatibility
               | and delusion.
               | 
               | Rust gets this almost right. Python gets this very wrong.
        
               | jefftk wrote:
               | _> Rust gets this almost right. Python gets this very
               | wrong._
               | 
               | Having worked in both, I'd say they both chose idiomatic
               | solutions:
               | 
               | Rust: I can't prove this is utf-8, so if you want to use
               | it as utf-8 you'll need to tell me what to do if it
               | isn't.
               | 
               | Python: if you're in the common situation where
               | everything is utf-8 and you want to just work with
               | strings, go do the simple thing. Or you can be explicit
               | about wanting to work with bytes, and that's good too.
               | 
               | (Though I think Python should throw an error instead of
               | silently omitting non-utf8 files.)
        
               | nerdponx wrote:
               | I just did a bit of research into this here
               | https://news.ycombinator.com/item?id=32472087
               | 
               | You _can_ actually reconfigure Python to throw an error
               | instead of using a surrogate escape, but only (I think)
               | by changing something at compile time:
               | https://docs.python.org/3/library/sys.html#sys.getfilesystem...
        
               | jefftk wrote:
               | Actually, I think I'm wrong about python's behavior, at
               | least now; it doesn't omit the files, and instead does
               | something with surrogates. On Linux:
               |     >>> b'\xc0'.decode('utf-8')
               |     Traceback (most recent call last):
               |       File "<stdin>", line 1, in <module>
               |     UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
               |     >>> open(b'\xc0', 'w').write('foo')
               |     3
               |     >>> os.listdir()
               |     ['\udcc0']
               |     >>> open(b'\xc0').read()
               |     'foo'
               |     >>> open('\udcc0').read()
               |     'foo'
        
               | naniwaduni wrote:
               | PEP 383[1] is a workaround, but it took a while to get
               | there[2].
               | 
               | [1]: https://peps.python.org/pep-0383/
               | [2]: cf. https://github.com/bup/bup/blob/master/DESIGN#L667-L729
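The PEP 383 mechanism is exposed as the `surrogateescape` error handler, so the round trip can be shown without touching the filesystem. A short sketch; the byte string is made up for the example.

```python
raw = b"report-\xc0\xff.txt"  # not valid UTF-8

# PEP 383: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
back = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(back == raw)

# A strict encode refuses the smuggled surrogates instead of inventing bytes.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```

The resulting `str` is usable as a filename but is not valid Unicode text, which is why printing it or writing it to a strict UTF-8 stream can still fail.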
        
       | atoav wrote:
       | As someone who uses mainly python, rust and js, as well as C on
       | embedded, I recently got around to reading _The C Programming
       | Language_ (2nd edition) and was surprised how _many_ of the
       | language decisions I found extremely horrible. I mean all the
       | examples in the first chapter teach you how to do stuff with
       | strings that will fail the moment you throw Unicode into the mix,
       | and _no_ programmer should use ASCII for any but the lowest level
       | stuff today.
       | 
       | I see this as a relic from older, nobler times and the language
       | is interesting to learn about especially since it is the base of
       | a lot of things, but if C was a sport it would be free climbing,
       | or maybe something even more dangerous that requires a lot of
       | skill that I can't think of now.
       | 
       | In Rust there are many string types (e.g. OsString, CString,
       | String, PathBuf, ..) because the truth is that _you_ need to know
       | the rules that the string your program reads or creates must
       | adhere to, if there is no type system that will enforce those
       | rules for you. The way a properly written program has to deal with
       | strings in the different parts of systems could be explored in an
       | entire programming career.
       | 
       | Similarly Rust tends to make you handle all the errors that
       | could occur with file I/O. This can feel complicated, but it
       | could also serve as a reminder on how many potential errors we
       | don't handle in other programming languages (or at least as a big
       | question mark how these other languages handle or don't handle
       | these errors). Surely you _could_ also ignore those error cases
       | in Rust and just have your program crash, but then it was _your_
       | active decision and not something that hit you out of nowhere
       | like a bag of bricks, with the only realistic option of ignoring
       | it and hoping it will not happen again.
       | 
       | If anything something like Rust gave me a much better
       | understanding why actual good C programs are akin to art.
       | Freeclimbing in a minefield and all that.
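To make the "how many potential errors" point concrete, here is a sketch in Python of a reader that names each failure it expects instead of letting it surface later. The helper `read_text` and its reason strings are hypothetical, and the list of cases is not exhaustive.

```python
import os
import tempfile


def read_text(path: str) -> tuple:
    """Return (contents, None) on success or (None, reason) on failure."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read(), None
    except FileNotFoundError:
        return None, "missing"
    except IsADirectoryError:
        return None, "is a directory"
    except PermissionError:
        return None, "permission denied"
    except UnicodeDecodeError:
        return None, "not valid UTF-8"
    except OSError as exc:  # everything else the OS can report
        return None, f"os error: {exc.strerror}"


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt")
    bad = os.path.join(tmp, "bad.txt")
    with open(good, "w", encoding="utf-8") as f:
        f.write("ok")
    with open(bad, "wb") as f:
        f.write(b"\xc0\xff")  # bytes that are not UTF-8
    results = {
        "good": read_text(good),
        "bad": read_text(bad),
        "missing": read_text(os.path.join(tmp, "nope.txt")),
        "dir": read_text(tmp),
    }

print(results)
```

Which exception a directory produces depends on the platform, so the sketch catches both `IsADirectoryError` and `PermissionError` rather than assuming one.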
        
         | ranger207 wrote:
         | I learned C in a college class where we built a simulated
         | computer from transistors up through assembly before moving to
         | C. From that perspective the K&R C book is fantastically
         | elegant: you can really see why C is sometimes described as a
         | "portable assembly" because it maps closely to assembly
         | instructions and conventions. As a first language above
         | assembly, it's a fantastic language for doing work on limited
         | systems. As a modern application language in the current world
         | of high level abstractions like Unicode and the Internet, it's
         | simply too simple. It was designed for and works relatively
         | well for systems that you understand completely all the way
         | down to the metal.
        
         | anonymoushn wrote:
         | The first program that may have some issue with UTF-8 seems to
         | be on page 18. The trouble with writing a UTF-8 aware
         | "character counting" program is that the definition of
         | "character" is pretty complex. A "correct" program would not
         | fit on one page, and would need to be updated as more emoji
         | ligatures are added to the standard. It would perhaps be good
         | to clarify that "character" means "byte" in this program.
         | 
         | The line counting program on page 19 is correct for UTF-8
         | inputs. The word counting program on page 20 works as specified
         | (the specification says it only uses 3 specific delimiters) for
         | UTF-8 inputs. The digit counting program is correct for UTF-8
         | inputs. The "longest input line" program doesn't really specify
         | what "longest" means, but it finds the one with the most bytes.
         | 
         | There are maybe 2 examples that don't work on UTF-8, if the
         | standard is that "longest line" and "character counting"
         | programs should detect that
         | ":regional_indicator_s::regional_indicator_u:" is 2 characters
         | while ":regional_indicator_u::regional_indicator_s:" is 1
         | character. Such programs may not make a very good introduction
         | to programming though.
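The gap between the possible meanings of "character" is easy to measure. A sketch, assuming a made-up sample string that contains one accented letter and the two regional indicators mentioned above:

```python
flag = "\U0001F1FA\U0001F1F8"  # regional indicators U then S
text = "na\u00efve " + flag + "\nsecond line\n"
data = text.encode("utf-8")

byte_count = len(data)          # what a byte-at-a-time loop counts
code_points = len(text)         # what Python's len() counts
line_count = data.count(b"\n")  # byte-wise line counting is still right

print(byte_count, code_points, line_count)
```

Counting lines by bytes stays correct because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence. Counting what a user perceives as characters (here the flag is one) needs grapheme segmentation, which the Python standard library does not provide.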
        
         | lelanthran wrote:
         | > I recently came around reading The C Programming Language
         | (2nd edition) and was surprised how many of the language
         | decisions I found extremely horrible. I mean all the examples
         | in the first chapter teach you how to do stuff with strings
         | that will fail the moment you throw Unicode into the mix
         | 
         | Define "fail" and define "Unicode".
         | 
         | Does "fail" mean _"iterates through bytes and not
         | characters"_? Does "fail" mean _"can't recognise different
         | encodings of the same 'character'"_?
         | 
         | Does "Unicode" mean UCS2, UTF-16, UTF-32 or UTF-8?
         | 
         | Because, to be honest, quite a lot of 'unicode-aware' languages
         | will "fail" the same way, and they don't have the same excuse
         | as 'strlen()' does, namely being 35 plus years old.
         | 
         | I think the fact that Unicode handling can still be such a mess
         | in applications written on platforms and languages that came
         | decades after C tells me that this is not an easy problem.
         | 
         | C handles UTF-8 byte sequences with the current string
         | functions just fine. You're going to have to manage the mapping
         | between byte sequences and glyphs being displayed to the user,
         | which you're going to have to manage anyway because "Unicode"
         | is so ambiguous. Whatever support a modern language has for
         | Unicode doesn't help all that much when the user-facing glyph
         | isn't part of the language.
         | 
         | What your language thinks is a character and what the end-user
         | thinks is a character are two different things. C is not very
         | different in this regard.
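The "different encodings of the same character" case also trips up a language with Unicode strings. A sketch using Python's `unicodedata` module; the choice of NFC as the common form is an assumption made for the example.

```python
import unicodedata

composed = "\u00e9"     # e with acute as one code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Same glyph on screen, different code points and different byte lengths.
print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalizing both sides to the same form makes them compare equal.
same = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
print(same)
```

The comparison only works once the program picks a normalization form, which is a decision the language does not make for you.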
        
         | rmind wrote:
         | It is very easy to say, retrospectively, in year 2022, that the
         | decisions were horrible. You are talking about the language and
         | decisions made in 1970s. The knowledge was different. The
         | computers were different, their capability was different. Try
         | running your Python on PDP-7, an 18-bit system! ASCII vs EBCDIC
         | is a computer architecture issue and it's unfair to blame the
         | language that it doesn't have automatic/transparent support for
         | EBCDIC (stuff from 1960s, by the way!). Unicode simply didn't
         | exist at that time. And so on and so forth.
         | 
         | On the contrary, I would say C aged really well for a language
         | which was created to support an entire zoo of computers and
         | operating systems. It is worth pointing out that the language
         | has progressed a lot since then and you don't have to deal with
         | many old headaches if you write C on a _modern_ CPU
         | architecture.
        
           | pjmlp wrote:
           | On the surface that explanation might make sense, then we
           | start diving into computer archeology and discovering what
           | was being done outside Bell Labs with NEWP, JOVIAL, ALGOL
           | variants, PL/I variants, BLISS, Mesa, Modula-2, PL.8, Lisp,
           | Fortran,....
           | 
           | Naturally it tends to be forgotten, as most UNIX folks set
           | the genesis of computing world in Bell Labs.
        
           | atoav wrote:
           | I do not disagree with any point you made. Not at all.
           | 
           | I just think an introduction to the language in the year 2022
           | should at least acknowledge that the form of string handling
           | shown in the first part of the book should not be imitated. I
           | can see how those examples would make perfect sense in a
           | different age. Maybe I can give the book the benefit of the
           | doubt as it was published in 2012.
           | 
           | Do you have any book recommendations for a more modern C
           | approach?
        
             | Calavar wrote:
             | The 2nd Edition was published in 1988. I would guess that
             | this 2012 version just adds an extra foreword and some
             | errata?
        
               | atoav wrote:
               | Ah that explains a lot thanks
        
             | SAI_Peregrinus wrote:
             | "Modern C" by Jens Gustedt is an excellent book on a more
             | modern approach to C.
        
           | formerly_proven wrote:
           | > On the contrary, I would say C aged really well for a
           | language which was created to support an entire zoo of
           | computers and operating systems.
           | 
           | This is the case only because the standardized C was more-or-
           | less created as a superset of the many, many C variations
           | that have sprung up until that point. It's also the reason
           | why C leaves so many things up to the implementation or
           | entirely undefined.
           | 
           | Ultimately, this made C a highly portable language, while
           | writing conformant and portable C programs is very difficult.
        
         | adhesive_wombat wrote:
         | > C was a sport it would be free climbing
         | 
         | I think maybe motorcycle racing. Fast, close to the ground, and
         | one seemingly-trivial mistake away from a gruesome result. But
         | also a rush when it goes well, responsive to riders knowing
         | their machines and the terrain inside and out and eligible for
         | lucrative sponsorships.
        
           | jll29 wrote:
           | ...bungee jumping - where you have to make your own string
        
             | adhesive_wombat wrote:
             | I'd say that's more true if you've been handed a strange
             | new micro with a unique architecture and an untried
             | toolchain.
             | 
             | Make your rope as best you can but once you jump it's up to
             | luck and whether the gods are feeling beneficent if you
             | survive, and if you do, if you still have your limbs and
             | retinas attached.
        
         | icedchai wrote:
         | That book is well over 30 years old. Unicode was in the
         | planning stages, but definitely not a thing yet. The decisions
         | were "fine" for the time. This makes me feel even older, since
         | I taught myself C with that book (and another, platform-
         | specific, Amiga book) back when I was a teenager in the late
         | 80's.
        
         | nuc1e0n wrote:
         | By your logic, should z/OS not be used then because it is old
         | and has quirks? EBCDIC is _bizarre_ by modern standards. Non
         | contiguous alphabet for instance.
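The non-contiguous alphabet can be seen with one of the EBCDIC code pages that ship with Python. This sketch assumes `cp037`; z/OS installations may use other EBCDIC variants, so treat the specific values as an example.

```python
import string

# cp037 is one EBCDIC code page shipped with Python; z/OS systems may use others.
letters = string.ascii_lowercase
codes = [ch.encode("cp037")[0] for ch in letters]

# Collect the places where the next letter is not the next code.
gaps = [
    (letters[i], letters[i + 1])
    for i in range(len(letters) - 1)
    if codes[i + 1] != codes[i] + 1
]
print(gaps)

# A range check that works in ASCII accepts a non-letter here.
tilde_code = "~".encode("cp037")[0]
print(codes[0] <= tilde_code <= codes[-1])
```

The practical consequence for C is that idioms such as `c >= 'a' && c <= 'z'` are not a reliable letter test on such a system.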
        
           | Joker_vD wrote:
           | Both Cyrillic and Greek are non-contiguous in Unicode, does
           | it make Unicode bizarre by modern standards as well?
        
         | SimplyUnknown wrote:
         | The first edition of the C programming language was released in
         | 1978, the second version in 1988. The first time something on
         | Unicode was mentioned was also 1988, and the consortium was
         | founded in 1991. UTF-8 was proposed in 1992.
         | 
         | Simply put, the book doesn't deal with alternatives to ASCII
         | because there hardly were alternative text encodings at the
         | time of writing in the western world, which is clearly the
         | focus of an English book written by a Canadian and an American.
         | 
         | Moreover, the point of the book was to propose the C
         | programming language and showcase how the language works. It's
         | not a book on best practices or how to use C in the real world
         | today. There are other resources for that.
        
       | aaaaaaaaaaab wrote:
       | Ew... I guess this is what 60 years of backwards compatibility
       | looks like?
        
         | Joker_vD wrote:
         | This is what trying to heave a 50-year-old, backward-compatible
         | API on top of a completely differently designed, 60-year-old,
         | backward-compatible API looks like.
        
           | lmz wrote:
           | Another example of putting C on top of a non-UNIX base would
           | probably be VMS e.g. all the file options here (found using a
           | search) http://odl.sysworks.biz/disk$axpdocdec971/progtool/de
           | ccv56/5...
        
             | mek6800d2 wrote:
             | But it works well! Those file options are syntactically
             | optional. I worked on VAX/VMS with Fortran for 5 years and
             | then helped develop generic spacecraft control-center
             | software for NASA in C under Unix. In 1992-1993, we ported
             | the system to VAX/VMS for a European Space Agency project.
             | It went very smoothly and quickly, thanks to DEC's largely
             | complete implementation of the C library (including BSD
             | networking), leaving us with plenty of time to develop the
             | project-specific software. I ported over Sun's rpcgen and
             | GNU's flex and cccp; plus bison or an actual yacc. As you
             | showed, some of the C calls have optional parameters; e.g.,
             | strerror() could take an additional argument, a VMS status
             | code, that would provide a more specific error message than
             | just the ERRNO-based messages. However, in almost all
             | cases, the normal Unix call signatures worked as expected.
             | (I did have to come up with a work-around for our one case
             | of a network server using fork().) Thumbs up on VMS and C!
        
           | aaaaaaaaaaab wrote:
           | This sentence says it all:
           | 
           | >"Normal files" have data in EBCDIC
        
       | AlexanderDhoore wrote:
       | C has nothing to do with this. You'll have the same problems
       | reading from Java.
        
       | [deleted]
        
       | bregma wrote:
       | There is one single _standard_ way to read a file in C. There are
       | (potentially infinitely) many _non-standard_ ways using third-party
       | vendored libraries or vendor-specific extensions to the standard
       | C library. This article discusses a small subset of the latter,
       | specific to a single vendor.
        
         | anonymoushn wrote:
         | What is the standard way?
        
           | [deleted]
        
           | [deleted]
        
           | [deleted]
        
           | oogali wrote:
           | I'd wager they are referring to the function named "read".
        
             | anonymoushn wrote:
             | That doesn't seem to be part of the C standard library.
        
               | [deleted]
        
           | phao wrote:
           | Using FILE*, fread, fgets, fscanf, ...
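A minimal sketch of that ISO C route, assuming an ordinary hosted implementation: it opens a text stream with fopen, reads it with fgets, and checks ferror before closing. The helper name count_lines and the 256-byte buffer are made up for illustration and do not come from the article.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts newline-terminated lines in a text file
 * using only ISO C stdio calls. Returns -1 if the file cannot be opened
 * or a read error occurs. A final line without a trailing newline is
 * not counted. */
long count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");   /* "r" opens a text stream */
    if (fp == NULL)
        return -1;

    char buf[256];
    long lines = 0;
    while (fgets(buf, sizeof buf, fp) != NULL) {
        /* fgets returns a partial chunk when a line is longer than the
         * buffer, so only count chunks that end in a newline. */
        size_t len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n')
            lines++;
    }

    int failed = ferror(fp);       /* distinguish read error from EOF */
    fclose(fp);
    return failed ? -1 : lines;
}
```

Calling count_lines("notes.txt") on a file holding three newline-terminated lines should return 3. What the text stream does to the underlying bytes is implementation-defined, which is where platforms such as z/OS can differ.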
        
             | anonymoushn wrote:
             | That is in fact what he does in the article!
        
               | phao wrote:
               | Right, but is he relying on the standard (the C language
               | specification is an ISO standard, btw) behavior
               | guaranteed by the language? Or is he talking about some
               | implementation-specific behavior that also happens to
               | go by the names fopen, fread, etc.?
               | 
               | For C programmers, talking about "standard" implies a
               | quite particular meaning.
        
               | anonymoushn wrote:
               | I did not purchase a C standard from ISO, but a draft
               | specifies that text streams and binary streams are both
               | supported, and that text streams may perform all sorts of
               | implementation-defined destruction on your data. Some
               | small part of the article seems to be related to this.
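To make the text/binary distinction concrete, here is a small sketch that opens a file as a binary stream ("rb") and counts the bytes fread hands back. In binary mode the standard library is not supposed to translate line endings or other characters, whereas a text stream may. The helper name count_bytes and the buffer size are assumptions for illustration only.

```c
#include <stdio.h>

/* Hypothetical helper: returns the number of bytes in a file when read
 * through a binary stream, or -1 on open/read failure. */
long count_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");  /* "b" requests a binary stream */
    if (fp == NULL)
        return -1;

    unsigned char buf[4096];
    long total = 0;
    size_t n;
    /* fread returns fewer items than requested at EOF or on error. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : total;
}
```

On Unix-like systems "r" and "rb" behave the same, so the difference only shows up on platforms where text streams actually transform data.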
        
               | phao wrote:
               | About drafts...
               | 
               | Iirc, the last draft before final publication is free and
               | it's just as good.
               | 
               | As another user replied, there is more going on in the
               | post than what is specified, guaranteed, etc, by the
               | standard.
               | 
               | The practice of C programming in an actual system using
               | non-standard things is important. Also, the C language
               | does have its problems, even within the standard.
               | However, pinning on C the problems of a not-so-helpful
               | implementation, library, system, etc., is unfair and
               | unhelpful, I believe.
        
               | bregma wrote:
               | The streams discussed in the article are neither text nor
               | binary. They're record-oriented files, which are not
               | supported by the C language standard. Operations on
               | record-oriented files are a vendor extension that works
               | however the vendor says it works.
               | 
               | On the other hand, record-oriented files work just peachy
               | with ISO standard COBOL.
        
       | h2odragon wrote:
       | Alternate title: "I've chosen to use a C stdlib which sucks on
       | this OS"
       | 
       | I dunno z/OS, perhaps everything sucks that bad there. I strongly
       | suspect that there are alternate interfaces available that hide
       | this complexity from those afraid to trip over it.
        
         | Joker_vD wrote:
         | I'd say it's just vastly different, and some rather basic
         | assumptions C makes about the environment (basically that it's
         | sorta kinda UNIX-y if you squint hard enough) simply don't
         | hold.
         | 
         | You can still see the UNIX-centric point of view in stdlibs of
         | other languages: I am particularly amused by Golang's "os"
         | package. It's kinda-sorta supposed to be portable and OS- and
         | platform-independent, but it's designed for POSIX-likes first
         | which is why one has to pass 0666 or whatever as permissions
         | when trying to open a file on Windows (even though it is
         | completely ignored).
        
           | salawat wrote:
           | >it's designed for POSIX-likes first which is why one has to
           | pass 0666 or whatever as permissions when trying to open a
           | file on Windows
           | 
           | <Snicker>
           | 
           | Well. That's amusing. So every time one does file access
           | through Go on a Windows machine, one invokes the number of
           | the Beast, eh?
           | 
           | Apropos af if true. Also hilarious if it just by chance
           | worked out that way.
        
             | Joker_vD wrote:
             | Well, as I said, you can pass pretty much whatever, none of
             | those bits (except the owner-writeable bit) do anything
             | because Windows uses a fundamentally different model of
             | access control. And no, Golang's os.Create() passes 0666 to
             | os.OpenFile [0], just as fopen(3) passes 0666 to creat(2)
             | [1] _on Linux_ as well.
             | 
             | [0] https://cs.opensource.google/go/go/+/refs/tags/go1.19:s
             | rc/os...
             | 
             | [1] https://linux.die.net/man/3/fopen
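A rough POSIX-only sketch of what that 0666 means in practice: the requested mode is filtered through the process umask, so the file normally ends up as 0644 or similar. It calls open() directly rather than fopen() so that the mode argument is visible. The helper name create_and_get_mode is hypothetical, and the result assumes a filesystem that honours Unix permission bits.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates `path` requesting mode 0666 under the
 * given umask and returns the permission bits the file actually got,
 * or -1 on failure. If the file already exists its mode is left alone,
 * so callers should start from a path that does not exist. */
int create_and_get_mode(const char *path, mode_t mask)
{
    mode_t old = umask(mask);      /* temporarily install the mask */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    umask(old);                    /* restore the previous mask */
    if (fd < 0)
        return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    if (rc != 0)
        return -1;
    return (int)(st.st_mode & 0777);   /* keep only permission bits */
}
```

With a umask of 022 the call is expected to yield 0644, and with 077 it is expected to yield 0600, since the kernel applies `0666 & ~umask`.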
        
       | ale42 wrote:
       | There's no intro about it, but this looks related to IBM
       | mainframes...
        
         | jgtrosh wrote:
         | Yeah, it's tagged under "Z/OS". Incomprehensible without
         | context.
        
       | nuc1e0n wrote:
       | So what is the best practice for reading files from z/OS and
       | OMVS? Should you first check whether the file is binary or text
       | and, if it's text, what encoding it is in and what the record
       | length is (if there is one), then open the file as binary and
       | write wrappers for fread and fgets that do the necessary
       | conversions to UTF-8 and Unix newlines yourself?
       | 
       | BTW, fixed length records? Way old school and yuck. Just use an
       | index into the file with lengths already. What about all the
       | wasted space from records that aren't full?
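One way such a wrapper might look, as a sketch rather than a z/OS recipe: read fixed-length records from a binary stream with fread and strip the trailing padding. The record length of 8, the name read_record, and the assumption that padding is an ASCII space are all invented for illustration. With EBCDIC data the blank is 0x40 and every byte would also need translating, which this sketch does not attempt.

```c
#include <stdio.h>

#define RECLEN 8   /* assumed record length, for illustration only */

/* Hypothetical wrapper: reads the next fixed-length record from a
 * binary stream, trims trailing ASCII spaces, and NUL-terminates it.
 * `out` must have room for RECLEN + 1 bytes. Returns 1 if a full
 * record was read, 0 on EOF, error, or a short final record. */
int read_record(FILE *fp, char out[RECLEN + 1])
{
    if (fread(out, 1, RECLEN, fp) != RECLEN)
        return 0;

    size_t len = RECLEN;
    while (len > 0 && out[len - 1] == ' ')
        len--;                     /* drop the padding */
    out[len] = '\0';
    return 1;
}
```

A caller would open the file with "rb" and loop on read_record until it returns 0, emitting each trimmed record followed by a newline if Unix-style text is wanted.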
        
         | Joker_vD wrote:
         | Well, what about all the wasted space from disk clusters that
         | aren't full? IIRC, z/OS actually packs files densely, unlike
         | cluster-based FSs.
        
       | planede wrote:
       | This is the first time I saw "keyword arguments" in fopen's mode
       | argument. I'm not a fan.
       | 
       | https://www.ibm.com/docs/en/zos/2.4.0?topic=functions-fopen-...
        
         | mananaysiempre wrote:
         | Glibc uses those as well[1], ccs=ENCODING opens a file in
         | ENCODING for use with wchar_t functions ("CCS" for "coded
         | character set" being prehistoric standardese for what I
         | colloquially called an encoding here).
         | 
         | I'm not a fan, either, FWIW, but those are the most frequent
         | answer I've seen as to why fopen() accepts a string and not a
         | set of flags like open(). Might be a post hoc rationalization,
         | though; I'm willing to believe it was originally just a hack
         | for conciseness, for example.
         | 
         | [1]
         | https://www.gnu.org/software/libc/manual/html_node/Opening-S...
        
           | [deleted]
        
       | nemetroid wrote:
       | This seems to be about the z/OS C API.
        
       | nuc1e0n wrote:
       | Man, IBM mainframe stuff is pretty messed up. Attribute splitting
       | much?
        
       ___________________________________________________________________
       (page generated 2022-08-15 23:03 UTC)