[HN Gopher] Show HN: A nice C string API
       ___________________________________________________________________
        
       Show HN: A nice C string API
        
       A convenient C string API, friendly alongside classic C strings.
        
       Author : mickjc750
       Score  : 88 points
       Date   : 2022-12-03 12:31 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | the-printer wrote:
       | Are there any good resources that explain the concept of strings in
       | C, particularly why they're considered to be so difficult to
       | manage? I'm interested in the language, and that along with its
       | safety concerns seem to be the two most frequent complaints
       | against it that I read about online.
        
         | Quentak wrote:
         | I have written a short article explaining why null terminated
         | strings as they exist in C cannot represent proper ASCII and
         | UTF-8 because of the null terminator. It's not a full
         | explanation of how strings work but it might be helpful for
         | you.
         | 
         | https://kttnr.net/blog/null-terminated-strings-are-incorrect...
        
           | lifthrasiir wrote:
           | SQLite does store a null character in strings, it has lots of
           | _documented_ [1] issues in the API level though.
           | 
           | [1] https://www.sqlite.org/nulinstr.html
        
             | Quentak wrote:
             | Thanks for this link.
             | 
             | How do you get the null byte into the string? Is it through
             | casting blob to string? The way I have encountered this is
             | when using the C API in which string arguments for prepared
             | statements are passed as char pointers. If those contain
             | the null byte then the string is cut off.
             | 
             | Allowing null characters and then mishandling them is worse
             | than not allowing them.
        
         | e-dant wrote:
         | C strings are pointers to memory. There are semantics and
         | assumptions encouraging null-character delimited strings, but
         | not every API follows those rules (just got done working with a
         | Windows API that doesn't).
         | 
         | Often, you have to both null-delimit your string and store its
         | length somewhere. That's the dangerous part. Messing either of
         | those up, or passing your string to an API that messes that up,
         | is not safe.
         | 
         | C strings are pointers to memory, either the stack or the heap,
         | and follow exactly the same rules as everything else in that
         | chaotic space: Not many.
        
           | the-printer wrote:
           | Thank you for this. C programming sounds almost like some
           | sort of combat sport. Riveting.
        
             | lelanthran wrote:
             | > Thank you for this. C programming sounds almost like some
             | sort of combat sport. Riveting.
             | 
             | I've done it for decades; it isn't really as bad as hype-
             | attracting headlines would have you believe.
             | 
             | Munitions control, aircraft management systems, industrial
             | automation systems, and many more life-critical systems
             | were programmed in C for decades with comparatively little
             | danger from the language intrinsics leading to death.
             | 
             | It's easy to look at the stats and say "there's a few dozen
             | CVEs annually due to C footguns", but that's a few dozen
             | out of hundreds of millions of deployed systems that are
             | written in C.
             | 
             | In practice, very few lines of C code bypass the type
             | system, so you get much fewer bugs than an equivalent
             | system in the more usual dynamic programming languages
             | (Python, Javascript, etc).
        
               | thesnide wrote:
               | Wondering if the big influx of C-derived CVEs comes from old or
               | new code. If it is new code, I'm also wondering about the
               | brain damage that those safe languages cause.
               | 
               | Yes, it is better to have memory safe languages. But it
               | encourages sloppiness as "nothing can happen". Then those
               | folks aren't fit to write anything else. Which closes the
               | feedback loop on inefficient but safe languages.
               | 
               | Which becomes the same thing in airplanes. Pilots don't
               | really know how to fly without instruments anymore.
        
               | wadd1e wrote:
               | >Which becomes the same thing in airplanes. Pilots don't
               | really know how to fly without instruments anymore.
               | 
               | Well that's just a blatantly wrong generalisation you
               | made there, curious as to where you got that from.
               | Consider looking up how pilot training is done before
               | making such assumptions. Even though modern airplanes
               | make heavy use of technology, there are emergency
               | scenarios where lots of instruments may not work, and
               | pilots receive more than enough training to fly an
               | airplane in that scenario just to give one example among
               | tons of others.
               | 
               | edit: grammar
        
             | marssaxman wrote:
             | More like fire-performance: it _looks_ dangerous, and it
             | does require some finesse, but it's really satisfying when
             | you get in the flow, and burns are both less frequent and
             | less serious than you might imagine as an onlooker.
        
             | pjmlp wrote:
             | It is more like a combat sport, doing martial art moves,
             | while trying to juggle knives between moves.
        
         | jstimpfle wrote:
         | "Strings" are quite an abstract concept. They are a linear
         | sequence of characters. But there are a number of ways to
         | represent them - the simplest of which is a contiguous memory
         | allocation, but depending on the use case you'd need more
         | complex schemes. There are also different ways to do the
         | necessary memory management (e.g. allocate statically at
         | compile time vs dynamically at run time).
         | 
         | One of the most complex representations is probably the string
         | rope datastructure - a balanced tree of string chunks,
         | supporting efficient insertion and removal anywhere in the
         | string.
         | 
         | Specific to C, as well as lots of low-level APIs, is only that
         | strings are often expected to be contiguously laid out in
         | memory and terminated with a NUL (0) byte. So you need to make
         | sure that you always terminate with a NUL after writing to
         | string storage.
         | 
         | Other than that, strings aren't any harder than other aspects
         | of programming with manually managed memory.
         | 
         | Maybe motivated from higher-level dynamic or managed languages,
         | is the popular idea that strings should always be allocated
         | dynamically (like std::string for example), and support
         | operations like string-append with automatic reallocation if
         | the currently allocated memory isn't enough to store the new
         | string.
         | 
         | In practice, that's not true at all - unless you are in a
         | domain where lots of small intermediate strings are generated.
         | This is pretty inefficient anyway and there is likely no point
         | to use C in this case.
         | 
         | By far most strings in most domains are either completely
         | static (use string literals), or are created once in a sequence
         | of append operations and then never changed again. I get by,
         | doing many different things from GUI apps to networking to
         | parsers and interpreters, without any sophisticated string
         | type. All I do is define some printf-like APIs to do logging,
         | for example. Those typically just use a fixed size buffer for
         | the formatting, and then flush that buffer to e.g. stderr. _or_
         | flush it to a dynamically allocated memory buffer, but there
         | almost never is a need to reallocate that string later.
        
         | WalterBright wrote:
         | > why they're considered to be so difficult to manage?
         | 
         | Back in the 90s, I was very experienced with C strings and
         | managing them. Then I chanced to look at BASIC again, and
         | realized that strings in BASIC were so simple and intuitive.
         | Why couldn't C be like that? When I started on the design of D,
         | I decided that it had to make strings as easy to do as BASIC
         | did.
         | 
         | And D does.
         | 
         | The trouble with C strings is the 0 termination of them. This
         | means:
         | 
         | 1. to get the length of the string, you have to scan it. This
         | is expensive.
         | 
         | 2. when manipulating strings, a common error is to get off by
         | one in the storage because of the 0 termination
         | 
         | 3. you cannot get a subset of the string without making a copy.
         | Not only is a copy expensive, but then you have to keep track
         | of the memory for it
         | 
         | 4. there's no way to check for buffer overflows
         | 
         | D's design, which uses a phat pointer (length, ptr) for
         | strings, solves these problems.
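
A minimal sketch of the (length, ptr) layout described above, written in C. The `str_view` type and the `view_*` helper names are illustrative assumptions, not D's implementation or any existing library's API. It shows how the length becomes O(1), how a sub-range can be taken without copying, and where a bounds check can live.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fat-pointer string view: a length plus a pointer.
   The bytes are not owned and need not be NUL-terminated. */
typedef struct {
    size_t length;
    const char *ptr;
} str_view;

/* Wrap a NUL-terminated C string; this is the only place the string is scanned. */
static str_view view_from_cstr(const char *s)
{
    str_view v = { strlen(s), s };
    return v;
}

/* Take the sub-view [start, end) without copying; the bounds are checked. */
static str_view view_slice(str_view v, size_t start, size_t end)
{
    assert(start <= end && end <= v.length);
    str_view out = { end - start, v.ptr + start };
    return out;
}

/* Compare two views by length first, then by content. */
static int view_equal(str_view a, str_view b)
{
    return a.length == b.length && memcmp(a.ptr, b.ptr, a.length) == 0;
}
```

A slice made this way borrows the original bytes, so it stays valid only as long as the underlying buffer does.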
        
           | tored wrote:
           | A year ago I picked up the BASIC dialect PureBasic. Pleasant
           | surprise actually, the syntax of the PureBasic dialect is a
           | bit archaic, but if you accept that it is much easier and
           | faster to get anything done compared to C (and C++).
           | Personally I find low level topics easier to grok in
           | PureBasic than in C even though they mirror the same
           | concepts. PureBasic has Unicode strings built in.
           | 
           | It is a bit of a shame that BASIC has such a bad reputation; there
           | are many BASIC dialects that still do the job well today.
           | 
           | https://www.purebasic.com/documentation/reference/ug_string..
           | ..
        
         | Someone wrote:
         | The problem is that C doesn't have strings; it has functions
         | that treat sequences of non-zero bytes followed by a zero byte
         | as if they are strings.
         | 
         | So, you can't ask it to create a string that contains the
         | result of appending a string to another one. If you want to
         | append two 'strings', you have to create a buffer large enough
         | to hold the result, and then copy in the two sequences of
         | bytes. And even for doing that, the library functions aren't
         | optimal. The basic "append this string's data to that string,
         | assuming there's enough space to do so" function is _strcat_.
         | It walks the first string to find the zero byte, but to "create
         | a buffer large enough to hold the result" you already must do
         | that.
         | 
         | See for example
         | https://stackoverflow.com/questions/21880730/c-what-is-the-b...
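
One way to avoid the second scan described above is to keep the lengths that were computed for sizing the buffer and copy with `memcpy` instead of `strcat`. This is a sketch only; `concat_alloc` is a hypothetical helper name, and it assumes the combined length fits in `size_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of a followed by b,
   or NULL if the allocation fails. The caller frees the result. */
static char *concat_alloc(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* Each input is scanned exactly once, to size the buffer. */
    size_t la = strlen(a);
    size_t lb = strlen(b);

    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;

    /* The known lengths are reused, so nothing is walked a second time. */
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);   /* lb + 1 also copies b's terminating NUL */
    return out;
}
```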
        
           | jstimpfle wrote:
           | You can use snprintf to easily achieve any concatenation
           | you'd like.
           | 
           |     len = snprintf(buffer, buffersize, "%s%s%d", string_1, string_2, int_1);
           |     if (len + 1 /*NUL*/ > buffersize)
           |     {
           |         // not enough space
           |     }
           | 
           | You can also use this to dynamically allocate any formatted
           | string
           | 
           |     len = snprintf(NULL, 0, ....);
           |     buffersize = len + 1;  /* NUL */
           |     buffer = allocate(buffersize);
           |     snprintf(buffer, buffersize, ... /*same args as before*/);
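
The two-pass pattern above can be wrapped into a single function. The sketch below is one possible shape, assuming a C99 `vsnprintf`; the name `format_alloc` is hypothetical. The first pass measures the output with a NULL buffer, the second pass writes into a buffer of exactly the measured size.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a newly allocated formatted string,
   or NULL on a formatting or allocation error. The caller frees it. */
static char *format_alloc(const char *fmt, ...)
{
    assert(fmt != NULL);

    va_list args, args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args);   /* a va_list cannot be reused after a pass */

    /* Pass 1: measure. Returns the length without the NUL, or < 0 on error. */
    int len = vsnprintf(NULL, 0, fmt, args);
    va_end(args);

    char *buffer = NULL;
    if (len >= 0) {
        buffer = malloc((size_t)len + 1);   /* + 1 for the NUL */
        if (buffer != NULL) {
            /* Pass 2: format into the exactly sized buffer. */
            vsnprintf(buffer, (size_t)len + 1, fmt, args_copy);
        }
    }
    va_end(args_copy);
    return buffer;
}
```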
        
             | tom_ wrote:
             | I like the printf family too. Any time you're doing a bunch
             | of strcat or whatever it's almost always massively easier
             | to use a format string to get the same result. Very easy to
             | get the desired width/precision/alignment, and if you need
             | numbers, printf has your back. It even does the bounds
             | checking for you! (And how often do you get _that_ in C.)
             | 
             | It won't be as fast, but it's almost always not a problem,
             | and the nice thing about C and C++ is that the char-by-char
             | route is still available when it is.
             | 
             | I like to use asprintf, when available:
             | https://man7.org/linux/man-pages/man3/asprintf.3.html - and
             | when not available, I add it, along the lines of the
             | snippet you present.
             | 
             | Here's something I've found a useful upgrade to asprintf,
             | as it frees the passed-in buffer after expanding the format
             | string. You can just pass the same char ** repeatedly and
             | it'll update the char * appropriately each time.
             | 
             |     int xasprintf(char **p, const char *fmt, ...) {
             |         int n = 0;
             |         char *p2 = NULL;
             |         if (fmt) {
             |             va_list v;
             |             va_start(v, fmt);
             |             /* vasprintf is the va_list variant of asprintf */
             |             n = vasprintf(&p2, fmt, v);
             |             va_end(v);
             |         }
             |         if (n >= 0) {
             |             free(*p);
             |             *p = p2;
             |         }
             |         return n;
             |     }
        
       | cassepipe wrote:
       | I always use antirez's (Redis creator) `sds` and advertise it
       | whenever I get the chance. Thanks to whoever recommended
       | it on HN some years ago. It's a joy to use.
       | 
       | https://github.com/antirez/sds
       | 
       | The trick is the size is hidden before the address of the
       | buffer. ("Learn this one simple trick that will change your life
       | for ever").
       | 
       | From the Readme:
       | 
       | ```
       | 
       | Advantage #1: you can pass SDS strings to functions designed for
       | C strings without accessing a struct member or calling a
       | function
       | 
       | Advantage #2: accessing individual chars is straightforward.
       | 
       | Advantage #3: single allocation has better cache locality.
       | Usually when you access a string created by a string library
       | using a structure, you have two different allocations for the
       | structure representing the string, and the actual buffer holding
       | the string. Over the time the buffer is reallocated, and it is
       | likely that it ends in a totally different part of memory
       | compared to the structure itself. Since modern programs
       | performances are often dominated by cache misses, SDS may perform
       | better in many workloads.
       | 
       | ```
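
To make the "size hidden before the address" idea concrete, here is a simplified sketch. It is not SDS's actual header layout or API; the `lp_*` names and the single `size_t` header are assumptions for illustration. The caller only ever sees a `char *` to the bytes, and the helpers step backwards from it to reach the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
typedef struct {
    size_t len;
    char data[];   /* flexible array member (C99) */
} lp_header;

/* Allocate header + bytes + NUL as one block and return a pointer to the
   bytes, so the result can be handed to functions expecting a C string. */
static char *lp_new(const char *init)
{
    assert(init != NULL);
    size_t len = strlen(init);
    lp_header *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->data, init, len + 1);
    return h->data;
}

/* O(1) length: step back from the bytes to the header. */
static size_t lp_len(const char *s)
{
    const lp_header *h =
        (const lp_header *)(const void *)(s - offsetof(lp_header, data));
    return h->len;
}

/* Must be used instead of free(), because s is not the start of the block. */
static void lp_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(lp_header, data));
}
```

Passing such a pointer to plain `free()` would hand the allocator an address it never returned, which is the kind of mistake several replies in this thread worry about.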
        
         | schemescape wrote:
         | That sounds like the same allocation scheme as used in
         | Microsoft's BSTR type: https://learn.microsoft.com/en-
         | us/previous-versions/windows/...
        
         | kazinator wrote:
         | Thus, sds cannot be used for the use cases that this library
         | allows.
         | 
         | This library takes string slices without having to allocate or
         | copy memory; it seems to be for use cases involving breaking
         | down strings in complex ways, where good ergonomics and
         | efficiency of obtaining a null-terminated C string are
         | secondary.
        
         | WalterBright wrote:
         | > The trick is the size is hidden before the address of the
         | buffer. ("Learn this one simple trick that will change your life
         | for ever").
         | 
         | The length-prefix string has a major problem - it cannot be
         | sliced to produce another length-prefix string. It has to be
         | copied. Instead, using a phat pointer (size_t length, char*
         | ptr) works very, very well. We've been using it in D for 20
         | years.
         | 
         | I've proposed it for C, too:
         | 
         | https://www.digitalmars.com/articles/C-biggest-mistake.html
        
           | torstenvl wrote:
           | > _it cannot be sliced to produce another length-prefix
           | string_
           | 
           | Come again? Of course it can. It can't be done _in place,_
           | mind you, but that's a pretty bad way to do any string
           | slicing, regardless of implementation, in a manual memory
           | management environment. Do most programmers expect their
           | slices to result in undefined behavior if they release the
           | larger string they were made from? I doubt it.
        
             | alcover wrote:
             | > Come again? Of course it can.
             | 
             | Oh come on.. I'm pretty sure Walter meant taking a _view_
             | kind of slice. Obviously one can always copy part of a
             | string, but that's not what _slice_ implies I think.
             | 
             | > It can't be done in place, mind you, but that's a pretty
             | bad way to do any string slicing, regardless of
             | implementation, in a manual memory management environment.
             | 
             | It's not bad. It's the best, most efficient way. O(1)-ish.
             | 
             | > Do most programmers expect their slices to result in
             | undefined behavior if they release the larger string they
             | were made from? I doubt it.
             | 
             | That's what copy-on-write is for : release of the parent is
             | blocked until no views are left on it.
             | 
             | I made a C String lib using CoW. It works well:
             | 
             | https://github.com/alcover/buffet
        
           | quelsolaar wrote:
           | The problem with arrays is not that they decay to pointers,
           | it's that they aren't pointers to begin with. This:
           | 
           | int x[10];
           | 
           | Should mean "put 10 ints in memory, and make x the pointer to
           | it.". The thing that messes this up is sizeof. sizeof(x)
           | doesn't give the size of the pointer like it should, it gives
           | you the size of the array. If that was fixed (obviously it can't
           | be without breaking everything) then things would be much better
           | and more consistent.
        
             | WastingMyTime89 wrote:
             | Yes, _sizeof_ as a primitive has a weird behaviour when
             | used in the scope of a contiguous allocation. I agree it's
             | unfortunate.
             | 
             | But "arrays" definitely are pointers. I put "arrays" in
             | quote because C has nothing I would personally call an
             | array. It's just contiguous memory allocation. It's to the
             | point that _10[a]_ and _a[10]_ are desugared to the same
             | thing.
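
The behaviours being debated in the two comments above can be observed directly. The sketch below uses hypothetical function names and only reports what a conforming C compiler does: `sizeof` on an array object yields the whole array's size, the same array seen through a function parameter is a pointer, and `a[10]` and `10[a]` name the same element.

```c
#include <assert.h>
#include <stddef.h>

/* Here x is an array object, so sizeof yields the size of all ten ints. */
static size_t size_of_local_array(void)
{
    int x[10];
    return sizeof x;
}

/* When an array is passed to a function it arrives as a pointer,
   so sizeof yields the size of the pointer instead. */
static size_t size_seen_by_callee(int *x)
{
    assert(x != NULL);
    return sizeof x;
}

/* a[i] is defined as *(a + i); addition commutes, so i[a] is the same element. */
static int index_both_ways_agree(void)
{
    int a[11] = {0};
    a[10] = 42;
    return a[10] == 10[a] && &a[10] == a + 10;
}
```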
        
         | masklinn wrote:
         | > The trick is the size is hidden before the address of the
         | buffer. ("Learn this one simple trick that will change your life
         | for ever").
         | 
         | This has drawbacks:
         | 
         | 1. you can't convert an existing buffer to an sds buffer
         | 
         | 2. you can't slice into a buffer, because the metadata is part
         | of the string's buffer (even if it's before the pointer)
        
         | lifthrasiir wrote:
         | I don't like SDS for multiple reasons. My biggest complaint is
         | that it's a data structure disguised as a single naive pointer,
         | which is actually harder to use correctly. This kind of
         | "masquerading" pointer is conceptually a linear type, as you
         | can't safely change its length in place and any potential
         | change has to return the modified pointer somehow. No other
         | type in C behaves like this, resulting in more confusion and
         | thus more errors. And I have more counterpoints to those self-
         | claimed advantages as well:
         | 
         | Counterpoint #1: You can't pass SDS strings to functions that
         | accept `char **` (which is a common way to return a string of
         | unknown length, and often can act as an in-out parameter as
         | well).
         | 
         | Counterpoint #2: You rarely access individual "characters"
         | (whatever this means). It is a conscious decision whether
         | you should iterate over bytes or Unicode scalar values or code
         | points or grapheme clusters, and for this reason it is better
         | to make the decision explicit even though it's C `char` in the
         | surface level.
         | 
         | I have no evidence for nor against advantage #3 though.
        
           | kevin_thibedeau wrote:
           | > No other type in C behaves like this,
           | 
           | Malloc does this.
        
             | lifthrasiir wrote:
             | Malloc is not a type. If you meant to say realloc, a good
             | point and it's indeed a bad interface for the exact reason
             | but still it's not a type.
        
               | Karellen wrote:
               | I believe malloc() was intended, as a number of old-
               | school UNIX implementations of malloc() put the size of
               | the allocation (and possibly other bookkeeping info?) "in
               | front of" the pointer returned, in a similar way to how
               | sds stores the size of its buffer.
        
             | [deleted]
        
         | realgeniushere wrote:
         | Makes me think less of antirez that he doesn't acknowledge that
         | this is the same design as Microsoft's BSTRs, which predate sds
         | by many many years.
        
         | gorgoiler wrote:
         | Whoa. Jamming metadata in the address space before the string
         | pointer is such a clever idea. I don't know enough about C to
         | know how many awkward bugs this might cause, but I know enough
         | about programming to spot exceptional lateral thinking when I
         | see it. Very neat.
         | 
         | I guess the SDS authors might ship a linter to spot all the
         | times you mistakenly use free() instead of sdsfree()? That
         | could make the cleverness more tolerable?
        
           | arcticbull wrote:
           | This is a common approach for things like malloc to use,
           | since you are passing an opaque pointer to arbitrary data
           | into free() which you then expect to quickly do something
           | useful with. It can just walk back the pointer a little to
           | find the header and act on it.
           | 
           | It's pretty weird to see it anywhere other than malloc though
           | especially masquerading as a basic type. It's incompatible
           | with other common patterns like returning via (char *) and
           | you can't identify which deallocator you're supposed to give
           | the result to from the type alone.
        
         | quelsolaar wrote:
         | Optimizing text is hard because you seldom know up front how
         | much memory will be needed and allocations are slow.
         | 
         | Another way to do it is to use:
         | 
         | typedef struct {
         |     size_t allocated;
         |     size_t used;
         |     char buffer[];
         | } String;
         | 
         | This lets the header and the string be the same allocation.
         | That's a huge saving. It's also useful to store the allocated
         | and used sizes separately so that you can reuse / modify buffers.
         | The used field lets you use memcpy without looking for string
         | termination.
         | 
         | You can make it even more complex by adding flags if the string
         | is on the stack or on the heap. That way you can do things
         | like:
         | 
         | String buffer = MACRO_TO_CREATE_STRING_BUFFER_ON_STACK(256),
         | *b; b = &buffer;
         | 
         | b = do_processing_with_buffer(&b); // allocates on heap a
         | larger buffer if needed
         | 
         | string_free(b); // frees buffer if it's on heap.
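
A sketch of how the single-allocation struct above might be used, covering only the heap case and leaving out the stack/heap flag. `string_create` and `string_append` are hypothetical names and the doubling growth policy is an assumption. Because the header and the bytes share one block, growing the string can move the whole thing, so the append function returns the possibly relocated pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t allocated;   /* bytes available in buffer, not counting the NUL */
    size_t used;        /* bytes currently in use, not counting the NUL */
    char buffer[];      /* flexible array member: same allocation as the header */
} String;

/* Allocate an empty string with room for capacity bytes plus a NUL. */
static String *string_create(size_t capacity)
{
    String *s = malloc(sizeof *s + capacity + 1);
    if (s == NULL)
        return NULL;
    s->allocated = capacity;
    s->used = 0;
    s->buffer[0] = '\0';
    return s;
}

/* Append n bytes. Returns the (possibly moved) string, or NULL on
   allocation failure, in which case the original is left untouched. */
static String *string_append(String *s, const char *data, size_t n)
{
    assert(s != NULL && data != NULL);
    if (n > s->allocated - s->used) {
        size_t new_cap = (s->used + n) * 2;   /* assumed growth policy */
        String *grown = realloc(s, sizeof *grown + new_cap + 1);
        if (grown == NULL)
            return NULL;
        grown->allocated = new_cap;
        s = grown;
    }
    /* The used field gives the write position, so no scan for the NUL. */
    memcpy(s->buffer + s->used, data, n);
    s->used += n;
    s->buffer[s->used] = '\0';
    return s;
}
```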
        
           | jstimpfle wrote:
           | In general it's hard to get more efficient than a simple
           | struct String { const char *buffer; u32 size; }. Your method
           | removes an indirection from the allocated storage, but you'd
           | still need an external pointer to point to that struct in
           | most cases. That, plus retrieving the size now costs an
           | additional dereference. So I wouldn't use your method unless
           | I knew that I'd have to reference the string from multiple
           | locations.
           | 
           | The best way to be efficient is often to make assumptions
           | about the data. Most strings don't need any dynamic
           | allocation after having been "built". So it makes a ton of
           | sense to make a string builder API that returns a final
           | string when it's finished. In this way, you save at least the
           | "allocated" member.
           | 
           | The advantage of the simpler string representation is that it
           | works for any string (or substring) that is contiguous in
           | memory, and is completely decoupled from allocation concerns.
           | E.g. I can easily
           | 
           |     #define STRING(string_literal) \
           |         ((String) { string_literal "", sizeof string_literal - 1 })
           | 
           | , to be able to statically declare such strings like this:
           | 
           |     String my_string = STRING("Foo bar");
           | 
           | If you have many strings that you know are small, then just
           | the normal nul-terminated C string (without any size field)
           | is as storage-efficient as it gets.
           | 
           | In practice, I find string handling so easy that I rarely
           | even define this struct String. I just pass around strings to
           | functions as two arguments - pointer + size. It feels so
           | light and data flows so easily between APIs, I love it.
        
         | mh7 wrote:
         | Re #3:
         | 
         | A big downside is that you can't easily take ownership of
         | an existing buffer and treat it as this string type.
         | 
         | string s;
         | 
         | char some_buf[];
         | 
         | string_take(&s, some_buf, some_capacity);
         | 
         | Also you would of course never dynamically allocate string
         | structs, just the data member if needed.
        
         | _448 wrote:
         | > The trick is the size is hidden before the address of the
         | buffer.
         | 
         | That is how strings used to be stored before C made the choice
         | of using a null terminator. Pascal stored the string size before
         | the string data. The advantage of relying on a terminator
         | symbol is that the string can be any length, whereas
         | storing the size at the start forces the string to not exceed
         | a certain size.
        
           | [deleted]
        
           | anonymoushn wrote:
           | In execution environments with 64-bit pointers you may have
           | trouble loading a string of more than 16 exabytes into RAM
           | anyway
        
           | tored wrote:
           | Another example is Hollerith strings, which were used by FORTRAN
           | and TCP protocols.
           | 
           | https://en.wikipedia.org/wiki/Hollerith_constant
        
           | mh7 wrote:
           | strlen() returns a size_t so you're already constrained to a
           | maximum length of SIZE_MAX.
        
             | jcelerier wrote:
             | If you use a different data structure you would maybe use a
             | different API for accessing it too
        
             | jstimpfle wrote:
             | This is hilarious. SIZE_MAX is at least as large as the
             | largest string that you can put in your address space /
             | memory anyway. Which is what the strlen() API already
             | assumes.
             | 
             | That, plus you'd be a fool to store a huge string in this
             | way _anywhere_ (in or out of memory) in any case.
        
               | Someone wrote:
               | > SIZE_MAX is at least as large as the largest string
               | that you can put in your address space / memory anyway.
               | 
               | Not necessarily. A 64-bit system could give processes an
               | address space that's significantly larger than half the
               | full 64-bit address space and have an allocator that
               | allows you to allocate a block of more than _SIZE_MAX_
               | bytes ( _malloc_ takes a _size_t_ , but you can use
               | _calloc_ )
        
               | Karellen wrote:
               | size_t is unsigned, right? ssize_t is the signed version?
               | 
               | On a quick test on my 64-bit system, a C program doing
               | `printf("%zu\n", SIZE_MAX);` outputs
               | 18446744073709551615, which looks like (2^64)-1 to me.
               | 
               | Or is there a thing in the standard that says this isn't
               | always the case?
        
               | arcticbull wrote:
               | ssize_t is a weird one, the only negative value it is
               | guaranteed to store is -1.
               | 
               | > The type ssize_t shall be capable of storing values at
               | least in the range [-1, {SSIZE_MAX}].
               | 
               | [1] https://pubs.opengroup.org/onlinepubs/9699919799/base
               | defs/sy...
        
               | jstimpfle wrote:
               | This doesn't make sense to me. You can't "allocate" more
               | than SIZE_MAX bytes by definition. If you take "allocate"
               | to mean "make it available in the process's address
               | space", that is.
        
               | unwind wrote:
               | Are you sure?
               | 
               | The calloc() [1] function mentioned above takes _two_
               | values of type size_t, and allocates _their product_
               | bytes.
               | 
               | I'm on mobile without (!) the C99 draft spec but at least
               | the man page gives no such restriction.
               | 
               | [1] https://linux.die.net/man/3/calloc
        
               | mek6800d2 wrote:
               | I read something about this recently, somewhere, maybe
               | HN. Specifically, in calloc(), what is done and what
               | should really be done if the multiplication overflows. As
               | will happen, for example, if you try to calloc() two
               | elements of size SIZE_MAX, when SIZE_MAX is the maximum
               | representable unsigned integer value on the machine. So,
               | I don't think calloc() is available or intended as a way
               | to circumvent malloc()'s size restriction.
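
A minimal sketch of the overflow check discussed above, assuming two hypothetical helpers named `checked_mul_size` and `alloc_array` (neither is a standard function). The idea is to refuse the request when `count * size` would not fit in `size_t`, rather than letting the product wrap to a small value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: stores a * b in *out, or returns false if the
 * product does not fit in size_t. */
bool checked_mul_size(size_t a, size_t b, size_t *out)
{
    assert(out != NULL);
    if (a != 0 && b > SIZE_MAX / a)
        return false;           /* a * b would wrap around */
    *out = a * b;
    return true;
}

/* Hypothetical calloc-like wrapper: returns NULL on overflow instead of
 * allocating a wrapped, too-small block. */
void *alloc_array(size_t count, size_t size)
{
    size_t total;
    if (!checked_mul_size(count, size, &total))
        return NULL;
    return calloc(1, total);
}
```

With this check, asking for two elements of size SIZE_MAX fails cleanly, which matches the reading that two size_t arguments are not a way around the single-object size limit.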
        
               | ahepp wrote:
               | Isn't size_t defined as being able to fit the largest
               | possible data allocation?
        
               | pjmlp wrote:
               | Indeed, you just need to forget to put a terminator to
               | get a nice memory dump.
        
           | drfuchs wrote:
           | Nit: Many Pascal compilers / runtimes extended the language
           | in non-standard ways, including various schemes for storing
           | string length in front of the string. But nothing like this
           | was ever part of the ISO Pascal standard, and it was
           | certainly not in the "PASCAL User Manual and Report" by
           | Kathleen Jensen and Niklaus Wirth.
           | 
           | In fact, in standard Pascal, string handling is extremely
           | rudimentary; there was no way to express "this variable /
           | parameter / pointer refers to a string with a length not
           | known at compile-time".
        
             | pjmlp wrote:
             | They were on the ISO Extended Pascal, which hardly mattered
             | because by then, UCSD Pascal and Object Pascal already had
             | taken over the world of Pascal dialects, both of which had
             | better ways to deal with strings.
             | 
             | https://www.iso.org/standard/18237.html
             | 
             | Additionally, Modula-2 was already available in 1978,
             | sorting out all the issues of original Pascal, with all the
             | features needed for a safe systems programming language in
             | the late 70's.
        
               | drfuchs wrote:
               | In the late 70's, there were production-quality Pascal
               | compilers for DEC 20 / ITS / SAIL, Vax/VMS, IBM 360/370,
               | together covering much of academic computing and most of
               | the ARPAnet. Even consulting Wirth, Knuth couldn't find
               | suitable Modula-2 compilers available for these, so TeX
               | used Pascal and not Modula-2. Near as we heard, it was
               | only ever seriously used on the niche ETH workstation?
        
           | aap_ wrote:
           | Null-termination was not a C invention.
        
           | jstimpfle wrote:
           | The biggest advantage of zero-terminated to me is simplicity,
           | next would be efficiency for really small strings - although
           | this is a fringe concern. Strings with explicit length should
           | at least have a 32-bit length field (maybe 64) IMO - for
           | example, it's common to read files (and store them in
           | contiguous memory) that are larger than 64K.
        
             | abcd_f wrote:
             | The length can be packed, e.g. like utf-8 does it or
             | something similar. The caveat is the cost of unpacking on
             | access, but the memory overhead will be minimal.
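
One way to picture a packed length is a LEB128-style encoding: seven payload bits per byte, with the high bit meaning "more bytes follow". This is "something similar" rather than UTF-8's exact scheme, and the names `len_encode` and `len_decode` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes len as 7-bit groups, least significant first; the high bit of
 * each byte says "more bytes follow". Returns the number of bytes
 * written (at most 10 for a 64-bit value). */
size_t len_encode(uint64_t len, unsigned char *out)
{
    size_t n = 0;
    assert(out != NULL);
    do {
        unsigned char byte = (unsigned char)(len & 0x7F);
        len >>= 7;
        if (len != 0)
            byte |= 0x80;       /* continuation bit */
        out[n++] = byte;
    } while (len != 0);
    return n;
}

/* Reads a length written by len_encode. Returns the number of bytes
 * consumed, or 0 if the input is truncated or longer than 10 bytes. */
size_t len_decode(const unsigned char *in, size_t avail, uint64_t *len)
{
    uint64_t value = 0;
    assert(in != NULL && len != NULL);
    for (size_t i = 0; i < avail && i < 10; i++) {
        value |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *len = value;
            return i + 1;
        }
    }
    return 0;
}
```

Lengths up to 127 cost one byte, the same overhead as a null terminator. The cost is the loop on every access, which is the unpacking caveat mentioned above.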
        
             | thom wrote:
             | SDS supports 64-bit lengths. It also dynamically changes
             | the size of its size/flags field to accommodate growth. The
             | minimum overhead is an extra char (same as null
             | termination).
        
             | lifthrasiir wrote:
             | Most memory allocators have an internal fragmentation which
             | removes most efficiency gained by zero-termination. In fact
             | it's worse, because zero-termination means that
             | deallocation can't take a size parameter and it can often
             | cause a performance hit for many modern allocators due to
             | cache misses [1].
             | 
             | [1] https://isocpp.org/files/papers/n3778.html
        
             | jbverschoor wrote:
             | Well you could easily use the first 4 bits to indicate how
             | many bytes the length is + 1.
             | 
             |     c0 (0b0000) -> length is 0xc = 12
             |     1341 (0b0001) -> length is 0x134 = 308
             |     239a42 (0b0010) -> length is 0x239a4 = 145,828
             |     deadbeaf239a47 (0b0111) -> length is 0xdeadbeaf239a4 = 3917404957718948
             | 
             | This gets you a 7-byte = 56bit number, minimal overhead for
             | smaller strings.
             | 
             | Maybe reserve 0x1111 for future use.
             | 
             | Maybe the other endian makes more sense here, and maybe 0
             | should mean zero-length.
             | 
             | It's probably not very performant compared to other
             | solutions, but you can just shift 4 bits, and you're done
             | 
             | I'm curious how many strings are allocated at a particular
             | point in time (across everything, kernel, os, apps, etc)
        
               | jstimpfle wrote:
               | These considerations can make sense when thinking about
               | storage formats (probably you want to compress the string
               | too), but they are not convenient for in-memory
               | representation where you want to get the location of the
               | first character with a simple member access.
        
               | jbverschoor wrote:
               | It starts at the start of the frame + the first nibble +
               | 1
        
               | [deleted]
        
           | thaumasiotes wrote:
           | > The advantage of relying on a terminator symbol is that the
           | string size can be any length where as storing the size at
           | the start forces the string to not exceed certain size.
           | 
           | In the same way that since we identify unicode code points
           | with a 16-bit value, it's impossible to include U+1D460 in a
           | string?
           | 
           | In the same way that since Matroska files encode the length
           | of their segments, there's a hard upper limit on the length
           | of a segment?
           | 
           | Of course none of those things is actually true. Storing the
           | string size has no implications for how long the string can
           | be. It requires an amount of space, to store the string size,
           | that is logarithmic in the length of the string, and
           | completely insignificant.
        
             | jstimpfle wrote:
             | For sake of simplicity, and for efficiency with really
             | small strings, with a length-prefixed string representation
             | you really want to keep the string length field fixed-size.
             | In general.
        
               | thaumasiotes wrote:
               | Really small strings have a fixed-size length field in
               | any variable-size encoding of the length. They're small,
               | so they fit into whatever the smallest possible length
               | field is.
               | 
               | What do you gain in handling short strings from an
               | inability to handle long ones?
        
               | jstimpfle wrote:
               | Ok I give you this one, but I still don't think that
               | minimizing the size of a length field using a flexible
               | width encoding is a good idea except when talking about
               | extremely specialized string encodings (like compression
               | schemes).
               | 
               | Flexible width encoding is more complicated compared to
               | simple member access to get at the first character. And
               | how do you handle construction of a string whose size you
               | don't know yet? You might have to move the string away to
               | make space for a bigger string length field. I don't like
               | it.
        
               | thaumasiotes wrote:
               | > Flexible width encoding is more complicated compared to
               | simple member access to get at the first character.
               | 
               | I don't think this is true either. It's almost true. But
               | what happens if the string length is 0?
               | 
               | If you make the assumption that you can access the first
               | character of a zero-length string by just grabbing
               | whatever is in memory after the string header, you're
               | going to make the exact mistake the length field is there
               | to stop you from making, a memory access violation. You
               | have to process the length field in order to do any
               | access at all; many strings don't have a first character.
               | 
               | > And how do you handle construction of a string whose
               | size you don't know yet? You might have to move the
               | string away to make space for a bigger string length
               | field.
               | 
               | That's true; you'll either need to be willing to store
               | the character data and the length metadata in separate
               | locations, or you'll need to be willing to occasionally
               | move the data around.
        
               | jstimpfle wrote:
               | Obviously I mean get at the address of the first
               | character, if any. You can't load before you know that
               | what you load is valid. Btw. zero-terminated strings
               | allow you to load unconditionally. Sometimes that's nice.
        
               | thaumasiotes wrote:
               | OK, but now the difference in how complicated it is to
               | read from the string boils down to this:
               | 
               |     1. Read the first chunk of the string length.
               |     2. Is it more than 0?
               | 
               | vs
               | 
               |     1. Read the first chunk of the string length.
               |     2. Did we get the whole thing?
               |     3. Is the length more than 0?
               | 
               | That extra step in the variable-length case means
               | checking whether a bit is set in the value you just read.
               | 
               | ---
               | 
               | Also, it occurs to me that this whole discussion is
               | talking about how to serialize or deserialize a string,
               | when the original discussion is over how the string
               | should be represented in memory.
        
             | lelanthran wrote:
             | > In the same way that since we identify unicode code
             | points with a 16-bit value
             | 
             | Not a single 16-bit value. Some codepoints are two 16-bit
             | values.
        
               | kevin_thibedeau wrote:
               | Codepoints are 21-bit values. They may have a more
               | compact encoding but the unencoded form is fixed.
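
To make the two statements above concrete: code points run from U+0000 to U+10FFFF, which fits in 21 bits, and UTF-16 represents anything above U+FFFF as a pair of 16-bit surrogates. The function name `utf16_encode` is invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
 * written (1 or 2), or 0 for values that cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1D460, the code point mentioned earlier in the thread, this yields the pair 0xD835 0xDC60.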
        
         | jlokier wrote:
         | On point 3, you can achieve the same cache locality, without
         | losing the ability to take slices or append, by having the
         | string object contain a pointer to the string bytes, and
         | allocating the bytes _by default_ immediately after the string
         | object.
         | 
         | It is still single allocation, so the allocation is just as
         | fast.
         | 
         | The pointer is in the same cache line as the string bytes in
         | all strings except for slices (and any other fancy indirect
         | string types). Even though the code fetches indirectly via that
         | pointer, the CPU will be able to fetch the initial string byte
         | efficiently as soon as it has the pointer.
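
A rough sketch of the layout described above, under the assumption of a hypothetical type `xstr` (not the library's type): the header carries a pointer, and by default that pointer refers to bytes placed directly after the header in the same allocation. A slice reuses the same header shape but points into another string's bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a header followed, by default, by its own bytes. */
typedef struct xstr {
    char  *data;   /* normally points just past this header */
    size_t size;
} xstr;

/* One allocation holds the header, the bytes, and a terminator. */
xstr *xstr_new(const char *src, size_t n)
{
    xstr *s;
    assert(src != NULL || n == 0);
    if (n > SIZE_MAX - sizeof *s - 1)
        return NULL;                    /* total size would overflow */
    s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    s->data = (char *)(s + 1);          /* bytes start right after the header */
    if (n > 0)
        memcpy(s->data, src, n);
    s->data[n] = '\0';
    s->size = n;
    return s;
}

/* A slice is a by-value header pointing into another string's bytes.
 * Nothing is copied, and the slice must not outlive its parent. */
xstr xstr_slice(const xstr *s, size_t start, size_t len)
{
    xstr v;
    assert(s != NULL && start <= s->size && len <= s->size - start);
    v.data = s->data + start;
    v.size = len;
    return v;
}
```

Every reader goes through `data`, so the same code handles owned strings and slices; only owned strings benefit from the header and bytes sharing an allocation.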
        
           | estebank wrote:
           | How would this colocation of the string pointer work? Because
           | these would be in the heap, right? Otherwise the pointer
           | would get invalidated as soon as the enclosing function ends
           | and its stack frame gets discarded. So if it is in the heap
           | then you either have a pointer to the colocated pointer (not
           | very useful, if negligible performance impact) or you're
           | copying the colocated pointer (at which point you're back to
           | square one, having a pointer in the stack and the underlying
           | string in the heap). Am I missing something?
        
             | alcover wrote:
             | Good point. Some SSO (small str optimization) schemes
             | achieve this by pointing back into the struct itself. Gcc
             | String implementation for ex.
        
       | program wrote:
       | It's better not to use types that end with a '_t' because the
       | suffix is reserved in POSIX systems.
       | 
       | https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh...
        
         | cassepipe wrote:
         | On the other hand it is quite handy as a prefix, s_ for
         | structs, e_ for enums, g_ for globals, t_ for simple typedefs,
         | f_ for function pointers typedefs, u_ for unions... sky is the
         | limit !
         | 
         | And it's quite easy to create a highlighting rule for it in
         | vim if you have not yet converted to treesitter. Just put in
         | ~/.vim/after/syntax/c.vim :
         | 
         | ```
         | 
         | syn match cType /\<\(t\|s\|e\|u\)_\w\+\>/
         | 
         | ```
         | 
         | Boom, custom type highlighting for C ! Pick the letters you
         | will use.
        
           | lifthrasiir wrote:
           | No wonder why a significant portion of C programmers actually
           | want to keep tags (`struct` in `struct foo`) instead of
           | removing them with typedef.
        
         | kazinator wrote:
         | I don't agree with that at all; it is "the sky might fall"
         | reasoning.
         | 
         | Just
         | 
         | * have sane naming in your program.
         | 
         | * respect namespaces like _[A-Z] and __
         | 
         | * solve clashes that actually happen
         | 
         | Historically, revisions of POSIX have introduced identifiers
         | that were not in any previously announced namespace. There is
         | no way you can name an identifier that is guaranteed not to
         | clash with POSIX, or any other vendor. For instance the name
         | "openat" was fine to use in a POSIX program once upon a time.
         | 
         | Consider that all strings have the empty string as a suffix.
         | The string "abc" has four suffixes: "abc", "bc", "c" and "".
         | 
         | So, every current and future POSIX identifier has "" as a
         | suffix. This is not just a threat; it is guaranteed! Since
         | every identifier in your program also has a "" suffix, it
         | clashes with that namespace.
         | 
         | What's wrong with the argument is that identifiers don't just
         | have a suffix; they have to be identical in order to actually
         | clash. (Or have identical prefixes, due to truncation of
         | external names in a linker: decades ago, the limits were
         | ridiculously small.)
         | 
         | I doubt that even one person in POSIX standardization would be
         | dumb enough to approve str_t being added as a typedef name in
         | some existing or new header, and multiple approvals are
         | required.
         | 
         | Nobody should be losing any sleep over using _t typedef names
         | in their C code.
        
           | Kwpolska wrote:
           | The argument with "" as suffix sounds quite absurd.
           | 
           | Why do you believe POSIX would never approve a str_t type?
           | Nobody likes raw char arrays, perhaps a future revision of
           | POSIX may decide to make the lives of C programmers easier
           | and implement their own sane string type.
        
             | jstimpfle wrote:
             | I for one like "raw char arrays", and really don't care
             | about missing string functionality in C. I basically use
             | sizeof, snprintf, memcpy and am just fine. I've toyed with
             | defining struct String{ptr,size} sometimes but largely it
             | just gets in the way.
             | 
             | If you think it's necessary, it's very easy to make an
             | argument that you'd have to have a generic type for slices
             | of any type. (Actually, more so than strings, since C is
             | just not a language for domains with a focus on strings).
             | 
             | Now, whether you think a language must have a generic slice
             | type or not, C is simply not the language where you can fit
             | that in.
        
             | [deleted]
        
             | kazinator wrote:
             | Yes, the argument is absurd; it's supposed to be.
             | 
             | Now extending the suffix to "_t" doesn't make it much less
             | absurd. Not qualitatively, just a bit quantitatively less
             | absurd.
             | 
             | Why I suspect POSIX isn't about to add a str_t is that
             | str_t is likely to occur in countless numbers of unknown
             | existing code bases.
             | 
             | And _that_ might be a good reason for avoiding it in a
             | library API, not the _t namespace being reserved.
             | 
             | We can have this variant of the argument: most identifiers
             | end in a lower-case letter, so they land into any one of 26
             | namespaces: the *a namespace, the *b namespace, ... future
             | POSIX identifiers have to be in one of these 26, except
             | those that end in digits or underscores. POSIX does _not_
             | say "future versions of this standard shall not claim new
             | function or other identifiers ending in e". That doesn't
             | mean you stay away from identifiers ending in "e", right?
             | 
             | I wouldn't avoid str_t in the internals of a program
             | though. In the worst case, a clash happens somewhere and we
             | do some renaming; life goes on.
             | 
             | POSIX's reservation doesn't really mean much; all they are
             | saying is "we have some type names ending in _t, and will
             | likely have more, so watch out". Yes, POSIX will likely
             | have such names, and so will every C programmer and his
             | dog. Whoppee dee. POSIX will likely have new names ending
             | in 'e' also, and so on.
        
         | eps wrote:
         | Yeah, that's _the_ lowest-hanging C pedantry nitpick.
         | 
         | Usually if there's nothing else meaningful that one can say
         | about someone else's project, they will comment on the _t
         | naming... and as anyone with an iota of real-world experience
         | would know it's a complete non-issue outside of a handful of
         | top-tier open source projects.
         | 
         | Don't be that guy. Save this comment for when it may actually
         | be relevant.
        
         | mh7 wrote:
         | It's reserved in the same sense that google's style guide
         | 'reserves' struct names starting with a capital letter.
         | 
         | Only ISO C can officially reserve names, everyone else just has
         | their personal code/naming style that you can choose to follow
         | or not.
        
         | lifthrasiir wrote:
         | I would argue in the other way: the C standard should have a
         | standard string type named `str_t`, and this library is one way
         | to prototype it ;-)
        
       | e-dant wrote:
       | I guess it's nice for a C string API, but what's the motivation
       | to use and create this? Wouldn't externing some C++ symbols (or
       | Rust) work more smoothly?
        
         | lelanthran wrote:
         | > Wouldn't externing some C++ symbols (or Rust) work more
         | smoothly?
         | 
         | For the C++ case, it's not that easy due to C code that cannot
         | handle exceptions thrown in C++ code.
         | 
         | For the Rust bit, I'm not sure - creating the library in Rust
         | and letting it be called from C makes the whole rust library
         | unsafe because the data returned from the Rust API would lose
         | ownership information, and is no more safe than simply writing
         | it in C.
        
       | kazinator wrote:
       | > Attempting to split a string using non-existent delimiter with
       | str_pop_first_split() [returns an invalid string with .data ==
       | NULL].
       | 
       | But that seems like a valid case: e.g. these are comma-delimited
       | lists of numbers:
       | 
       |     ""  // empty
       |     "1"  // one number
       |     "20,30" // two numbers
       | 
       | the above remark in the documentation seems to be saying (perhaps
       | falsely) that if we try to extract a token from the "1" string
       | using "," as a delimiter, we get an invalid str_t rather than
       | "1".
       | 
       | I don't see coverage for this in the tests. There is a test which
       | uses "123/456/789", which extracts the first two splits, and then
       | just verifies that "789" remains. What the programmer wants is to
       | be able to write a loop which will extract "123", "456" and "789"
       | and _then_ hit the terminating case where the invalid str_t is
       | returned.
       | 
       | How many items are in "1,2,3," viewed as comma-separated: three
       | or four?
       | 
       | It would also be a code improvement to replace umpteen
       | repetitions of "(str_t){.data = NULL, .size = 0}" throughout the
       | code with a macro.
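
As an illustration of the loop being asked for, here is a sketch with a hypothetical `view` type and `view_pop_first` function. It is not the library's str_pop_first_split(), and its behaviour is an assumption chosen for the example: when the delimiter is absent, the rest of the source is returned as the final token and the source becomes invalid, so the next call signals the end. It also uses a single macro for the invalid value, as suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type, modelled on the str_t shape from the thread. */
typedef struct view {
    const char *data;
    size_t      size;
} view;

/* One name for the invalid value instead of repeating the literal. */
#define VIEW_INVALID ((view){ .data = NULL, .size = 0 })

view view_from_cstr(const char *s)
{
    assert(s != NULL);
    return (view){ .data = s, .size = strlen(s) };
}

/* Removes and returns the text before the first delim. If delim is
 * absent, returns everything that is left and marks *src invalid, so
 * the call after that returns VIEW_INVALID and a loop can stop. */
view view_pop_first(view *src, char delim)
{
    view tok;
    const char *hit;
    assert(src != NULL);
    if (src->data == NULL)
        return VIEW_INVALID;
    hit = src->size ? memchr(src->data, delim, src->size) : NULL;
    if (hit == NULL) {
        tok = *src;
        *src = VIEW_INVALID;
        return tok;
    }
    tok = (view){ .data = src->data, .size = (size_t)(hit - src->data) };
    src->size -= tok.size + 1;          /* skip the token and the delimiter */
    src->data = hit + 1;
    return tok;
}

/* Counts tokens by looping until the invalid view comes back. */
size_t view_count_tokens(const char *s, char delim)
{
    view src = view_from_cstr(s);
    size_t n = 0;
    while (view_pop_first(&src, delim).data != NULL)
        n++;
    return n;
}
```

Under this convention "123/456/789" yields three tokens, "1" yields one, and "1,2,3," yields four with the last one empty. The empty string yields a single empty token, so a caller who wants it to mean an empty list would have to special-case it.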
        
       | kaba0 wrote:
       | All in all, C is still not expressive enough for even such a
       | basic data structure as strings.
        
       | Diggsey wrote:
       | Slightly ironic that Rust is criticized for having multiple
       | string types, and yet the solution to simplify string handling in
       | C is to introduce the exact same types (str_t == &str, strbuf_t
       | == String) albeit without the safety guarantees.
        
         | estebank wrote:
         | It is still frustrating to me that C still doesn't have a non-
         | allocating method to handle substring references, which both
         | C++ and Rust have. On the other hand I see people trying to
         | parse files, like JSON, in a non-allocating way in Rust and hit
         | a wall until they realize that nodes need to be escaped for
         | anything useful, which requires owning the node's memory
         | (meaning, you need a String or at least Cow<'_, str>, can't get
         | away with a &str).
        
         | adamdusty wrote:
         | I don't think anyone minds that rust has multiple string types
         | just that they're effectively named the same thing so people
         | new to rust have no clue which does what without looking it up.
         | Furthermore people without c/c++ experience mostly won't even
         | know there is a difference since most languages don't give you
         | that control over strings.
         | 
         | If rust string were str and strvec or strbuf no one would care.
        
       | andrewmcwatters wrote:
       | I want C strings that are compatible with string.h.
       | 
       | I want some struct that is a pointer to the char array `s' with
       | size_t `n'.
       | 
       | To meaningfully do this, it means you need auxiliary functions
       | that you execute after calling string.h functions, or you write
       | wrappers that do this for you after calling the relevant string.h
       | functions.
       | 
       | I'm OK with that.
       | 
       | SDS doesn't do this. Most other C string libraries like this one
       | basically do what I'm asking for, but not quite.
       | 
       | I don't want separate structs for reading and writing strings. I
       | just want authors to keep it as simple as possible without
       | diverging too hard from how C strings already work today.
        
         | kevin_thibedeau wrote:
         | I have a personal lib that works like this. It maintains a
         | simple struct with a start pointer and a one-past-the-end
         | pointer. You can use it to construct a view or point into
         | unused space at the end of a string for building ops. NUL
         | termination is preserved so interop with stdlib is always
         | available.
         | 
         | This allows for nicer string handling while always allowing
         | interop with anything expecting a char *. Libraries with their
         | own string implementation always exact a penalty to get a cstr
         | out.
        
       | jandrese wrote:
       | This is kind of a bikeshed argument, but I'd prefer if the view
       | was labeled as such. So instead of str_t it would be strview.
       | Rust makes this same mistake IMHO and it causes a lot of
       | confusion for beginners. I would personally call the strbuf_t
       | strstore but that's even more nitpicky.
       | 
       | Naming things is one of the hardest problems in CS.
        
       | zajio1am wrote:
       | 1. Ditching null termination makes it cumbersome for
       | interoperability with C ecosystem.
       | 
       | 2. It has terrible overhead.
        
         | LegionMammal978 wrote:
         | > All strbuf functions maintain a null terminator at the end of
         | the buffer, and the buffer may be accessed as a regular c
         | string using mybuffer->cstr.
         | 
         | So effectively a str_t works like an std::string_view from C++,
         | and strbuf_t works like an inline std::string.
         | 
         | To produce a null-terminated string from a section of a longer
         | string requires an allocation, unless you can temporarily
         | modify the original string to replace one of its characters
         | with a terminator.
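
The allocation mentioned above might look like the following sketch. `span_dup_cstr` is a hypothetical helper, not part of the library under discussion; it copies a section of a longer string into a fresh buffer so that a terminator can be added without touching the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies n bytes starting at data into a fresh,
 * null-terminated buffer. The caller frees the result. */
char *span_dup_cstr(const char *data, size_t n)
{
    char *out;
    assert(data != NULL || n == 0);
    if (n == SIZE_MAX)
        return NULL;                /* n + 1 would overflow */
    out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    if (n > 0)
        memcpy(out, data, n);
    out[n] = '\0';
    return out;
}
```

The copy is only needed when the section ends before the end of the source string. A view that runs to the end of an already terminated string can be passed to C functions as it is.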
        
           | zajio1am wrote:
           | Well, the documentation says that null terminator is
           | maintained at the end of the buffer (i.e.
           | mybuffer->cstr[mybuffer->capacity - 1]), not at the end of
           | the string stored in the buffer (i.e.
           | mybuffer->cstr[mybuffer->size]).
        
             | LegionMammal978 wrote:
             | Not sure where you're getting that interpretation from. If
             | you look at the actual code, it sets buf->cstr[buf->size] =
             | 0 every time the string is resized. After all, what else
             | could "the buffer may be accessed as a regular c string"
             | possibly mean?
        
               | zajio1am wrote:
               | > Not sure where you're getting that interpretation from.
               | 
               | That is just plain reading of "null terminator at the end
               | of the buffer", as 'buffer' is just place in memory,
               | regardless of what is stored in it. 'End of the buffer'
               | is commonly used for end of such reserved memory, not end
               | of valid data in that memory.
               | 
               | But maintaining the null-terminated string in the buffer
               | is much more useful behavior than just maintaining null
               | terminator at the end of the buffer, so it is likely just
               | sloppiness in the documentation.
        
       | KingLancelot wrote:
        
       | thesz wrote:
       | Having a string type that has "invalid string" value which is
       | different from empty string value is a bliss.
       | 
       | What is important there is that the invalid string value is
       | completely compatible with most C functions - despite actual data
       | pointer is NULL, the length of data is zero so memcmp,
       | memmove/memcpy and most other functions will not segfault.
       | 
       | This is a really well thought out approach.
       | 
       | Thank you!
        
       | gkfasdfasdf wrote:
       | Hoping someone can educate me, what are the advantages of having
       | the last member of strbuf_t be a variable length array (char
       | cstr[]) instead of just a char*?
        
         | ksherlock wrote:
         | With inline data, only one malloc is needed for the buffer
         | housekeeping and character data. It's also probably slightly
         | better for cache performance since the housekeeping data and
         | string data are together.
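
A small sketch of the single-allocation point, using a hypothetical `fbuf` type with a flexible array member. The field names echo the ones mentioned in the thread, but the real strbuf_t layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer: housekeeping fields and characters share one block. */
typedef struct fbuf {
    size_t capacity;   /* bytes available in cstr, excluding the terminator */
    size_t size;       /* bytes currently used */
    char   cstr[];     /* flexible array member: storage follows the header */
} fbuf;

/* A single malloc covers the header, the characters, and the terminator. */
fbuf *fbuf_create(size_t capacity)
{
    fbuf *b;
    if (capacity > SIZE_MAX - sizeof *b - 1)
        return NULL;                    /* total size would overflow */
    b = malloc(sizeof *b + capacity + 1);
    if (b == NULL)
        return NULL;
    b->capacity = capacity;
    b->size = 0;
    b->cstr[0] = '\0';
    return b;
}

/* Appends text if it fits and keeps cstr[size] as the terminator.
 * Returns 1 on success, 0 if there is not enough room. */
int fbuf_append(fbuf *b, const char *text)
{
    size_t n;
    assert(b != NULL && text != NULL);
    n = strlen(text);
    if (n > b->capacity - b->size)
        return 0;
    memcpy(b->cstr + b->size, text, n);
    b->size += n;
    b->cstr[b->size] = '\0';
    return 1;
}
```

With a `char *` member instead, the same buffer would need a second allocation for the characters, or the caller would have to supply the storage separately.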
        
           | gkfasdfasdf wrote:
           | Ah I see. If you want to refer to a string that was not part
           | of this allocation you would use the other str_t type,
        
         | ComputerGuru wrote:
         | You can store the string as part of the same heap/stack
         | allocation rather than as a separate allocation.
        
       | naasking wrote:
       | Why not ropes?
       | 
       | https://github.com/josephg/librope
        
       | musicale wrote:
       | > str.h defines the following str_t type:
       | 
       |     typedef struct str_t {
       |         const char* data;
       |         size_t size;
       |     } str_t;
       | 
       | Sort of a hybrid of C style (pointer) and Pascal style (bounded
       | array) strings?
        
       | pjmlp wrote:
       | This is the kind of string library that WG14 should care about.
       | 
       | Kudos for having a go at it.
        
       | habibur wrote:
       | I use a lib like this, but with a few changes.
       | 
       |     printf("The string is %"PRIstr"\n", PRIstrarg(mystring));
       | 
       | Simpler:
       | 
       |     printf("the string is : %.*s",mystr.size, mystr.data)
       | 
       | But that's tedious to write. So create a small macro
       | 
       |     #define ls(x) (x).size,(x).data
       | 
       | And then printf becomes as simple as :
       | 
       |     printf("the string is : %.*s", ls(mystr));
       | 
       | Though OP's macro is possibly doing more.
        
         | masklinn wrote:
         | > But that's tedious to write. So create a small macro
         | 
         | > #define ls(x) (x).size,(x).data
         | 
         | Doesn't that double-evaluate `x`?
        
           | habibur wrote:
           | It does.
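
A possible variant of the same idea, with two caveats made explicit. The `%.*s` precision argument must be an `int`, so a `size_t` length needs a conversion, and a function taking the view by value evaluates its argument only once. The `sv` type and both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical view type for this sketch. */
typedef struct sv {
    const char *data;
    size_t      size;
} sv;

/* %.*s takes its precision as an int, so clamp and convert explicitly. */
int sv_precision(sv s)
{
    return s.size > (size_t)INT_MAX ? INT_MAX : (int)s.size;
}

/* Formats "<prefix><view>" into out and returns what snprintf returns.
 * The view is passed by value, so the caller's expression is evaluated
 * once. An empty view prints "" so that a NULL data pointer never
 * reaches %s. */
int sv_format(char *out, size_t cap, const char *prefix, sv s)
{
    assert(out != NULL && prefix != NULL);
    return snprintf(out, cap, "%s%.*s", prefix,
                    sv_precision(s), s.size ? s.data : "");
}
```

The macro form is shorter at the call site; the function form trades that for a checked conversion and single evaluation.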
        
       | gjvc wrote:
       | worth comparing to https://cr.yp.to/lib/stralloc.html
        
         | plan999 wrote:
         | You should look at an even better string library. Many more
         | functions and safety for split/join/tokenizer/etc. It's a fork
         | of the plan9 string library bstring.
         | 
         | https://bstring.sourceforge.net/
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-12-03 23:01 UTC)