[HN Gopher] Show HN: A nice C string API
___________________________________________________________________
Show HN: A nice C string API
A convenient C string API, friendly alongside classic C strings.
Author : mickjc750
Score : 88 points
Date : 2022-12-03 12:31 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| the-printer wrote:
| Are there any good resources that explain the concept of strings in
| C, particularly why they're considered to be so difficult to
| manage? I'm interested in the language, and that along with its
| safety concerns seem to be the two most frequent complaints
| against it that I read about online.
| Quentak wrote:
| I have written a short article explaining why null terminated
| strings as they exist in C cannot represent proper ASCII and
| UTF-8 because of the null terminator. It's not a full
| explanation of how strings work but it might be helpful for
| you.
|
| https://kttnr.net/blog/null-terminated-strings-are-incorrect...
| lifthrasiir wrote:
| SQLite does store a null character in strings, it has lots of
| _documented_ [1] issues in the API level though.
|
| [1] https://www.sqlite.org/nulinstr.html
| Quentak wrote:
| Thanks for this link.
|
| How do you get the null byte into the string? Is it through
| casting blob to string? The way I have encountered this is
| when using the C API in which string arguments for prepared
| statements are passed as char pointers. If those contain
| the null byte then the string is cut off.
|
| Allowing null characters and then mishandling them is worse
| than not allowing them.
| e-dant wrote:
| C strings are pointers to memory. There are semantics and
| assumptions encouraging null-character delimited strings, but
| not every API follows those rules (just got done working with a
| Windows API that doesn't).
|
| Often, you have to both null-delimit your string and store its
| length somewhere. That's the dangerous part. Messing either of
| those up, or passing your string to an API that messes that up,
| is not safe.
|
| C strings are pointers to memory, either the stack or the heap,
| and follow exactly the same rules as everything else in that
| chaotic space: Not many.
| the-printer wrote:
| Thank you for this. C programming sounds almost like some
| sort of combat sport. Riveting.
| lelanthran wrote:
| > Thank you for this. C programming sounds almost like some
| sort of combat sport. Riveting.
|
| I've done it for decades; it isn't really as bad as hype-
| attracting headlines would have you believe.
|
| Munitions control, aircraft management systems, industrial
| automation systems, and many more life-critical systems
| were programmed in C for decades with comparatively little
| danger from the language intrinsics leading to death.
|
| It's easy to look at the stats and say "there's a few dozen
| CVEs annually due to C footguns", but that's a few dozen
| out of hundreds of millions of deployed systems that are
| written in C.
|
| In practice, very few lines of C code bypass the type
| system, so you get far fewer bugs than an equivalent
| system in the more usual dynamic programming languages
| (Python, Javascript, etc).
| thesnide wrote:
| I wonder whether the big influx of C-derived CVEs comes from old or
| new code. If it is new code, I also wonder about the
| brain damage that those safe languages cause.
|
| Yes, it is better to have memory safe languages. But it
| encourages sloppiness as "nothing can happen". Then those
| folks aren't fit to write anything else. Which closes the
| feedback loop on inefficient but safe languages.
|
| Which becomes the same thing in airplanes. Pilots don't
| really know how to fly without instruments anymore.
| wadd1e wrote:
| >Which becomes the same thing in airplanes. Pilots don't
| really know how to fly without instruments anymore.
|
| Well that's just a blatantly wrong generalisation you
| made there, curious as to where you got that from.
| Consider looking up how pilot training is done before
| making such assumptions. Even though modern airplanes
| make heavy use of technology, there are emergency
| scenarios where lots of instruments may not work, and
| pilots receive more than enough training to fly an
| airplane in that scenario just to give one example among
| tons of others.
|
| edit: grammar
| marssaxman wrote:
| More like fire-performance: it _looks_ dangerous, and it
| does require some finesse, but it's really satisfying when
| you get in the flow, and burns are both less frequent and
| less serious than you might imagine as an onlooker.
| pjmlp wrote:
| It is more like a combat sport, doing martial art moves,
| while trying to juggle knives between moves.
| jstimpfle wrote:
| "Strings" are quite an abstract concept. They are a linear
| sequence of characters. But there are a number of ways to
| represent them - the simplest of which is a contiguous memory
| allocation, but depending on the use case you'd need more
| complex schemes. There are also different ways to do the
| necessary memory management (e.g. allocate statically at
| compile time vs dynamically at run time).
|
| One of the most complex representations is probably the string
| rope datastructure - a balanced tree of string chunks,
| supporting efficient insertion and removal anywhere in the
| string.
|
| Specific to C, as well as lots of low-level APIs, is only that
| strings are often expected to be contiguously laid out in
| memory and terminated with a NUL (0) byte. So you need to make
| sure that you always terminate with a NUL after writing to
| string storage.
|
| Other than that, strings aren't any harder than other aspects
| of programming with manually managed memory.
|
| Maybe motivated from higher-level dynamic or managed languages,
| is the popular idea that strings should always be allocated
| dynamically (like std::string for example), and support
| operations like string-append with automatic reallocation if
| the currently allocated memory isn't enough to store the new
| string.
|
| In practice, that's not true at all - unless you are in a
| domain where lots of small intermediate strings are generated.
| This is pretty inefficient anyway and there is likely no point
| to use C in this case.
|
| By far most strings in most domains are either completely
| static (use string literals), or are created once in a sequence
| of append operations and then never changed again. I get by,
| doing many different things from GUI apps to networking to
| parsers and interpreters, without any sophisticated string
| type. All I do is define some printf-like APIs to do logging,
| for example. Those typically just use a fixed size buffer for
| the formatting, and then flush that buffer to e.g. stderr. _or_
| flush it to a dynamically allocated memory buffer, but there
| almost never is a need to reallocate that string later.
| WalterBright wrote:
| > why they're considered to be so difficult to manage?
|
| Back in the 90s, I was very experienced with C strings and
| managing them. Then I chanced to look at BASIC again, and
| realized that strings in BASIC were so simple and intuitive.
| Why couldn't C be like that? When I started on the design of D,
| I decided that it had to make strings as easy to do as BASIC
| did.
|
| And D does.
|
| The trouble with C strings is the 0 termination of them. This
| means:
|
| 1. to get the length of the string, you have to scan it. This
| is expensive.
|
| 2. when manipulating strings, a common error is to get off by
| one in the storage because of the 0 termination
|
| 3. you cannot get a subset of the string without making a copy.
| Not only is a copy expensive, but then you have to keep track
| of the memory for it
|
| 4. there's no way to check for buffer overflows
|
| D's design, which uses a phat pointer (length, ptr) for
| strings, solves these problems.
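|
| A minimal sketch of the (length, ptr) idea in plain C, to make
| points 1 and 3 above concrete. The names str_view, sv_from_cstr,
| sv_slice and sv_equals_cstr are hypothetical and are not taken
| from D or from the library under discussion.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fat-pointer string view: a length plus a pointer.
   It does not own the bytes and need not be NUL-terminated. */
typedef struct {
    size_t length;
    const char *ptr;
} str_view;

/* Wrap a classic C string; this is the only place strlen() scans. */
static str_view sv_from_cstr(const char *s)
{
    str_view v = { strlen(s), s };
    return v;
}

/* Take the sub-range [start, end) without copying. Out-of-range
   indices are clamped, so the result never points outside the parent. */
static str_view sv_slice(str_view v, size_t start, size_t end)
{
    if (end > v.length) end = v.length;
    if (start > end) start = end;
    str_view out = { end - start, v.ptr + start };
    return out;
}

/* Compare a view against a C string: lengths first, then bytes. */
static int sv_equals_cstr(str_view v, const char *s)
{
    size_t n = strlen(s);
    return v.length == n && memcmp(v.ptr, s, n) == 0;
}
```

| With this, sv_slice(sv_from_cstr("hello, world"), 7, 12) refers to
| "world" inside the original buffer. The view is only valid while
| that buffer is, which is the trade-off debated further down.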
| tored wrote:
| A year ago I picked up the BASIC dialect PureBasic. Pleasant
| surprise actually, the syntax of the PureBasic dialect is a
| bit archaic, but if you accept that it is much easier and
| faster to get anything done compared to C (and C++).
| Personally I find low level topics easier to grok in
| PureBasic than in C even though they mirror the same
| concepts. PureBasic has Unicode strings built in.
|
| It is a bit of a shame that BASIC has such a bad reputation; there
| are many BASIC dialects that still do the job well today.
|
| https://www.purebasic.com/documentation/reference/ug_string....
| Someone wrote:
| The problem is that C doesn't have strings; it has functions
| that treat sequences of non-zero bytes followed by a zero byte
| as if they are strings.
|
| So, you can't ask it to create a string that contains the
| result of appending a string to another one. If you want to
| append two 'strings', you have to create a buffer large enough
| to hold the result, and then copy in the two sequences of
| bytes. And even for doing that, the library functions aren't
| optimal. The basic "append this string's data to that string,
| assuming there's enough space to do so" function is _strcat_.
| It walks the first string to find the zero byte, but to "create
| a buffer large enough to hold the result" you already must do
| that.
|
| See for example
| https://stackoverflow.com/questions/21880730/c-what-is-the-b...
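|
| One way to avoid the repeated scan described above is to measure
| each input once and then copy with memcpy. This is only a sketch;
| concat_once is a hypothetical helper name, not a standard function.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate two C strings into a freshly allocated buffer.
   Each input is scanned exactly once; the caller frees the result.
   Returns NULL if allocation fails or the combined size would overflow. */
char *concat_once(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    if (lb > SIZE_MAX - la - 1)
        return NULL;

    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);   /* also copies b's terminator */
    return out;
}
```

| Calling concat_once("foo", "bar") yields a heap-allocated "foobar".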
| jstimpfle wrote:
| You can use snprintf to easily achieve any concatenation
| you'd like.
|
|     len = snprintf(buffer, buffersize, "%s%s%d", string_1, string_2, int_1);
|     if (len + 1 /*NUL*/ > buffersize) {
|         // not enough space
|     }
|
| You can also use this to dynamically allocate any formatted
| string:
|
|     len = snprintf(NULL, 0, ....);
|     buffersize = len + 1; /* NUL */
|     buffer = allocate(buffersize);
|     snprintf(buffer, buffersize, ... /*same args as before*/);
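|
| Spelled out as a complete function, the measure-then-allocate
| pattern might look like the sketch below. join_with_number is a
| hypothetical name, malloc stands in for allocate, and a check for
| a negative return value (an encoding error) is added.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build "<string_1><string_2><int_1>" in a heap buffer of exactly the
   right size. Returns NULL on failure; the caller frees the result. */
char *join_with_number(const char *string_1, const char *string_2, int int_1)
{
    /* Pass 1: a NULL buffer with size 0 only reports the needed length. */
    int len = snprintf(NULL, 0, "%s%s%d", string_1, string_2, int_1);
    if (len < 0)
        return NULL;

    size_t buffersize = (size_t)len + 1;   /* +1 for the NUL */
    char *buffer = malloc(buffersize);
    if (buffer == NULL)
        return NULL;

    /* Pass 2: same format and arguments, now with real storage. */
    snprintf(buffer, buffersize, "%s%s%d", string_1, string_2, int_1);
    return buffer;
}
```

| join_with_number("ab", "cd", 42) returns a newly allocated "abcd42".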
| tom_ wrote:
| I like the printf family too. Any time you're doing a bunch
| of strcat or whatever it's almost always massively easier
| to use a format string to get the same result. Very easy to
| get the desired width/precision/alignment, and if you need
| numbers, printf has your back. It even does the bounds
| checking for you! (And how often do you get _that_ in C.)
|
| It won't be as fast, but it's almost always not a problem,
| and the nice thing about C and C++ is that the char-by-char
| route is still available when it is.
|
| I like to use asprintf, when available:
| https://man7.org/linux/man-pages/man3/asprintf.3.html - and
| when not available, I add it, along the lines of the
| snippet you present.
|
| Here's something I've found a useful upgrade to asprintf,
| as it frees the passed-in buffer after expanding the format
| string. You can just pass the same char ** repeatedly and
| it'll update the char * appropriately each time.
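|
|     /* needs <stdarg.h>, <stdio.h>, <stdlib.h>;
|        vasprintf is a GNU/BSD extension */
|     int xasprintf(char **p, const char *fmt, ...) {
|         int n = 0;
|         char *p2 = NULL;
|         if (fmt) {
|             va_list v;
|             va_start(v, fmt);
|             n = vasprintf(&p2, fmt, v);
|             va_end(v);
|         }
|         if (n >= 0) {
|             free(*p);   /* old buffer is freed only after formatting */
|             *p = p2;
|         }
|         return n;
|     }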
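|
| Where asprintf is not available, a fallback along the lines
| mentioned above could be sketched with vsnprintf and va_copy. The
| names fallback_vasprintf and fallback_asprintf are hypothetical,
| chosen to avoid clashing with a libc that already provides the
| real functions.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a newly allocated buffer stored in *out.
   Returns the string length, or -1 on failure (*out is then untouched). */
int fallback_vasprintf(char **out, const char *fmt, va_list args)
{
    /* A va_list can only be walked once, so measure on a copy. */
    va_list measure;
    va_copy(measure, args);
    int len = vsnprintf(NULL, 0, fmt, measure);
    va_end(measure);
    if (len < 0)
        return -1;

    char *buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return -1;

    vsnprintf(buf, (size_t)len + 1, fmt, args);
    *out = buf;
    return len;
}

/* Variadic front end, mirroring the shape of asprintf. */
int fallback_asprintf(char **out, const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int len = fallback_vasprintf(out, fmt, args);
    va_end(args);
    return len;
}
```

| Unlike some asprintf implementations, this sketch leaves *out
| unchanged on failure, so callers should rely on the return value.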
| cassepipe wrote:
| I always use antirez's (Redis creator) `sds` and advertise it
| whenever I get the chance. Thanks to whoever recommended
| it on HN some years ago. It's a joy to use.
|
| https://github.com/antirez/sds
|
| The trick is that the size is hidden before the address of the
| buffer. ("Learn this one simple trick that will change your life
| for ever").
|
| From the Readme:
|
| ```
|
| Advantage #1: you can pass SDS strings to functions designed for
| C functions without accessing a struct member or calling a
| function
|
| Advantage #2: accessing individual chars is straightforward.
|
| Advantage #3: single allocation has better cache locality.
| Usually when you access a string created by a string library
| using a structure, you have two different allocations for the
| structure representing the string, and the actual buffer holding
| the string. Over the time the buffer is reallocated, and it is
| likely that it ends in a totally different part of memory
| compared to the structure itself. Since modern programs
| performances are often dominated by cache misses, SDS may perform
| better in many workloads.
|
| ```
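|
| To illustrate the general "size hidden before the address" idea,
| here is a deliberately simplified sketch. It is not the real SDS
| layout or API; hstr_header, hstr_new, hstr_len and hstr_free are
| hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
typedef struct {
    size_t len;
    char data[];   /* flexible array member: the bytes follow the header */
} hstr_header;

/* Create a string; the returned pointer addresses the characters,
   so it can be handed to any function expecting a C string. */
char *hstr_new(const char *init)
{
    size_t len = strlen(init);
    hstr_header *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->data, init, len + 1);
    return h->data;
}

/* Step back from the character pointer to the hidden header. */
static hstr_header *hstr_header_of(const char *s)
{
    return (hstr_header *)(s - offsetof(hstr_header, data));
}

/* O(1) length lookup, no scan for the terminator. */
size_t hstr_len(const char *s)
{
    return hstr_header_of(s)->len;
}

/* Must be used instead of free(): the allocation starts at the header. */
void hstr_free(char *s)
{
    if (s != NULL)
        free(hstr_header_of(s));
}
```

| The pointer returned by hstr_new works with strlen or printf, but
| passing it to plain free() would be a bug, because the allocation
| begins at the header rather than at the characters.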
| schemescape wrote:
| That sounds like the same allocation scheme as used in
| Microsoft's BSTR type:
| https://learn.microsoft.com/en-us/previous-versions/windows/...
| kazinator wrote:
| Thus, sds cannot be used for the use cases that this library
| allows.
|
| This library takes string slices without having to allocate or
| copy memory; it seems to be for use cases involving breaking
| down strings in complex ways, where good ergonomics and
| efficiency of obtaining a null-terminated C string are
| secondary.
| WalterBright wrote:
| > The trick is that the size is hidden before the address of the
| buffer. ("Learn this one simple trick that will change your life
| for ever").
|
| The length-prefix string has a major problem - it cannot be
| sliced to produce another length-prefix string. It has to be
| copied. Instead, using a phat pointer (size_t length, char*
| ptr) works very, very well. We've been using it in D for 20
| years.
|
| I've proposed it for C, too:
|
| https://www.digitalmars.com/articles/C-biggest-mistake.html
| torstenvl wrote:
| > _it cannot be sliced to produce another length-prefix
| string_
|
| Come again? Of course it can. It can't be done _in place,_
| mind you, but that's a pretty bad way to do any string
| slicing, regardless of implementation, in a manual memory
| management environment. Do most programmers expect their
| slices to result in undefined behavior if they release the
| larger string they were made from? I doubt it.
| alcover wrote:
| > Come again? Of course it can.
|
| Oh come on.. I'm pretty sure Walter meant taking a _view_
| kind of slice. Obviously one can always copy part of a
| string, but that's not what _slice_ implies I think.
|
| > It can't be done in place, mind you, but that's a pretty
| bad way to do any string slicing, regardless of
| implementation, in a manual memory management environment.
|
| It's not bad. It's the best, most efficient way. O(1)-ish.
|
| > Do most programmers expect their slices to result in
| undefined behavior if they release the larger string they
| were made from? I doubt it.
|
| That's what copy-on-write is for : release of the parent is
| blocked until no views are left on it.
|
| I made a C String lib using CoW. It works well:
|
| https://github.com/alcover/buffet
| quelsolaar wrote:
| The problem with arrays is not that they decay to pointers,
| it's that they aren't pointers to begin with. This:
|
| int x[10];
|
| Should mean "put 10 ints in memory, and make x the pointer to
| it.". The thing that messes this up is sizeof. sizeof(x)
| doesn't give the size of the pointer like it should, it gives
| you the size of the array. If that was fixed (obviously it can
| without breaking everything) then things would be much better
| and consistent.
| WastingMyTime89 wrote:
| Yes, _sizeof_ as a primitive has weird behaviour when
| used in the scope of a contiguous allocation. I agree it's
| unfortunate.
|
| But "arrays" definitely are pointers. I put "arrays" in
| quote because C has nothing I would personally call an
| array. It's just contiguous memory allocation. It's to the
| point that _10[a]_ and _a[10]_ are desugared to the same
| thing.
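|
| The two behaviours discussed in this subthread, sizeof on an array
| versus a pointer and the equivalence of _a[10]_ and _10[a]_, can be
| observed with a small sketch like the following. The function
| names are hypothetical.

```c
#include <stddef.h>

/* Applied to the array object itself, sizeof covers all ten ints. */
size_t size_of_array(void)
{
    int x[10];
    return sizeof x;
}

/* Applied to a pointer parameter, sizeof sees only the pointer. */
size_t size_of_pointer(int *p)
{
    return sizeof p;
}

/* a[i] is defined as *(a + i), so i[a] names the same element. */
int same_element(void)
{
    int a[11] = {0};
    a[10] = 42;
    return 10[a] == a[10] && &10[a] == &a[10];
}
```

| On a typical 64-bit platform with 4-byte int, size_of_array()
| gives 40 while size_of_pointer() gives 8; the exact numbers are
| platform dependent.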
| masklinn wrote:
| > The trick is that the size is hidden before the address of the
| buffer. ("Learn this one simple trick that will change your life
| for ever").
|
| This has drawbacks:
|
| 1. you can't convert an existing buffer to an sds buffer
|
| 2. you can't slice into a buffer, because the metadata is part
| of the string's buffer (even if it's before the pointer)
| lifthrasiir wrote:
| I don't like SDS for multiple reasons. My biggest complaint is
| that it's a data structure disguised as a single naive pointer,
| which is actually harder to use correctly. This kind of
| "masquerading" pointer is conceptually a linear type, as you
| can't safely change its length in place and any potential
| change has to return the modified pointer somehow. No other
| type in C behaves like this, resulting in more confusion and
| thus more errors. And I have more counterpoints to those self-
| claimed advantages as well:
|
| Counterpoint #1: You can't pass SDS strings to functions that
| accept `char **` (which is a common way to return a string of
| unknown length, and often can act as an in-out parameter as
| well).
|
| Counterpoint #2: You rarely access individual "characters"
| (whatever this means). It is a conscious decision whether
| you should iterate over bytes or Unicode scalar values or code
| points or grapheme clusters, and for this reason it is better
| to make the decision explicit even though it's C `char` in the
| surface level.
|
| I have no evidence for nor against advantage #3 though.
| kevin_thibedeau wrote:
| > No other type in C behaves like this,
|
| Malloc does this.
| lifthrasiir wrote:
| Malloc is not a type. If you meant to say realloc, a good
| point and it's indeed a bad interface for the exact reason
| but still it's not a type.
| Karellen wrote:
| I believe malloc() was intended, as a number of old-
| school UNIX implementations of malloc() put the size of
| the allocation (and possibly other bookkeeping info?) "in
| front of" the pointer returned, in a similar way to how
| sds stores the size of its buffer.
| [deleted]
| realgeniushere wrote:
| Makes me think less of antirez that he doesn't acknowledge that
| this is the same design as Microsoft's BSTRs, which predate sds
| by many many years.
| gorgoiler wrote:
| Whoa. Jamming metadata in the address space before the string
| pointer is such a clever idea. I don't know enough about C to
| know how many awkward bugs this might cause, but I know enough
| about programming to spot exceptional lateral thinking when I
| see it. Very neat.
|
| I guess the SDS authors might ship a linter to spot all the
| times you mistakenly use free() instead of sdsfree()? That
| could make the cleverness more tolerable?
| arcticbull wrote:
| This is a common approach for things like malloc to use,
| since you are passing an opaque pointer to arbitrary data
| into free() which you then expect to quickly do something
| useful with. It can just walk back the pointer a little to
| find the header and act on it.
|
| It's pretty weird to see it anywhere other than malloc though
| especially masquerading as a basic type. It's incompatible
| with other common patterns like returning via (char *) and
| you can't identify which deallocator you're supposed to give
| the result to from the type alone.
| quelsolaar wrote:
| Optimizing text is hard because you seldom up front know how
| much memory will be needed and allocations are slow.
|
| Another way to do it is to use:
|
| typedef struct{ size_t allocated; size_t used; char buffer[];
| }String;
|
| This lets the header and the string be the same allocation.
| That's a huge saving. It's also useful to store the allocation
| and use size separately so that you can reuse / modify buffers.
| The used field lets you use memcpy without looking for string
| termination.
|
| You can make it even more complex by adding flags if the string
| is on the stack or on the heap. That way you can do things
| like:
|
| String buffer = MACRO_TO_CREATE_STRING_BUFFER_ON_STACK(256),
| *b; b = &buffer;
|
| b = do_processing_with_buffer(&b); // allocates on heap a
| larger buffer if needed
|
| string_free(b); // frees buffer if its on heap.
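|
| A minimal heap-only sketch of the struct above, leaving out the
| stack/heap flag. string_create and string_append are hypothetical
| helper names, the doubling growth policy is an assumption, and
| size overflow is not handled.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t allocated;   /* capacity of buffer, excluding the NUL */
    size_t used;        /* bytes currently in use, excluding the NUL */
    char buffer[];      /* flexible array member: same allocation */
} String;

/* One malloc covers both the header and the character storage. */
String *string_create(size_t capacity)
{
    String *s = malloc(sizeof *s + capacity + 1);
    if (s == NULL)
        return NULL;
    s->allocated = capacity;
    s->used = 0;
    s->buffer[0] = '\0';
    return s;
}

/* Append n bytes, growing by doubling when needed. Returns the
   (possibly moved) string, or NULL on failure with s left intact. */
String *string_append(String *s, const char *data, size_t n)
{
    if (n > s->allocated - s->used) {
        size_t needed = s->used + n;
        size_t capacity = s->allocated ? s->allocated : 16;
        while (capacity < needed)
            capacity *= 2;
        String *grown = realloc(s, sizeof *grown + capacity + 1);
        if (grown == NULL)
            return NULL;
        grown->allocated = capacity;
        s = grown;
    }
    memcpy(s->buffer + s->used, data, n);   /* no terminator search */
    s->used += n;
    s->buffer[s->used] = '\0';
    return s;
}
```

| Because the header moves together with the bytes, callers have to
| keep the pointer returned by string_append, much like with realloc.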
| jstimpfle wrote:
| In general it's hard to get more efficient than a simple
| struct String { const char *buffer; u32 size; }. Your method
| removes an indirection from the allocated storage, but you'd
| still need an external pointer to point to that struct in
| most cases. That, plus retrieving the size now costs an
| additional dereference. So I wouldn't use your method unless
| I knew that I'd have to reference the string from multiple
| locations.
|
| The best way to be efficient is often to make assumptions
| about the data. Most strings don't need any dynamic
| allocation after having been "built". So it makes a ton of
| sense to make a string builder API that returns a final
| string when it's finished. In this way, you save at least the
| "allocated" member.
|
| The advantage of the simpler string representation is that it
| works for any string (or substring) that is contiguous in
| memory, and is completely decoupled from allocation concerns.
| E.g. I can easily
|
|     #define STRING(string_literal) \
|         ((String) { string_literal "", sizeof string_literal - 1 })
|
| , to be able to statically declare such strings like this:
|
|     String my_string = STRING("Foo bar");
|
| If you have many strings that you know are small, then just
| the normal nul-terminated C string (without any size field)
| is as storage-efficient as it gets.
|
| In practice, I find string handling so easy that I rarely
| even define this struct String. I just pass around strings to
| functions as two arguments - pointer + size. It feels so
| light and data flows so easily between APIs, I love it.
| mh7 wrote:
| Re #3:
|
| A big downside is that you can't easily take ownership of
| an existing buffer and treat it as this string type.
|
| string s;
|
| char some_buf[];
|
| string_take(&s, some_buf, some_capacity);
|
| Also you would of course never dynamically allocate string
| structs, just the data member if needed.
| _448 wrote:
| > The trick is that the size is hidden before the address of the
| buffer.
|
| That is how strings used to be stored before C made the choice
| of using a null terminator. Pascal stored the string size before
| the string data. The advantage of relying on a terminator
| symbol is that the string can be any length, whereas
| storing the size at the start forces the string to not exceed
| a certain size.
| [deleted]
| anonymoushn wrote:
| In execution environments with 64-bit pointers you may have
| trouble loading a string of more than 16 exabytes into RAM
| anyway
| tored wrote:
| Another is Hollerith strings, which were used by FORTRAN and TCP
| protocols.
|
| https://en.wikipedia.org/wiki/Hollerith_constant
| mh7 wrote:
| strlen() returns a size_t so you're already constrained to a
| maximum length of SIZE_MAX.
| jcelerier wrote:
| If you use a different data structure you would maybe use a
| different API for accessing it too
| jstimpfle wrote:
| This is hilarious. SIZE_MAX is at least as large as the
| largest string that you can put in your address space /
| memory anyway. Which is what the strlen() API already
| assumes.
|
| That, plus you'd be a fool to store a huge string in this
| way _anywhere_ (in or out of memory) in any case.
| Someone wrote:
| > SIZE_MAX is at least as large as the largest string
| that you can put in your address space / memory anyway.
|
| Not necessarily. A 64-bit system could give processes an
| address space that's significantly larger than half the
| full 64-bit address space and have an allocator that
| allows you to allocate a block of more than _SIZE_MAX_
| bytes ( _malloc_ takes a _size_t_ , but you can use
| _calloc_ )
| Karellen wrote:
| size_t is unsigned, right? ssize_t is the signed version?
|
| On a quick test on my 64-bit system, a C program doing
| `printf("%zu\n", SIZE_MAX);` outputs
| 18446744073709551615, which looks like (2^64)-1 to me.
|
| Or is there a thing in the standard that says this isn't
| always the case?
| arcticbull wrote:
| ssize_t is a weird one, the only negative value it is
| guaranteed to store is -1.
|
| > The type ssize_t shall be capable of storing values at
| least in the range [-1, {SSIZE_MAX}].
|
| [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sy...
| jstimpfle wrote:
| This doesn't make sense to me. You can't "allocate" more
| than SIZE_MAX bytes by definition. If you take "allocate"
| to mean "make it available in the process's address
| space", that is.
| unwind wrote:
| Are you sure?
|
| The calloc() [1] function mentioned above takes _two_
| values of type size_t, and allocates _their product_
| bytes.
|
| I'm on mobile without (!) the C99 draft spec but at least
| the man page gives no such restriction.
|
| [1] https://linux.die.net/man/3/calloc
| mek6800d2 wrote:
| I read something about this recently, somewhere, maybe
| HN. Specifically, in calloc(), what is done and what
| should really be done if the multiplication overflows. As
| will happen, for example, if you try to calloc() two
| elements of size SIZE_MAX, when SIZE_MAX is the maximum
| representable unsigned integer value on the machine. So,
| I don't think calloc() is available or intended as a way
| to circumvent malloc()'s size restriction.
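|
| The overflow case described above can be guarded against
| explicitly before calling calloc(). A small sketch follows;
| mul_overflows and checked_array_alloc are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if count * size does not fit in a size_t, else 0. */
int mul_overflows(size_t count, size_t size)
{
    return size != 0 && count > SIZE_MAX / size;
}

/* Allocate a zeroed array, refusing requests whose total byte count
   cannot be represented. The caller frees the result. */
void *checked_array_alloc(size_t count, size_t size)
{
    if (mul_overflows(count, size))
        return NULL;
    return calloc(count, size);
}
```

| With this check, a request for two elements of SIZE_MAX bytes is
| rejected up front instead of relying on the library's behaviour.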
| ahepp wrote:
| Isn't size_t defined as being able to fit the largest
| possible data allocation?
| pjmlp wrote:
| Indeed, you just need to forget to put a terminator to
| get a nice memory dump.
| drfuchs wrote:
| Nit: Many Pascal compilers / runtimes extended the language
| in non-standard ways, including various schemes for storing
| string length in front of the string. But nothing like this
| was ever part of the ISO Pascal standard, and it was
| certainly not in the "PASCAL User Manual and Report" by
| Kathleen Jensen and Niklaus Wirth.
|
| In fact, in standard Pascal, string handling is extremely
| rudimentary; there was no way to express "this variable /
| parameter / pointer refers to a string with a length not
| known at compile-time".
| pjmlp wrote:
| They were on the ISO Extended Pascal, which hardly mattered
| because by then, UCSD Pascal and Object Pascal already had
| taken over the world of Pascal dialects, both of which had
| better ways to deal with strings.
|
| https://www.iso.org/standard/18237.html
|
| Additionally, Modula-2 was already available in 1978,
| sorting out all the issues of original Pascal, with all the
| features needed for a safe systems programming language in
| the late 70's.
| drfuchs wrote:
| In the late 70's, there were production-quality Pascal
| compilers for DEC 20 / ITS / SAIL, Vax/VMS, IBM 360/370,
| together covering much of academic computing and most of
| the ARPAnet. Even consulting Wirth, Knuth couldn't find
| suitable Modula-2 compilers available for these, so TeX
| used Pascal and not Modula-2. Near as we heard, it was
| only ever seriously used on the niche ETH workstation?
| aap_ wrote:
| Null-termination was not a C invention.
| jstimpfle wrote:
| The biggest advantage of zero-terminated to me is simplicity,
| next would be efficiency for really small strings - although
| this is a fringe concern. Strings with explicit length should
| at least have a 32-bit length field (maybe 64) IMO - for
| example, it's common to read files (and store them in
| contiguous memory) that are larger than 64K.
| abcd_f wrote:
| The length can be packed, e.g. like utf-8 does it or
| something similar. The caveat is the cost of unpacking on
| access, but the memory overhead will be minimal.
| thom wrote:
| SDS supports 64-bit lengths. It also dynamically changes
| the size of its size/flags field to accommodate growth. The
| minimum overhead is an extra char (same as null
| termination).
| lifthrasiir wrote:
| Most memory allocators have an internal fragmentation which
| removes most efficiency gained by zero-termination. In fact
| it's worse, because zero-termination means that
| deallocation can't take a size parameter and it can often
| cause a performance hit for many modern allocators due to
| cache misses [1].
|
| [1] https://isocpp.org/files/papers/n3778.html
| jbverschoor wrote:
| Well you could easily use the first 4 bits to indicate how
| many bytes the length is + 1.
|
|     c0             (0b0000) -> length is 0xc = 12
|     1341           (0b0001) -> length is 0x134 = 308
|     239a42         (0b0010) -> length is 0x239a4 = 145,828
|     deadbeaf239a47 (0b0111) -> length is 0xdeadbeaf239a4 = 3917404957718948.
|
| This gets you a 7-byte = 56bit number, minimal overhead for
| smaller strings.
|
| Maybe reserve 0x1111 for future use.
|
| Maybe the other endian makes more sense here, and maybe 0
| should mean zero-length.
|
| It's probably not very performant compared to other
| solutions, but you can just shift 4 bits, and you're done
|
| I'm curious how many strings are allocated at a particular
| point in time (across everything, kernel, os, apps, etc)
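|
| One possible reading of this scheme, sketched under explicit
| assumptions that go beyond the comment: the header is stored
| little-endian (the "other endian" variant), the low nibble of the
| first byte holds the header byte count minus one, and tag values
| above 7 are treated as reserved. lenprefix_encode and
| lenprefix_decode are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `length` into out[] (room for 8 bytes). Returns the number of
   header bytes written, or 0 if length needs more than 60 bits. */
size_t lenprefix_encode(uint64_t length, unsigned char out[8])
{
    if (length >> 60)
        return 0;

    uint64_t value = length << 4;   /* low nibble left free for the tag */
    size_t nbytes = 1;
    while (nbytes < 8 && (value >> (8 * nbytes)) != 0)
        nbytes++;

    value |= (uint64_t)(nbytes - 1);           /* tag: byte count minus one */
    for (size_t i = 0; i < nbytes; i++)
        out[i] = (unsigned char)(value >> (8 * i));   /* little-endian */
    return nbytes;
}

/* Decode a header. Returns the number of header bytes consumed,
   or 0 if the tag is one of the reserved values. */
size_t lenprefix_decode(const unsigned char *in, uint64_t *length)
{
    size_t nbytes = (size_t)(in[0] & 0x0F) + 1;
    if (nbytes > 8)
        return 0;

    uint64_t value = 0;
    for (size_t i = 0; i < nbytes; i++)
        value |= (uint64_t)in[i] << (8 * i);
    *length = value >> 4;           /* drop the tag nibble */
    return nbytes;
}
```

| Under these assumptions a length of 12 occupies the single byte
| 0xc0, and the character data would start right after the header
| bytes reported by lenprefix_decode.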
| jstimpfle wrote:
| These considerations can make sense when thinking about
| storage formats (probably you want to compress the string
| too), but they are not convenient for in-memory
| representation where you want to get the location of the
| first character with a simple member access.
| jbverschoor wrote:
| It starts at the start of the frame + the first nibble +
| 1
| [deleted]
| thaumasiotes wrote:
| > The advantage of relying on a terminator symbol is that the
| string size can be any length where as storing the size at
| the start forces the string to not exceed certain size.
|
| In the same way that since we identify unicode code points
| with a 16-bit value, it's impossible to include U+1D460 in a
| string?
|
| In the same way that since Matroska files encode the length
| of their segments, there's a hard upper limit on the length
| of a segment?
|
| Of course none of those things is actually true. Storing the
| string size has no implications for how long the string can
| be. It requires an amount of space, to store the string size,
| that is logarithmic in the length of the string, and
| completely insignificant.
| jstimpfle wrote:
| For sake of simplicity, and for efficiency with really
| small strings, with a length-prefixed string representation
| you really want to keep the string length field fixed-size.
| In general.
| thaumasiotes wrote:
| Really small strings have a fixed-size length field in
| any variable-size encoding of the length. They're small,
| so they fit into whatever the smallest possible length
| field is.
|
| What do you gain in handling short strings from an
| inability to handle long ones?
| jstimpfle wrote:
| Ok I give you this one, but I still don't think that
| minimizing the size of a length field using a flexible
| width encoding is a good idea except when talking about
| extremely specialized string encodings (like compression
| schemes).
|
| Flexible width encoding is more complicated compared to
| simple member access to get at the first character. And
| how do you handle construction of a string whose size you
| don't know yet? You might have to move the string away to
| make space for a bigger string length field. I don't like
| it.
| thaumasiotes wrote:
| > Flexible width encoding is more complicated compared to
| simple member access to get at the first character.
|
| I don't think this is true either. It's almost true. But
| what happens if the string length is 0?
|
| If you make the assumption that you can access the first
| character of a zero-length string by just grabbing
| whatever is in memory after the string header, you're
| going to make the exact mistake the length field is there
| to stop you from making, a memory access violation. You
| have to process the length field in order to do any
| access at all; many strings don't have a first character.
|
| > And how do you handle construction of a string whose
| size you don't know yet? You might have to move the
| string away to make space for a bigger string length
| field.
|
| That's true; you'll either need to be willing to store
| the character data and the length metadata in separate
| locations, or you'll need to be willing to occasionally
| move the data around.
| jstimpfle wrote:
| Obviously I mean get at the address of the first
| character, if any. You can't load before you know that
| what you load is valid. Btw. zero-terminated strings
| allow you to load unconditionally. Sometimes that's nice.
| thaumasiotes wrote:
| OK, but now the difference in how complicated it is to
| read from the string boils down to this:
| 1. Read the first chunk of the string length.
| 2. Is it more than 0?
|
| vs
|
| 1. Read the first chunk of the string length.
| 2. Did we get the whole thing?
| 3. Is the length more than 0?
|
| That extra step in the variable-length case means
| checking whether a bit is set in the value you just read.
|
| ---
|
| Also, it occurs to me that this whole discussion is
| talking about how to serialize or deserialize a string,
| when the original discussion is over how the string
| should be represented in memory.
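|
| A minimal sketch of the three-step read above, assuming a
| hypothetical LEB128-style header (7 length bits per byte, high bit
| set when another byte follows). The names read_varlen and
| first_char are made up for illustration and come from no library
| in this thread:
|
```c
#include <limits.h>
#include <stddef.h>

/* Assumed layout: length stored in 7-bit chunks, least significant
 * first; a set high bit means "another chunk follows". */
size_t read_varlen(const unsigned char *p, size_t *len_out)
{
    size_t len = 0, used = 0;
    unsigned shift = 0;
    unsigned char chunk;

    do {
        chunk = p[used++];                       /* 1. read a chunk */
        len |= (size_t)(chunk & 0x7F) << shift;
        shift += 7;
        /* 2. did we get the whole thing? (bounded so the shift stays valid) */
    } while ((chunk & 0x80) && shift < sizeof(size_t) * CHAR_BIT);

    *len_out = len;
    return used;   /* number of header bytes consumed */
}

/* Address of the first character, or NULL if the string is empty. */
const unsigned char *first_char(const unsigned char *s)
{
    size_t len;
    size_t header = read_varlen(s, &len);
    return len > 0 ? s + header : NULL;          /* 3. is the length > 0? */
}
```
|
| The extra step is the loop condition: one bit test per header
| byte. For lengths below 128 the loop body runs exactly once.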
| lelanthran wrote:
| > In the same way that since we identify unicode code
| points with a 16-bit value
|
| Not a single 16-bit value. Some codepoints are two 16-bit
| values.
| kevin_thibedeau wrote:
| Codepoints are 21-bit values. They may have a more
| compact encoding but the unencoded form is fixed.
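|
| To make both statements concrete, here is a small sketch that
| encodes one code point as UTF-16. The function name utf16_encode
| is an assumption for illustration, not an API from the library
| under discussion:
|
```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16. Returns the number of
 * 16-bit units written (1 or 2), or 0 if cp is not a valid scalar
 * value (out of range, or inside the surrogate block). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```
|
| The upper bound 0x10FFFF is why a code point needs 21 bits, while
| anything at or above 0x10000 needs two 16-bit units in UTF-16.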
| jlokier wrote:
| On point 3, you can achieve the same cache locality, without
| losing the ability to take slices or append, by having the
| string object contain a pointer to the string bytes, and
| allocating the bytes _by default_ immediately after the string
| object.
|
| It is still single allocation, so the allocation is just as
| fast.
|
| The pointer is in the same cache line as the string bytes in
| all strings except for slices (and any other fancy indirect
| string types). Even though the code fetches indirectly via that
| pointer, the CPU will be able to fetch the initial string byte
| efficiently as soon as it has the pointer.
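|
| One way to read that layout, as a rough sketch only: the type
| pstring and both functions are hypothetical names, and the slice
| here simply borrows the parent's bytes without any ownership
| tracking.
|
```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical string object: bytes normally points just past the
 * header, inside the same allocation, but may point elsewhere. */
typedef struct {
    char  *bytes;
    size_t size;
} pstring;

/* One malloc holds the header, the characters and a NUL. */
pstring *pstring_new(const char *src, size_t size)
{
    pstring *s = malloc(sizeof *s + size + 1);
    if (s == NULL)
        return NULL;
    s->bytes = (char *)(s + 1);      /* storage right after the header */
    memcpy(s->bytes, src, size);
    s->bytes[size] = '\0';
    s->size = size;
    return s;
}

/* A slice allocates only a header and points into the parent's bytes.
 * It is not NUL-terminated and must not outlive the parent. */
pstring *pstring_slice(const pstring *parent, size_t start, size_t len)
{
    if (start > parent->size || len > parent->size - start)
        return NULL;
    pstring *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->bytes = parent->bytes + start;
    s->size = len;
    return s;
}
```
|
| Callers always go through bytes, so the same code path serves both
| the co-located case and the slice case.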
| estebank wrote:
| How would this colocation of the string pointer work? Because
| these would be in the heap, right? Otherwise the pointer
| would get invalidated as soon as the enclosing function ends
| and its stack frame gets discarded. So if it is in the heap
| then you either have a pointer to the colocated pointer (not
| very useful, if negligible performance impact) or you're
| copying the colocated pointer (at which point you're back to
| square one, having a pointer in the stack and the underlying
| string in the heap). Am I missing something?
| alcover wrote:
| Good point. Some SSO (small string optimization) schemes
| achieve this by pointing back into the struct itself; GCC's
| std::string implementation, for example.
| program wrote:
| It's better not to use types that end with a '_t' because the
| suffix is reserved in POSIX systems.
|
| https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh...
| cassepipe wrote:
| On the other hand it is quite handy as a prefix, s_ for
| structs, e_ for enums, g_ for globals, t_ for simple typedefs,
| f_ for function pointers typedefs, u_ for unions... sky is the
| limit !
|
| And it's quite easy to create a highlighting rule for it in
| vim if you have not yet converted to treesitter. Just put in
| ~/.vim/after/c.vim :
|
| ```
|
| syn match cType /\<\(t\|s\|e\|u\)_\w\+\>/
|
| ```
|
| Boom, custom type highlighting for C ! Pick the letters you
| will use.
| lifthrasiir wrote:
| No wonder why a significant portion of C programmers actually
| want to keep tags (`struct` in `struct foo`) instead of
| removing them with typedef.
| kazinator wrote:
| I don't agree with that at all; it is "the sky might fall"
| reasoning.
|
| Just
|
| * have sane naming in your program.
|
| * respect namespaces like _[A-Z] and __
|
| * solve clashes that actually happen
|
| Historically, revisions of POSIX have introduced identifiers
| that were not in any previously announced namespace. There is
| no way you can name an identifier that is guaranteed not to
| clash with POSIX, or any other vendor. For instance the name
| "openat" was fine to use in a POSIX program once upon a time.
|
| Consider that all strings have the empty string as a suffix.
| The string "abc" has four suffixes: "abc", "bc", "c" and "".
|
| So, every current and future POSIX identifier has "" as a
| suffix. This is not just a threat; it is guaranteed! Since
| every identifier in your program also has a "" suffix, it
| clashes with that namespace.
|
| What's wrong with the argument is that identifiers don't just
| have a suffix; they have to be identical in order to actually
| clash. (Or have identical prefixes, due to truncation of
| external names in a linker: decades ago, the limits were
| ridiculously small.)
|
| I doubt that even one person in POSIX standardization would be
| dumb enough to approve str_t being added as a typedef name in
| some existing or new header, and multiple approvals are
| required.
|
| Nobody should be losing any sleep over using _t typedef names
| in their C code.
| Kwpolska wrote:
| The argument with "" as suffix sounds quite absurd.
|
| Why do you believe POSIX would never approve a str_t type?
| Nobody likes raw char arrays, perhaps a future revision of
| POSIX may decide to make the lives of C programmers easier
| and implement their own sane string type.
| jstimpfle wrote:
| I for one like "raw char arrays", and really don't care
| about missing string functionality in C. I basically use
| sizeof, snprintf, memcpy and am just fine. I've toyed with
| defining struct String{ptr,size} sometimes but largely it
| just gets in the way.
|
| If you think it's necessary, it's very easy to make an
| argument that you'd have to have a generic type for slices
| of any type. (Actually, more so than strings, since C is
| just not a language for domains with a focus on strings).
|
| Now, whether you think a language must have a generic slice
| type or not, C is simply not the language where you can fit
| that in.
| [deleted]
| kazinator wrote:
| Yes, the argument is absurd; it's supposed to be.
|
| Now extending the suffix to "_t" doesn't make it much less
| absurd. Not qualitatively, just a bit quantitatively less
| absurd.
|
| Why I suspect POSIX isn't about to add a str_t is that
| str_t is likely to occur in countless numbers of unknown
| existing code bases.
|
| And _that_ might be a good reason for avoiding it in a
| library API, not the _t namespace being reserved.
|
| We can have this variant of the argument: most identifiers
| end in a lower-case letter, so they land into any one of 26
| namespaces: the *a namespace, the *b namespace, ... future
| POSIX identifiers have to be in one of these 26, except
| those that end in digits or underscores. POSIX does _not_
| say "future versions of this standard shall not claim new
| function or other identifiers ending in e". That doesn't
| mean you stay away from identifiers ending in "e", right?
|
| I wouldn't avoid str_t in the internals of a program
| though. In the worst case, a clash happens somewhere and we
| do some renaming; life goes on.
|
| POSIX's reservation doesn't really mean much; all they are
| saying is "we have some type names ending in _t, and will
| likely have more, so watch out". Yes, POSIX will likely
| have such names, and so will every C programmer and his
| dog. Whoppee dee. POSIX will likely have new names ending
| in 'e' also, and so on.
| eps wrote:
| Yeah, that's _the_ lowest-hanging C pedantry nitpick.
|
| Usually if there's nothing else meaningful that one can say
| about someone else's project, they will comment on the _t
| naming... and as anyone with an iota of real-world experience
| would know it's a complete non-issue outside of a handful top-
| tier open source projects.
|
| Don't be that guy. Save this comment for when it may actually
| be relevant.
| mh7 wrote:
| It's reserved in the same sense that google's style guide
| 'reserves' struct names starting with a capital letter.
|
| Only ISO C can officially reserve names, everyone else just has
| their personal code/naming style that you can choose to follow
| or not.
| lifthrasiir wrote:
| I would argue in the other way: the C standard should have a
| standard string type named `str_t`, and this library is one way
| to prototype it ;-)
| e-dant wrote:
| I guess it's nice for a C string API, but what's the motivation
| to use and create this? Wouldn't externing some C++ symbols (or
| Rust) work more smoothly?
| lelanthran wrote:
| > Wouldn't externing some C++ symbols (or Rust) work more
| smoothly?
|
| For the C++ case, it's not that easy due to C code that cannot
| handle exceptions thrown in C++ code.
|
| For the Rust bit, I'm not sure - creating the library in Rust
| and letting it be called from C makes the whole rust library
| unsafe because the data returned from the Rust API would lose
| ownership information, and is no more safe than simply writing
| it in C.
| kazinator wrote:
| > _Attempting to split a string using non-existent delimiter with
| str_pop_first_split() [returns an invalid string with .data ==
| NULL]._
|
| But that seems like a valid case: e.g. these are comma-delimited
| lists of numbers:
|
|     ""       // empty
|     "1"      // one number
|     "20,30"  // two numbers
|
| The above remark in the documentation seems to be saying (perhaps
| falsely) that if we try to extract a token from the "1" string
| using "," as a delimiter, we get an invalid str_t rather than
| "1".
|
| I don't see coverage for this in the tests. There is a test which
| uses "123/456/789", which extracts the first two splits, and then
| just verifies that "789" remains. What the programmer wants is to
| be able to write a loop which will extract "123", "456" and "789"
| and _then_ hit the terminating case where the invalid str_t is
| returned.
|
| How many items are in "1,2,3," viewed as comma-separated: three
| or four?
|
| It would also be a code improvement to replace umpteen
| repetitions of "(str_t){.data = NULL, .size = 0}" throughout the
| code with a macro.
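|
| A sketch of the loop-friendly behaviour described above. This is
| NOT the library's str_pop_first_split(); view_t, VIEW_INVALID,
| view_pop_first and view_count_tokens are hypothetical names, and
| the convention chosen (the last token is returned, then the
| source becomes invalid) is an assumption:
|
```c
#include <stddef.h>
#include <string.h>

typedef struct { const char *data; size_t size; } view_t;

/* One named constant instead of repeating the compound literal. */
#define VIEW_INVALID ((view_t){ .data = NULL, .size = 0 })

/* Removes and returns the text before the first delim in *src.
 * If no delimiter is left, the whole remainder is returned and *src
 * becomes invalid, so the following call returns an invalid view. */
view_t view_pop_first(view_t *src, char delim)
{
    if (src->data == NULL)
        return VIEW_INVALID;            /* nothing left */

    view_t token = *src;
    const char *hit = memchr(src->data, delim, src->size);
    if (hit != NULL) {
        token.size = (size_t)(hit - src->data);
        src->data  = hit + 1;           /* skip the delimiter */
        src->size -= token.size + 1;
    } else {
        *src = VIEW_INVALID;            /* remainder was the last token */
    }
    return token;
}

/* Counts tokens by looping until the invalid view comes back. */
size_t view_count_tokens(view_t src, char delim)
{
    size_t n = 0;
    while (view_pop_first(&src, delim).data != NULL)
        n++;
    return n;
}
```
|
| Under this convention "123/456/789" yields three tokens and then
| the invalid view, "1" yields one token, and "1,2,3," yields four
| (the last one empty). An empty but valid input yields one empty
| token, which a caller may or may not want.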
| kaba0 wrote:
| All in all, C is still not expressive enough for even such a
| basic data structure as strings.
| Diggsey wrote:
| Slightly ironic that Rust is criticized for having multiple
| string types, and yet the solution to simplify string handling in
| C is to introduce the exact same types (str_t == &str, strbuf_t
| == String) albeit without the safety guarantees.
| estebank wrote:
| It is still frustrating to me that C still doesn't have a non-
| allocating method to handle substring references, which both
| C++ and Rust have. On the other hand I see people trying to
| parse files, like JSON, in a non-allocating way in Rust and hit
| a wall until they realize that nodes need to be escaped for
| anything useful, which requires owning the node's memory
| (meaning, you need a String or at least Cow<'_, str>, can't get
| away with a &str).
| adamdusty wrote:
| I don't think anyone minds that Rust has multiple string types,
| just that they're effectively named the same thing, so people
| new to Rust have no clue which does what without looking it up.
| Furthermore, people without C/C++ experience mostly won't even
| know there is a difference, since most languages don't give you
| that control over strings.
|
| If Rust's string types were str and strvec or strbuf, no one
| would care.
| andrewmcwatters wrote:
| I want C strings that are compatible with string.h.
|
| I want some struct that is a pointer to the char array `s' with
| size_t `n'.
|
| To meaningfully do this, it means you need auxiliary functions
| that you execute after calling string.h functions, or you write
| wrappers that do this for you after calling the relevant string.h
| functions.
|
| I'm OK with that.
|
| SDS doesn't do this. Most other C string libraries like this one
| basically do what I'm asking for, but not quite.
|
| I don't want separate structs for reading and writing strings. I
| just want authors to keep it as simple as possible without
| diverging too hard from how C strings already work today.
| kevin_thibedeau wrote:
| I have a personal lib that works like this. It maintains a
| simple struct with a start pointer and a one-past-the-end
| pointer. You can use it to construct a view or point into
| unused space at the end of a string for building ops. NUL
| termination is preserved so interop with stdlib is always
| available.
|
| This allows for nicer string handling while always allowing
| interop with anything expecting a char *. Libraries with their
| own string implementation always exact a penalty to get a cstr
| out.
| jandrese wrote:
| This is kind of a bikeshed argument, but I'd prefer if the view
| was labeled as such. So instead of str_t it would be strview.
| Rust makes this same mistake IMHO and it causes a lot of
| confusion for beginners. I would personally call the strbuf_t
| strstore but that's even more nitpicky.
|
| Naming things is one of the hardest problems in CS.
| zajio1am wrote:
| 1. Ditching null termination makes it cumbersome for
| interoperability with C ecosystem.
|
| 2. It has terrible overhead.
| LegionMammal978 wrote:
| > All strbuf functions maintain a null terminator at the end of
| the buffer, and the buffer may be accessed as a regular c
| string using mybuffer->cstr.
|
| So effectively a str_t works like an std::string_view from C++,
| and strbuf_t works like an inline std::string.
|
| To produce a null-terminated string from a section of a longer
| string requires an allocation, unless you can temporarily
| modify the original string to replace one of its characters
| with a terminator.
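|
| The allocating route can be sketched as follows. The type strview
| and the function strview_to_cstr are illustrative names, not part
| of the library being discussed:
|
```c
#include <stdlib.h>
#include <string.h>

/* A non-owning (pointer, length) view; not NUL-terminated. */
typedef struct { const char *data; size_t size; } strview;

/* Returns a freshly allocated NUL-terminated copy of the view.
 * The caller frees it. Returns NULL for an invalid view or when
 * the allocation fails. */
char *strview_to_cstr(strview v)
{
    if (v.data == NULL)
        return NULL;
    char *out = malloc(v.size + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, v.data, v.size);
    out[v.size] = '\0';
    return out;
}
```
|
| The copy is the price of handing a section of a longer string to
| any function that expects a plain char *.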
| zajio1am wrote:
| Well, the documentation says that null terminator is
| maintained at the end of the buffer (i.e.
| mybuffer->cstr[mybuffer->capacity - 1]), not at the end of
| the string stored in the buffer (i.e.
| mybuffer->cstr[mybuffer->size]).
| LegionMammal978 wrote:
| Not sure where you're getting that interpretation from. If
| you look at the actual code, it sets buf->cstr[buf->size] =
| 0 every time the string is resized. After all, what else
| could "the buffer may be accessed as a regular c string"
| possibly mean?
| zajio1am wrote:
| > Not sure where you're getting that interpretation from.
|
| That is just plain reading of "null terminator at the end
| of the buffer", as 'buffer' is just place in memory,
| regardless of what is stored in it. 'End of the buffer'
| is commonly used for end of such reserved memory, not end
| of valid data in that memory.
|
| But maintaining the null-terminated string in the buffer
| is much more useful behavior than just maintaining null
| terminator at the end of the buffer, so it is likely just
| sloppiness in the documentation.
| KingLancelot wrote:
| thesz wrote:
| Having a string type that has an "invalid string" value which is
| different from the empty string value is bliss.
|
| What is important there is that the invalid string value is
| completely compatible with most C functions: although the actual
| data pointer is NULL, the length of the data is zero, so memcmp,
| memmove/memcpy and most other functions will not segfault.
|
| This is a really well-thought-out approach.
|
| Thank you!
| gkfasdfasdf wrote:
| Hoping someone can educate me, what are the advantages of having
| the last member of strbuf_t be a variable length array (char
| cstr[]) instead of just a char*?
| ksherlock wrote:
| With inline data, only one malloc is needed for the buffer
| housekeeping and character data. It's also probably slightly
| better for cache performance since the housekeeping data and
| string data are together.
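|
| A minimal sketch of that single-allocation layout using a C99
| flexible array member. The type fbuf and its functions are
| hypothetical and only loosely modeled on the strbuf_t description
| in this thread:
|
```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t capacity;  /* usable characters, excluding the NUL */
    size_t size;      /* characters currently stored */
    char   cstr[];    /* flexible array member: storage follows header */
} fbuf;

/* Header and character storage come from a single malloc. */
fbuf *fbuf_create(size_t capacity)
{
    fbuf *b = malloc(sizeof *b + capacity + 1);
    if (b == NULL)
        return NULL;
    b->capacity = capacity;
    b->size = 0;
    b->cstr[0] = '\0';
    return b;
}

/* Appends as much of src as fits and keeps cstr[size] == '\0'.
 * Returns the number of characters appended. */
size_t fbuf_append(fbuf *b, const char *src)
{
    size_t room = b->capacity - b->size;
    size_t n = strlen(src);
    if (n > room)
        n = room;                      /* truncate instead of growing */
    memcpy(b->cstr + b->size, src, n);
    b->size += n;
    b->cstr[b->size] = '\0';
    return n;
}
```
|
| With a char * member instead, the header and the characters would
| normally need two allocations and two frees.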
| gkfasdfasdf wrote:
| Ah I see. If you want to refer to a string that was not part
| of this allocation you would use the other str_t type.
| ComputerGuru wrote:
| You can store the string as part of the same heap/stack
| allocation rather than as a separate allocation.
| naasking wrote:
| Why not ropes?
|
| https://github.com/josephg/librope
| musicale wrote:
| > str.h defines the following str_t type:
|
|     typedef struct str_t {
|         const char* data;
|         size_t size;
|     } str_t;
|
| Sort of a hybrid of C style (pointer) and Pascal style (bounded
| array) strings?
| pjmlp wrote:
| This is the kind of string library that WG14 should care about.
|
| Kudos for having a go at it.
| habibur wrote:
| I use a lib like this, but with a few changes.
|
|     printf("The string is %"PRIstr"\n", PRIstrarg(mystring));
|
| Simpler:
|
|     printf("the string is : %.*s", mystr.size, mystr.data);
|
| But that's tedious to write. So create a small macro
|
|     #define ls(x) (x).size,(x).data
|
| And then printf becomes as simple as:
|
|     printf("the string is : %.*s", ls(mystr));
|
| Though OP's macro is possibly doing more.
| masklinn wrote:
| > But that's tedious to write. So create a small macro
|
| > #define ls(x) (x).size,(x).data
|
| Doesn't that double-evaluate `x`?
| habibur wrote:
| It does.
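|
| A small sketch of the double evaluation, plus a function that
| avoids it. All names here (view_t, VIEW_ARG, view_counted,
| view_format) are hypothetical. The cast is there because the
| precision consumed by "%.*s" must be an int, not a size_t:
|
```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

typedef struct { const char *data; size_t size; } view_t;

/* Expands to the two arguments "%.*s" expects; x appears twice. */
#define VIEW_ARG(x) (int)(x).size, (x).data

/* Helper with a visible side effect, to count evaluations. */
int view_eval_count;
view_t view_counted(view_t v)
{
    view_eval_count++;
    return v;
}

/* Function alternative: its argument is evaluated exactly once. */
int view_format(char *dst, size_t cap, view_t v)
{
    int precision = v.size > INT_MAX ? INT_MAX : (int)v.size;
    return snprintf(dst, cap, "[%.*s]", precision,
                    v.data != NULL ? v.data : "");
}
```
|
| Passing view_counted(v) through VIEW_ARG bumps the counter twice,
| which is harmless for a plain variable but matters as soon as the
| argument has side effects or is expensive to compute.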
| gjvc wrote:
| worth comparing to https://cr.yp.to/lib/stralloc.html
| plan999 wrote:
| You should look at an even better string library: many more
| functions and more safety for split/join/tokenize/etc. It's a
| fork of the plan9 string library bstring.
|
| https://bstring.sourceforge.net/
| [deleted]
___________________________________________________________________
(page generated 2022-12-03 23:01 UTC)