[HN Gopher] C Posix-compliant argument parsing in 42 LoC, inspir...
       ___________________________________________________________________
        
       C Posix-compliant argument parsing in 42 LoC, inspired by Duff's
       device
        
       Author : camel-cdr
       Score  : 70 points
       Date   : 2023-01-04 11:41 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | camel-cdr wrote:
       | This uses switch abuse to handle short and long arguments in one
       | place:
       | 
       |     ...
       |     else if (ARG_LONG("option")) case 'o': {
       |     }
       |     ...
       | 
       | The idea was to leave all the decisions to the library user, so
       | there are no automatic help pages and error messages. The library
       | essentially just gives you a new language primitive to work with.
       | 
       | Example usage (also included in the file):
       | 
       |     // assumes argv and argc exist
       |     ARG_BEGIN {
       |         if (0) {
       |             case 'a': a = 1; ARG_FLAG(); break;
       |             case 'b': b = 1; ARG_FLAG(); break;
       |             case 'c': c = 1; ARG_FLAG(); break;
       |             case '\0': readstdin = 1; break;
       |         } else if (ARG_LONG("reverse")) case 'r': {
       |             reverse = 1;
       |             ARG_FLAG();
       |         } else if (ARG_LONG("input")) case 'i': {
       |             input = ARG_VAL();
       |         } else if (ARG_LONG("output")) case 'o': {
       |             output = ARG_VAL();
       |         } else if (ARG_LONG("help")) case 'h': case '?': {
       |             printf("Usage: %s [OPTION...] [STRING...]\n", argv0);
       |             puts("Example usage of arg.h\n");
       |             puts("Options:");
       |             puts("  -a,                set a to true");
       |             puts("  -b,                set a to true");
       |             puts("  -c,                set a to true");
       |             puts("  -r, --reverse      set reverse to true");
       |             puts("  -i, --input=STR    set input string to STR");
       |             puts("  -o, --output=STR   set output string to STR");
       |             puts("  -h, --help         display this help and exit");
       |             return EXIT_SUCCESS;
       |         } else { default:
       |             fprintf(stderr,
       |                     "%s: invalid option '%s'\n"
       |                     "Try '%s --help' for more information.\n",
       |                     argv0, *argv, argv0);
       |             return EXIT_FAILURE;
       |         }
       |     } ARG_END;
       |     // argv and argc now hold the non-option arguments
        
       | planede wrote:
       | I appreciate the macromancy, but it's not very readable. I'm sure
       | it's very efficient though.
        
         | joosters wrote:
         | Perfect for the all-too-common situation where your performance
         | bottleneck is in command line argument parsing...
        
       | zokier wrote:
       | What does posix compliant mean in this context?
        
         | klyrs wrote:
         | It (edit: claims to) conforms to the POSIX spec for command
         | line arguments
         | 
         | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
        
           | planede wrote:
           | That document does not describe long options. I guess that
           | can be treated as a conforming extension.
           | 
           | The C implementation does not seem to have support for
           | optional option arguments.
           | 
           | The C implementation also does not seem to support "--" as a
           | delimiter for option arguments, although that's only a
           | guideline.
           | 
           | So I guess it can be used to implement POSIX compliant
           | interfaces that don't use optional option arguments.
           | 
           | edit: supporting "--" as delimiter might not be a guideline:
           | 
           |  _The utilities in the Shell and Utilities volume of
           | POSIX.1-2017 that claim conformance to these guidelines shall
           | conform completely to these guidelines as if these guidelines
           | contained the term "shall" instead of "should"._
           | 
           | edit2: Seems like I was completely wrong about supporting
           | "--"; when I tried it out, it does seem to support it. One
           | weird corner case is the call `<prog> -- -`, where "-" isn't
           | interpreted as enabling stdin, but treated as a regular
           | positional argument. `echo asd | cat -- -` reads from stdin
           | with GNU cat.
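
The "--" and "-" rules discussed above can be sketched with a small hand-written parser. This is not the code from arg.h; the names `parse_flags` and `struct parsed` are hypothetical, and the sketch assumes only two flags, `-a` and `-b`. It follows the convention that "--" ends option parsing and that a lone "-" is always an operand, which the program may then choose to read as stdin, as GNU cat does.

```c
#include <string.h>

/* Hypothetical result type for this sketch. */
struct parsed {
    int a;              /* set when -a was seen */
    int b;              /* set when -b was seen */
    int first_operand;  /* index of the first non-option argument */
};

/* Parses clustered short flags (-a, -b, -ab). Returns 0 on success and
 * -1 on an unknown flag. Option parsing stops at "--" (which is
 * consumed), at a lone "-" (which is kept as an operand), or at the
 * first argument that does not start with '-'. */
static int parse_flags(int argc, char **argv, struct parsed *out)
{
    int i = 1;

    while (i < argc) {
        const char *arg = argv[i];
        const char *p;

        if (arg[0] != '-' || arg[1] == '\0')
            break;              /* operand, including a lone "-" */
        if (strcmp(arg, "--") == 0) {
            i++;                /* skip the delimiter itself */
            break;
        }
        for (p = arg + 1; *p != '\0'; p++) {
            switch (*p) {
            case 'a': out->a = 1; break;
            case 'b': out->b = 1; break;
            default:  return -1;    /* unknown flag */
            }
        }
        i++;
    }
    out->first_operand = i;
    return 0;
}
```

With this shape, `prog -ab -- - -x` leaves "-" and "-x" as operands, so the caller can treat the operand "-" as stdin whether or not it came after "--".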
        
             | mtlmtlmtlmtl wrote:
             | "--" is a GNU originated thing afaik.
        
               | planede wrote:
               | Maybe, but it's part of POSIX now. Anyway, I misread the
               | code, and it's actually supported.
        
             | camel-cdr wrote:
             | > So I guess it can be used to implement POSIX compliant
             | interfaces that don't use optional option arguments.
             | 
             | Yes, that was my intention.
        
             | klyrs wrote:
             | On a closer read, the comment mentions plan9's arg(3):
             | 
             | https://9fans.github.io/plan9port/man/man3/arg.html
             | 
             | Generally, I think it's pretty normal to interpret "POSIX
             | compliant" as a minimum requirement, and then extend that
             | to your own whims as long as you don't break the POSIX
             | part.
        
       | cperciva wrote:
       | Another option with similar levels of macro (ab)use is my "magic
       | getopt" which lets you write code like this:
       | 
       |     const char * ch;
       |     while ((ch = GETOPT(argc, argv)) != NULL) {
       |         GETOPT_SWITCH(ch) {
       |         GETOPT_OPT("-a"):
       |             aflag = 1;
       |             break;
       |         GETOPT_OPTARG("--bar"):
       |             printf("bar: %s\n", optarg);
       |             break;
       |         GETOPT_OPTARG("-f"):
       |             printf("foo: %s\n", optarg);
       |             break;
       |         GETOPT_MISSING_ARG:
       |             printf("missing argument to %s\n", ch);
       |             /* FALLTHROUGH */
       |         GETOPT_DEFAULT:
       |             usage();
       |         }
       |     }
        
         | ufo wrote:
         | I'm intrigued. Is there somewhere where we could find the full
         | definition of these macros?
        
           | cperciva wrote:
           | https://github.com/Tarsnap/libcperciva/blob/master/util/geto...
           | 
           | https://github.com/Tarsnap/libcperciva/blob/master/util/geto...
           | 
           | BSD licensed, of course.
        
         | ZephyrP wrote:
         | Very neat. Do you implement option parsing in "magic getopt" or
         | can you (somehow?) handle setting up the option string
         | arguments used by the more familiar variants in getopt(3)?
        
           | cperciva wrote:
           | All the option parsing is done in functions called by the
           | macros. The "getopt string" isn't needed since the
           | GETOPT_OPT() and GETOPT_OPTARG() "case statements" convey the
           | same information (the set of valid options).
        
         | chungy wrote:
         | switch options falling through doesn't seem like magic to me.
         | Rather, a common practice.
        
           | cperciva wrote:
           | That's not the magic part...
        
       | asveikau wrote:
       | This is off topic, but your stretchy buffer does one of my pet
       | peeves:
       | 
       |     (a)->at = realloc((a)->at, (a)->_cap * sizeof *(a)->at))
       | 
       | When realloc fails, the old value of (a)->at will be leaked.
       | 
       | And memory allocation failure also leads to null pointer
       | dereference in this code.
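
A minimal sketch of the usual fix for the leak described above: assign the result of realloc to a temporary and overwrite the stored pointer only on success. The type `struct int_buf` and the function `int_buf_reserve` are hypothetical stand-ins; only the field names `at` and `_cap` come from the quoted line.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the stretchy buffer. */
struct int_buf {
    int *at;        /* element storage, or NULL when empty */
    size_t _cap;    /* number of elements the storage can hold */
};

/* Grows the buffer to hold at least new_cap elements. Returns 0 on
 * success and -1 on failure. On failure b->at and b->_cap are left
 * untouched, so the old block is neither leaked nor replaced by NULL. */
static int int_buf_reserve(struct int_buf *b, size_t new_cap)
{
    int *tmp;

    if (new_cap <= b->_cap)
        return 0;                   /* already large enough */
    if (new_cap > (size_t)-1 / sizeof *b->at)
        return -1;                  /* byte count would overflow size_t */

    tmp = realloc(b->at, new_cap * sizeof *b->at);
    if (tmp == NULL)
        return -1;                  /* old block is still valid */

    b->at = tmp;
    b->_cap = new_cap;
    return 0;
}
```

The caller checks the return value before touching `b->at`, which also avoids the null pointer dereference mentioned above.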
        
         | camel-cdr wrote:
         | Ah, my way of dealing with memory allocation failure is to
         | write xmalloc/xcalloc/xrealloc wrappers that exit on error. So
         | the idea would be that the library users can just:
         | 
         |     #undef malloc
         |     #define malloc xmalloc
         |     ...
         | 
         | I suppose I could make this the default behavior, and make it
         | overwritable.
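
A minimal sketch of what exit-on-failure wrappers with an overridable default could look like. The hook names `BUF_MALLOC` and `BUF_REALLOC` are hypothetical and not taken from the library; the sketch uses library-specific macros instead of redefining `malloc` itself, since defining a standard library name as a macro is not guaranteed to be portable.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrappers: they never return NULL for a non-zero size,
 * they print a message and exit instead. */
static void *xmalloc(size_t size)
{
    void *p = malloc(size);

    if (p == NULL && size != 0) {
        fputs("fatal: out of memory\n", stderr);
        exit(EXIT_FAILURE);
    }
    return p;
}

static void *xrealloc(void *old, size_t size)
{
    void *p = realloc(old, size);

    if (p == NULL && size != 0) {
        fputs("fatal: out of memory\n", stderr);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical override hooks: the exiting wrappers are the default,
 * and a user can define these macros before this point to plug in a
 * different allocator. */
#ifndef BUF_MALLOC
#define BUF_MALLOC xmalloc
#endif
#ifndef BUF_REALLOC
#define BUF_REALLOC xrealloc
#endif
```

Library code would then call `BUF_REALLOC(ptr, size)` everywhere and could skip NULL checks as long as the default wrappers are in use.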
        
           | mtlmtlmtlmtl wrote:
           | Doesn't this break things like Valgrind?
        
             | asveikau wrote:
             | Firstly, an xmalloc will call malloc and add error checks.
             | So malloc is still called.
             | 
             | But more fundamentally, Valgrind hooks into the allocator
             | (and the program at large) much more deeply. You could
             | write your own allocator (many people deploy something
             | other than libc's) and it would still figure it out.
        
       | xfennec wrote:
       | How can this code enter the "if (0)" part?
        
         | mtlmtlmtlmtl wrote:
         | ARG_BEGIN opens a switch statement. So it can enter any of the
         | cases, or it'll jump to the else if part.
        
           | kreetx wrote:
           | Is there any reason this is not written as `switch(..) {
           | <case statements>; default: <else statements> }`? I.e., is
           | there a reason for `if(0)`?
        
             | jsmith45 wrote:
             | The whole `if` portion is for if the second character of
             | the argument (after the initial `-`) is a second `-` [0].
             | 
             | In that case, the switch jumped to `case '-':` hidden
             | within the macro, and the if statement is about processing
             | long arguments (like `--help`). Arguments only available as
             | a short flag should never get executed as part of a long
             | flag, so putting them inside the if(0) case is an easy
             | option.
             | 
             | An alternative without if(0) of any variety would be:
             | 
             |     ARG_BEGIN {
             |         if (ARG_LONG("reverse")) case 'r': {
             |             reverse = 1;
             |             ARG_FLAG();
             |         } else if (ARG_LONG("input")) case 'i': {
             |             input = ARG_VAL();
             |         } else if (ARG_LONG("output")) case 'o': {
             |             output = ARG_VAL();
             |         } else if (ARG_LONG("help")) case 'h': case '?': {
             |             printf("Usage: %s [OPTION...] [STRING...]\n", argv0);
             |             puts("Example usage of arg.h\n");
             |             puts("Options:");
             |             puts("  -a,                set a to true");
             |             puts("  -b,                set a to true");
             |             puts("  -c,                set a to true");
             |             puts("  -r, --reverse      set reverse to true");
             |             puts("  -i, --input=STR    set input string to STR");
             |             puts("  -o, --output=STR   set output string to STR");
             |             puts("  -h, --help         display this help and exit");
             |             return EXIT_SUCCESS;
             |         } else { default:
             |             fprintf(stderr,
             |                     "%s: invalid option '%s'\n"
             |                     "Try '%s --help' for more information.\n",
             |                     argv0, *argv, argv0);
             |             return EXIT_FAILURE;
             |         }
             |         break;
             |         case 'a': a = 1; ARG_FLAG(); break;
             |         case 'b': b = 1; ARG_FLAG(); break;
             |         case 'c': c = 1; ARG_FLAG(); break;
             |         case '\0': readstdin = 1; break;
             |     } ARG_END;
             | 
             | The downside is that those few short flag only arguments no
             | longer line up nicely with the others, and they come after
             | the default, which could be a little bit confusing.
             | 
             | Footnote: [0] Except if that second dash is the last
             | character of the argument, in which case -- is the end of
             | flags marker, meaning any further arguments that begin with
             | `-` are just funky positional parameters.
        
         | tempodox wrote:
         | The `switch` statement in `ARG_BEGIN`.
        
         | cesarb wrote:
         | The way "switch" works in C is very weird. It behaves more like
         | a "goto", where each "case ...:" is a label to which it can
         | jump.
         | 
         | So when the character is '-', it starts just before the "if
         | (0)" (this part is hidden within the ARG_BEGIN macro), and as
         | you noted, it will never enter that block. However, when the
         | character is 'a', 'b', 'c', or nothing (the end-of-string
         | marker), it will jump directly to the corresponding "case ...:"
         | label, even though it's within that "unreachable" block.
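
The jump described above can be reproduced in a few lines without any macros. This is a sketch, not the expansion of ARG_BEGIN; the function name `classify` and the returned strings are hypothetical.

```c
/* Hypothetical helper: classifies one character the way the example
 * does, with case labels placed inside an "if (0)" body. */
static const char *classify(char c)
{
    const char *result = "unset";

    switch (c) {
    case '-':
        /* Entered from the top, so the condition is evaluated, is
         * false, and control goes to the else branch. */
        if (0) {
    case 'a':   /* the switch jumps here directly; the condition is
                 * never tested on this path */
            result = "short flag a";
            break;
    case 'b':
            result = "short flag b";
            break;
        } else {
            result = "long option";
        }
        break;
    default:
        result = "unknown";
        break;
    }
    return result;
}
```

Calling `classify('-')` takes the else branch, while `classify('a')` and `classify('b')` land inside the block that looks unreachable.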
        
         | pcwalton wrote:
         | A switch can jump inside an if body that contains case labels,
         | causing the condition not to be evaluated.
         | 
         | C switch/case semantics are... not the most obvious.
        
         | mark_undoio wrote:
         | Because there's really a switch statement (hidden behind a
         | macro) that will jump to labels within that block.
         | 
         | The fact that the if condition is false means it won't just run
         | the whole block straight through but you can still jump to a
         | label in it. A goto statement would also allow you to jump into
         | an otherwise-unreachable block.
        
       | Arch-TK wrote:
       | If you don't need long options and don't want to use getopt (for
       | whatever misguided reason) then I wrote a (rare) blog post about
       | doing this without any macro abuse:
       | https://the-tk.com/post/2021/07/29/option-parsing-on-a-budge...
        
       | metadat wrote:
       | For the initiated, Duff's Device is a technique for manually
       | implementing loop unrolling.
       | 
       | https://en.wikipedia.org/wiki/Duff%27s_device
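
For reference, a minimal sketch of the technique. The function name `duff_copy` is hypothetical, and this variant increments the destination pointer like a byte copy would, whereas the original device wrote repeatedly to a single output register. The switch jumps into the middle of the unrolled loop body to handle the remainder, and the do-while then runs whole groups of eight.

```c
#include <stddef.h>

/* Copies count bytes from src to dst with the loop body unrolled
 * eight times. The regions must not overlap. */
static void duff_copy(char *dst, const char *src, size_t count)
{
    size_t passes;

    if (count == 0)
        return;     /* the do-while below always runs at least once */

    passes = (count + 7) / 8;   /* number of trips through the loop */
    switch (count % 8) {        /* jump in to handle the remainder */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

The fall-through between the case labels is intentional here; it is what makes the first, partial pass copy exactly `count % 8` bytes.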
        
         | Arnavion wrote:
         | ( _un_ initiated. The initiated would already know what it is.)
        
           | metadat wrote:
           | Thanks for the typo correction.
        
       | gabrielsroka wrote:
       | Compliant
        
         | metadat wrote:
         | I clicked through to the link before reading any comments:
         | 
         | "In C? I wonder if it's English only?"
         | 
         | Then proceeded to become extremely confused:
         | 
         | "How is this going to help me parse a complaint? It looks like
         | arg parsing..."
         | 
         | ^^
        
         | classified wrote:
         | Still, complaint argument parsing is an important and
         | undervalued skill, POSIX or not.
        
         | camel-cdr wrote:
         | Whoops, is it possible to edit the title?
        
       | kens wrote:
       | I realized that I essentially never use a "switch" statement. It
       | seems like a control-flow construct that made sense in the 1970s
       | to help the compiler generate jump tables, but doesn't seem
       | particularly useful now. Moreover, it seems error-prone with
       | accidental fall-through. And doing "fancy" things with it makes
       | code very hard to read. Does anyone else find "switch" kind of
       | redundant?
        
         | aidenn0 wrote:
         | I use switch statements all the time for state-machines. There
         | are linters that will warn on fallthrough; you should use one
         | of them.
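
A minimal sketch of the state-machine style mentioned above, where the switch selects the handler for the current state. The word counter, its function name `count_words`, and the state names are hypothetical illustrations. Every case ends in `break`, so a fall-through warning from a compiler or linter would point at a real mistake.

```c
#include <ctype.h>
#include <stddef.h>

enum state { OUTSIDE_WORD, INSIDE_WORD };

/* Counts whitespace-separated words with a two-state machine. */
static size_t count_words(const char *s)
{
    enum state st = OUTSIDE_WORD;
    size_t words = 0;

    for (; *s != '\0'; s++) {
        int space = isspace((unsigned char)*s);

        switch (st) {
        case OUTSIDE_WORD:
            if (!space) {       /* first character of a new word */
                words++;
                st = INSIDE_WORD;
            }
            break;
        case INSIDE_WORD:
            if (space)          /* the current word has ended */
                st = OUTSIDE_WORD;
            break;
        }
    }
    return words;
}
```

Adding a state means adding an enumerator and a case, and many compilers can warn when a switch over an enum misses one.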
        
         | oweiler wrote:
         | I still find switch statements easier on the eye than a chain
         | of if-else-if-...else.
        
       ___________________________________________________________________
       (page generated 2023-01-05 23:01 UTC)