pantheon: Parsing command line arguments
28 September 2024
1 - The Goal
2 - Lexing command line arguments
3 - Parsing command line arguments provided a description
4 - Derive Macro
4.1 - Introduction to procedural macros
4.2 - Parsing a struct
4.3 - Implementing the trait
4.4 - Handling Subcommands
After managing free memory I wanted to implement another feature, but
it launched me into a very long yak shave. The next few posts (I can
think of at least three!) will take us on a bit of a side journey
before we can come back to the kernel (hades).
This first post is on writing a library to parse command line
arguments (I know, this seems to be very removed from kernel
development).
It's named sheshat, after the ancient Egyptian goddess of writing
(among other things).
The Goal
Before writing any code let's talk a bit about what we want. This
project is going to be a standalone crate to parse command line-style
arguments. Because we may not have access to the standard library in
all contexts (maybe we want to use our library in the kernel?), we
are going to write a no_std library.
Here are the different kinds of arguments we want to parse:
* Short flags: -a
* Fused short flags: -ab
* Short options with fused values: -afoo
* Short options: -a foo
* Long flags: --long
* Long options with fused values: --long=foo
* Long options: --long foo
* Positional arguments: foo
We also want to respect common patterns, like -- marking the end of
options & - being a valid value. This crate won't handle whitespace
splitting; the main input will be an array of &str (or, to be more
user-friendly, of T: AsRef<str>).
The main interface we are going to expose is a derive macro, given a
struct Args:
struct Args<'a> {
    switch: bool,
    other_switch: bool,
    x_option: bool,
    y_option: bool,
    long: &'a str,
    first_positional: u64,
    remaining_positional: Vec<u64>,
}
We are going to be able to parse something like --switch -xy 1 2 3
--long=str into:
Args {
    switch: true,
    other_switch: false,
    x_option: true,
    y_option: true,
    long: "str",
    first_positional: 1,
    remaining_positional: vec![2, 3],
}
In order to choose how fields map to arguments we will need to
introduce some attributes, but we will talk more about that when we
describe the derive macro itself.
Note: In the example we introduced a borrowed string; the goal is to
be able to borrow from the arguments.
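To give a rough idea of the end result, here is a sketch of how this
could be consumed once the derive macro exists. The parse_arguments
entry point is defined in the last section; the field attributes
selecting short/long names are introduced later and omitted here.

// Purely illustrative: assuming Args derives Sheshat with the
// appropriate field attributes, parsing could look like this.
let input = ["--switch", "-xy", "1", "2", "3", "--long=str"];
let args = Args::parse_arguments(&input).expect("invalid arguments");
assert!(args.switch && !args.other_switch);
assert_eq!(args.long, "str");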
Lexing command line arguments
"Lexing" arguments seem a bit redundant, as we have already
established that the library won't perform whitespace splitting,
which is a big part of parsing arguments. Well this is a task best
left to the program "launcher" whatever it is (a shell for example),
but we still have some complexity to wrangle.
The interface for this step has been very inspired by clap_lex, it
helped me a lot to understand the edge cases!
We will introduce two main components:
/// A wrapper around an array of arguments
struct Arguments<'a, T: AsRef<str>>(&'a [T]);

/// The current index in the array of arguments
struct ArgCursor(usize);

impl<'a, T: AsRef<str>> Arguments<'a, T> {
    fn peek_arg(&self, cursor: &ArgCursor) -> Option<ParsedArgument<'a>>;
    fn advance(&self, cursor: &mut ArgCursor);
    /// Equivalent to calling peek_arg & advance
    fn next_arg(&self, cursor: &mut ArgCursor) -> Option<ParsedArgument<'a>>;
}
Those methods are mainly an abstracted way to iterate over an array;
the interesting methods are on the ParsedArgument.
This struct mainly exposes two kinds of methods:
* Checks
+ is_opt_end to check for --.
+ is_long to check if the current value is a long argument.
+ is_short to check if the current value is a short argument.
* Parsers
+ as_long, returning the argument name & optionally the value
if the string contained a =.
+ as_short, returning a ShortArgument.
We need to be careful when categorizing arguments, as we don't want
--foo to match a short argument, nor do we want -- to match a long
argument.
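To make those rules concrete, here is a minimal sketch of what the
checks could look like, assuming the lexing ParsedArgument simply
wraps the raw &str (the real type may carry more state):

struct ParsedArgument<'a>(&'a str);

impl<'a> ParsedArgument<'a> {
    /// `--` on its own marks the end of options
    fn is_opt_end(&self) -> bool {
        self.0 == "--"
    }

    /// A long argument starts with `--`, but `--` alone is not one
    fn is_long(&self) -> bool {
        self.0.starts_with("--") && self.0 != "--"
    }

    /// A short argument starts with a single `-`; `-` alone is a plain value
    /// and `--foo` must not be treated as a short argument
    fn is_short(&self) -> bool {
        self.0.starts_with('-') && self.0 != "-" && !self.0.starts_with("--")
    }
}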
The reason for introducing a ShortArgument structure is that short
arguments are inherently ambiguous. -abc could be parsed either as -a
-b -c, or as the option -a taking the value bc (or even as -a and -b,
with -b taking the value c). The ShortArgument is an iterator over
the chars, with a method to return the remaining string if it should
be a value.
This means that the argument parsing layer is able to correctly drive
the parsing with the information it has on which letter takes an
argument.
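Here is a minimal sketch of such an iterator, assuming it keeps the
part of the short group that has not been consumed yet (the
remaining_value method name is made up for illustration):

struct ShortArgument<'a> {
    /// What is left of the group, e.g. `bc` once the `a` of `-abc` was yielded
    rest: &'a str,
}

impl<'a> ShortArgument<'a> {
    /// Hand back the rest of the group as a value (for `-afoo` style options)
    fn remaining_value(self) -> &'a str {
        self.rest
    }
}

impl<'a> Iterator for ShortArgument<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let mut chars = self.rest.chars();
        let c = chars.next()?;
        self.rest = chars.as_str();
        Some(c)
    }
}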
Parsing command line arguments provided a description
This layer is going to take two inputs:
* A description of which arguments exist
* An array of strings
It will then output an iterator of parsed arguments (each being a
positional argument, a flag or an option).
The interface is pretty similar to getopt_long: we will introduce a
struct Argument that contains the short letter, the long name &
whether the argument takes a value. It will also contain a name (of
an arbitrary type), used by the caller to identify which arguments
have been provided.
#[non_exhaustive]
#[derive(Debug)]
pub struct Argument<'a, N> {
    pub name: N,
    pub short: Option<char>,
    pub long: Option<&'a str>,
    pub takes_value: bool,
}
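As an illustration, a description for a -v/--verbose flag and an
-o/--output option could look like the following. Argument being
#[non_exhaustive], the real crate has to go through some constructor;
a plain struct literal is used here for readability.

// `Name` is whatever type the caller uses to identify arguments
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Name {
    Verbose,
    Output,
}

let description = [
    Argument { name: Name::Verbose, short: Some('v'), long: Some("verbose"), takes_value: false },
    Argument { name: Name::Output, short: Some('o'), long: Some("output"), takes_value: true },
];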
The output will be a ParsedArgument:
#[derive(Debug, PartialEq, Eq)]
pub enum ParsedArgument<'a, N> {
    Positional(&'a str),
    Flag(N),
    Option(N, &'a str),
}
We will introduce a (different) struct Arguments that will store the
state of our parsing. We mainly need three pieces of information:
* The lexing ArgCursor
* The current (if any) lexing ShortArgument
* Whether options have ended (i.e. -- has been encountered)
We can then implement all the logic in the Iterator::next method! The
Iterator::Item will be a Result<ParsedArgument<'a, N>, Error<'a, N>>.
The Error can represent either an unknown argument, or a mismatch
between an argument & its value.
The next method is pretty long, but I'll summarize what needs to be
done:
* If we have encountered -- then we return anything in the input as
a Positional
* Else if the next string is not an argument return a Positional
If we are in neither of those cases we are parsing an argument. This
can be either a short or a long argument; the handling is slightly
different but in essence we have to do the following:
* Find the Argument corresponding to either the short letter or
long option
* If the argument requires a value, extract it:
+ It can either be in the same parsed argument (as the remainder of
the short group, or after a =)
+ Or it can be the next value in the input stream
This makes up the bulk of the library, and it has tests covering 100%
of it (according to cargo-llvm-cov)!
really verbose to consume, as we need to write a getopt-like loop to
parse the arguments. We can do much better with a derive macro!
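To illustrate why, here is roughly what a caller has to write by
hand, reusing the hypothetical Name enum from the earlier sketch;
building the argument description and the iterator itself is left
out, as its constructor is not shown in this post.

fn collect<'a, E>(
    parsed: impl Iterator<Item = Result<ParsedArgument<'a, Name>, E>>,
) -> Result<(bool, Option<&'a str>, Vec<&'a str>), E> {
    let mut verbose = false;
    let mut output = None;
    let mut positional = Vec::new();

    for arg in parsed {
        match arg? {
            ParsedArgument::Flag(Name::Verbose) => verbose = true,
            ParsedArgument::Option(Name::Output, value) => output = Some(value),
            ParsedArgument::Positional(value) => positional.push(value),
            // With this description no other flag/option combination can occur
            _ => unreachable!(),
        }
    }

    Ok((verbose, output, positional))
}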
Derive Macro
Writing a derive macro won't be very simple: because of our rule of
using no external dependencies, we won't have access to syn or quote.
Both of those crates are pretty complicated, so we won't be
redeveloping a clone from scratch; we are going to write some ad-hoc
code that will not be as robust.
Introduction to procedural macros
Derive macros are procedural macros. They are implemented as
functions that are dynamically loaded into the compiler, taking
"rust" code as an input & outputting rust code.
This is done through the TokenStream type, which is mostly an
iterator over TokenTree items. TokenTrees are really low level: the
rust compiler has done very little processing on them, they are
mostly the output of lexing the input string. TokenTree is an
enumeration with the following variants:
* Ident: An identifier (i.e. mostly an unquoted sequence of
letters)
* Literal: A string literal, char literal or number literal
* Punct: A symbol (+, =, ,, ...)
* Group: Another TokenStream enclosed by a delimiter ((..), {..} or
[..]).
As you can see we don't have much to go by: we will need to match the
rust syntax ourselves, which is why syn is so useful.
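To make this concrete, here is roughly what the macro receives for a
tiny item (spans and exact formatting omitted):

// Input item:
//
//     struct Foo { x: u32 }
//
// Received token trees (note that keywords arrive as plain identifiers):
//
//     Ident("struct")
//     Ident("Foo")
//     Group(Brace, [Ident("x"), Punct(':'), Ident("u32")])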
A derive macro takes a TokenStream representing the item it was
applied on, and outputs a TokenStream that is appended after the
item. To define a proc-macro we need to create a separate crate:
[package]
name = "sheshat-derive"
version = "0.1.0"
edition = "2021"
[lib]
proc-macro = true
[dependencies]
We can then register our macro:
use proc_macro::TokenStream;

#[proc_macro_derive(Sheshat, attributes(sheshat))]
pub fn sheshat(input: TokenStream) -> TokenStream {
    todo!()
}
This will define a derive macro named Sheshat, which accepts
attributes of the form #[sheshat(...)].
Parsing a struct
In order to simplify our implementation as much as possible we are
going to only support structs with named fields (i.e. struct Foo {
... }). Whenever we encounter an invalid situation we are going to
simply panic!. This generates terrible error messages; in order to
generate good ones we would need to output a TokenStream which emits
compile_error! with the correct span (which is what syn implements).
Doing that is pretty involved, and my implementation is already 750
lines of code, so I will have to cope with the error messages.
To guide our parsing we can look at the rust reference. We can see
that we first need to (optionally) parse a repetition of
OuterAttribute. Those are items of the form:
#[derive(Sheshat)]
#[sheshat(something)]
struct Args {}
We are going to support only one outer attribute:
#[sheshat(borrow(<lifetime>))], where <lifetime> will be a lifetime
that is allowed to borrow the input.
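In terms of tokens, an outer attribute arrives as two token trees,
which makes it easy to check the attribute's name before looking
inside it:

// #[sheshat(...)]
// arrives as:
//     Punct('#')
//     Group(Bracket, [Ident("sheshat"), Group(Paren, ...)])
// so we can match on the first Ident inside the bracket group and skip the
// whole attribute when it is not `sheshat`.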
Warning: When parsing attributes you will see all the attributes on
the item. For example, if the struct also uses serde's attributes you
could encounter those. This is going to be the case in all of the
following sections; you need to take care to ignore attributes that
are not related to your macro.
After handling the required attributes we need to scan (meaning read
the tokens without doing anything with them) the possible visibility
and the struct keyword.
The next token should be the name of the struct, which we must store
as we are going to need to output code of the form impl <trait> for
<struct name> { }.
We now encounter our first tricky bit of parsing: we may need to
parse a group of generic parameters. You may have realized while
reading the introduction that a TokenTree::Group is never delimited
by <..>. Indeed, < and > are TokenTree::Punct. This means that we
must read from a < until the corresponding >, as we could have an
arbitrary stack of nested <, >. This can be done easily by
maintaining a depth counter, and exiting the parsing when we
encounter a > while the depth is 1.
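Here is a sketch of that bookkeeping, assuming tokens is an iterator
over the remaining TokenTrees and the opening < has just been
consumed. The real code also has to keep the tokens around to re-emit
the generics, and to be careful about a > that is part of a -> inside
bounds.

use proc_macro::TokenTree;

fn skip_generics(tokens: &mut impl Iterator<Item = TokenTree>) {
    let mut depth = 1;
    for token in tokens {
        if let TokenTree::Punct(p) = &token {
            match p.as_char() {
                '<' => depth += 1,
                '>' => {
                    depth -= 1;
                    if depth == 0 {
                        // Found the `>` matching the `<` we started from
                        return;
                    }
                }
                _ => {}
            }
        }
    }
    panic!("unclosed generic parameters");
}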
We can now parse the fields of the struct. This starts pretty much
like the struct itself, with attributes & visibility, followed by an
identifier for the name. We then have to parse the type of the field,
and doing this cleanly is pretty involved too, as types can be pretty
complicated. Fortunately we don't care about the internals of the
type; we only need to interact with the full type of the field. This
means that we can read until we find either a , or the end of the
struct. Well, we still need to maintain a depth counter, due to
possible generic parameters that could introduce commas that don't
mean the end of the field's type.
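The same depth-counter idea applies here, this time stopping at a
top-level comma. Again only a sketch, using a peekable iterator so
that the comma itself is left for the caller:

use proc_macro::{TokenStream, TokenTree};

fn field_type(
    tokens: &mut std::iter::Peekable<impl Iterator<Item = TokenTree>>,
) -> TokenStream {
    let mut ty = TokenStream::new();
    let mut depth = 0usize;
    while let Some(token) = tokens.peek() {
        if let TokenTree::Punct(p) = token {
            match p.as_char() {
                '<' => depth += 1,
                '>' => depth = depth.saturating_sub(1),
                // A comma outside any `<...>` ends this field's type
                ',' if depth == 0 => break,
                _ => {}
            }
        }
        ty.extend([tokens.next().unwrap()]);
    }
    ty
}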
For more information on exactly how this is implemented please refer
to the full implementation.
Implementing the trait
We now have the following information:
* Global attributes on the struct
* The name of the struct
* The generic parameters of the struct
* Each field, with the following information:
+ The name of the field
+ The type of the field
+ Attributes that say if the argument is short, long or both
We are going to map fields that are neither short nor long to
positional arguments. Before doing anything else we need to define
the trait we are going to implement:
#[derive(Debug)]
pub enum Error<'a, E, N> {
    Parsing(E),
    /// An error in the argument parsing
    InvalidArgument(args::Error<'a, N>),
    TooManyPositional,
    MissingPositional(&'static str),
    MissingArgument(N),
}

pub trait Sheshat<'a>: Sized {
    type ParseErr;
    type Name;

    fn parse_arguments<T: AsRef<str>>(
        args: &'a [T],
    ) -> Result<Self, Error<'a, Self::ParseErr, Self::Name>>;
}
We can easily generate a Names enum for mapping each short/long
field to a variant.
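Purely as an illustration, for the Args struct from the beginning of
the post the generated enum could look something like this (the
identifiers actually produced by the macro may differ):

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgsNames {
    Switch,
    OtherSwitch,
    XOption,
    YOption,
    Long,
}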
To parse the arguments we are going to need to:
* Build the array of Argument to pass to our argument parsing
Iterator
* Run the argument iterator and temporarily store the obtained
values
* Parse each obtained value if needed (for example u64)
* Validate that all required values have been supplied
* Create the struct instance
The main issue for all those operations is that we want an entirely
different behavior depending on the type:
* bool are flags
* &'a str are passed as is
* ...
Well, we could use traits for this; they are made to abstract over
types, right? Unfortunately the ... hides something really important.
We want to support parsing any type that implements FromStr. This
means that we just asked for something currently impossible from the
compiler: the ability to use a trait for an operation, except for a
few specific types. The name for this is specialization, and it's a
pretty complicated feature that is not ready at all.
So is our derive macro impossible then? No! In macros we can use a
kind of specialization: autoref specialization.
Autoref specialization
Autoref specialization is a pretty neat trick that allows macros to
perform a kind of specialization. I'm not going to explain how it
works (please look at the linked article for that), but I am going to
describe how to obtain all the different behaviours we want.
First we are going to introduce a struct To:
use core::marker::PhantomData;

pub struct To<T>(pub PhantomData<T>);
This struct is going to be the value on which we perform the
specialization by calling a method.
We are then going to introduce a number of markers that represent the
different behaviours we want:
/// For parsing `bool` values
pub struct FlagArg;
/// For parsing `&'a str` values
pub struct IdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `impl FromStr` values
pub struct ParseOptArg<T>(PhantomData<T>);
/// For parsing `Option<&'a str>` values
pub struct OptionalIdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `Option<impl FromStr>` values
pub struct OptionalParseOptArg<T>(PhantomData<T>);
/// For parsing `impl Extend<&'a str>` values
pub struct SequenceIdArg<'a, S>(PhantomData<(&'a (), S)>);
/// For parsing `impl Extend<impl FromStr>` values
pub struct SequenceParseArg<S, T>(PhantomData<(S, T)>);
We are then going to define an _arg method on To that returns the
appropriate marker for the type! Because we need to forward the
generic type/lifetime to the marker, we need to include a generic
argument to _arg, which is the type of the field we extracted in the
derive macro.
This is the essence of autoref specialization: we will have a
different _arg method for each situation, and the compiler will
choose the first one that matches.
pub trait ViaFlagArg {
    fn _arg(&self) -> FlagArg {
        FlagArg
    }
}
impl ViaFlagArg for &&&&&&To<bool> {}

pub trait ViaIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> IdOptArg<'a> {
        IdOptArg(Default::default())
    }
}
impl<'a> ViaIdOptArg for &&&&&To<&'a str> {}

pub trait ViaOptionalIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> OptionalIdOptArg<'a> {
        OptionalIdOptArg(Default::default())
    }
}
impl<'a> ViaOptionalIdOptArg for &&&&To