pantheon: Parsing command line arguments
28 September 2024
1 - The Goal
2 - Lexing command line arguments
3 - Parsing command line arguments provided a description
4 - Derive Macro
4.1 - Introduction to procedural macros
4.2 - Parsing a struct
4.3 - Implementing the trait
4.4 - Handling Subcommands
After managing free memory I wanted to implement another feature, but
it launched me into a very long yak shave. The next few posts (I can
think of at least three!) will take us on a bit of a side journey
before we can come back to the kernel (hades).
This first post is on writing a library to parse command line
arguments (I know, this seems to be very removed from kernel
development).
It's named sheshat, after the ancient Egyptian goddess of writing
(among other things).
The Goal
Before writing any code let's talk a bit about what we want. This
project is going to be a standalone crate to parse command line-style
arguments. Because we may not have access to the standard library in
all contexts (maybe we want to use our library in the kernel?), we
are going to write a no_std library.
Here are the different kinds of arguments we want to parse:
* Short flags: -a
* Fused short flags: -ab
* Short options with fused values: -afoo
* Short options: -a foo
* Long flags: --long
* Long options with fused values: --long=foo
* Long options: --long foo
* Positional arguments: foo
We also want to respect common patterns, like -- marking the end of
options & - being a valid value. This crate won't handle whitespace
splitting; the main input will be an array of &str (or, to be more
user-friendly, of T: AsRef<str>).
The main interface we are going to expose is a derive macro, given a
struct Args:
struct Args<'a> {
    switch: bool,
    other_switch: bool,
    x_option: bool,
    y_option: bool,
    long: &'a str,
    first_positional: u64,
    remaining_positional: Vec<u64>,
}
We are going to be able to parse something like --switch -xy 1 2 3
--long=str into:
Args {
    switch: true,
    other_switch: false,
    x_option: true,
    y_option: true,
    long: "str",
    first_positional: 1,
    remaining_positional: vec![2, 3],
}
In order to choose how fields map to arguments we will need to
introduce some attributes, but we will talk more about that when we
describe the derive macro itself.
Note: In the example we introduced a borrowed string; the goal is to
be able to borrow from the arguments.
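To give a rough idea of the end result, here is a sketch of how this
could be consumed once the derive macro exists. The parse_arguments
entry point is defined in the last section; the field attributes
selecting short/long names are introduced later and omitted here.

// Purely illustrative: assuming Args derives Sheshat with the
// appropriate field attributes, parsing could look like this.
let input = ["--switch", "-xy", "1", "2", "3", "--long=str"];
let args = Args::parse_arguments(&input).expect("invalid arguments");
assert!(args.switch && !args.other_switch);
assert_eq!(args.long, "str");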
Lexing command line arguments
"Lexing" arguments seem a bit redundant, as we have already
established that the library won't perform whitespace splitting,
which is a big part of parsing arguments. Well this is a task best
left to the program "launcher" whatever it is (a shell for example),
but we still have some complexity to wrangle.
The interface for this step has been very inspired by clap_lex, it
helped me a lot to understand the edge cases!
We will introduce two main components:
/// A wrapper around an array of arguments
struct Arguments<'a, T: AsRef<str>>(&'a [T]);

/// The current index in the array of arguments
struct ArgCursor(usize);

impl<'a, T: AsRef<str>> Arguments<'a, T> {
    fn peek_arg(&self, cursor: &ArgCursor) -> Option<ParsedArgument<'a>>;
    fn advance(&self, cursor: &mut ArgCursor);
    /// Equivalent to calling peek_arg & advance
    fn next_arg(&self, cursor: &mut ArgCursor) -> Option<ParsedArgument<'a>>;
}
Those methods are mainly an abstracted way to iterate over an array;
the interesting methods are on the ParsedArgument.
This struct mainly exposes two kinds of methods:
* Checks
+ is_opt_end to check for --.
+ is_long to check if the current value is a long argument.
+ is_short to check if the current value is a short argument.
* Parsers
+ as_long, returning the argument name & optionally the value
if the string contained a =.
+ as_short, returning a ShortArgument.
We need to be careful when categorizing arguments, as we don't want
--foo to match a short argument, nor do we want -- to match a long
argument.
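To make those rules concrete, here is a minimal sketch of what the
checks could look like, assuming the lexing ParsedArgument simply
wraps the raw &str (the real type may carry more state):

struct ParsedArgument<'a>(&'a str);

impl<'a> ParsedArgument<'a> {
    /// `--` on its own marks the end of options
    fn is_opt_end(&self) -> bool {
        self.0 == "--"
    }

    /// A long argument starts with `--`, but `--` alone is not one
    fn is_long(&self) -> bool {
        self.0.starts_with("--") && self.0 != "--"
    }

    /// A short argument starts with a single `-`; `-` alone is a plain value
    /// and `--foo` must not be treated as a short argument
    fn is_short(&self) -> bool {
        self.0.starts_with('-') && self.0 != "-" && !self.0.starts_with("--")
    }
}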
The reason for introducing a ShortArgument structure is that short
arguments are inherently ambiguous. -abc could be parsed either as -a
-b -c, or as the option -a taking the value bc (or even as -a and -b,
with -b taking the value c). The ShortArgument is an iterator over
the chars, with a method to return the remaining string if it should
be a value.
This means that the argument parsing layer is able to correctly drive
the parsing with the information it has on which letter takes an
argument.
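Here is a minimal sketch of such an iterator, assuming it keeps the
part of the short group that has not been consumed yet (the
remaining_value method name is made up for illustration):

struct ShortArgument<'a> {
    /// What is left of the group, e.g. `bc` once the `a` of `-abc` was yielded
    rest: &'a str,
}

impl<'a> ShortArgument<'a> {
    /// Hand back the rest of the group as a value (for `-afoo` style options)
    fn remaining_value(self) -> &'a str {
        self.rest
    }
}

impl<'a> Iterator for ShortArgument<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let mut chars = self.rest.chars();
        let c = chars.next()?;
        self.rest = chars.as_str();
        Some(c)
    }
}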
Parsing command line arguments provided a description
This layer is going to take two inputs:
* A description of which arguments exist
* An array of strings
It will then output an iterator of parsed arguments (each being a
positional argument, a flag or an option).
The interface is pretty similar to getopt_long: we will introduce a
struct Argument that contains the short letter, the long name &
whether the argument takes a value. It will also contain a name (of
an arbitrary type), used by the caller to identify which arguments
have been provided.
#[non_exhaustive]
#[derive(Debug)]
pub struct Argument<'a, N> {
    pub name: N,
    pub short: Option<char>,
    pub long: Option<&'a str>,
    pub takes_value: bool,
}
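As an illustration, a description for a -v/--verbose flag and an
-o/--output option could look like the following. Argument being
#[non_exhaustive], the real crate has to go through some constructor;
a plain struct literal is used here for readability.

// `Name` is whatever type the caller uses to identify arguments
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Name {
    Verbose,
    Output,
}

let description = [
    Argument { name: Name::Verbose, short: Some('v'), long: Some("verbose"), takes_value: false },
    Argument { name: Name::Output, short: Some('o'), long: Some("output"), takes_value: true },
];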
The output will be a ParsedArgument:
#[derive(Debug, PartialEq, Eq)]
pub enum ParsedArgument<'a, N> {
    Positional(&'a str),
    Flag(N),
    Option(N, &'a str),
}
We will introduce a (different) struct Arguments that will store the
state of our parsing. We mainly need three pieces of information:
* The lexing ArgCursor
* The current (if any) lexing ShortArgument
* Whether options have ended (i.e. -- has been encountered)
We can then implement all the logic in the Iterator::next method! The
Iterator::Item will be a Result<ParsedArgument<'a, N>, Error<'a, N>>.
The Error can represent either an unknown argument, or a mismatch
between an argument & its value.
The next method is pretty long, but I'll summarize what needs to be
done:
* If we have encountered -- then we return anything in the input as
a Positional
* Else if the next string is not an argument return a Positional
If we are in neither of those cases we are parsing an argument. This
can be either a short or a long argument; the handling is slightly
different but in essence we have to do the following:
* Find the Argument corresponding to either the short letter or
long option
* If the argument requires a value, extract it:
+ It can either be in the same parsed argument (as the remainder of
the short group, or after a =)
+ Or it can be the next value in the input stream
This makes up the bulk of the library, and it has tests covering 100%
of it (according to cargo-llvm-cov)!
really verbose to consume, as we need to write a getopt-like loop to
parse the arguments. We can do much better with a derive macro!
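To illustrate why, here is roughly what a caller has to write by
hand, reusing the hypothetical Name enum from the earlier sketch;
building the argument description and the iterator itself is left
out, as its constructor is not shown in this post.

fn collect<'a, E>(
    parsed: impl Iterator<Item = Result<ParsedArgument<'a, Name>, E>>,
) -> Result<(bool, Option<&'a str>, Vec<&'a str>), E> {
    let mut verbose = false;
    let mut output = None;
    let mut positional = Vec::new();

    for arg in parsed {
        match arg? {
            ParsedArgument::Flag(Name::Verbose) => verbose = true,
            ParsedArgument::Option(Name::Output, value) => output = Some(value),
            ParsedArgument::Positional(value) => positional.push(value),
            // With this description no other flag/option combination can occur
            _ => unreachable!(),
        }
    }

    Ok((verbose, output, positional))
}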
Derive Macro
Writing a derive macro won't be very simple: because of our rule of
using no external dependencies, we won't have access to syn or quote.
Both of those crates are pretty complicated, so we won't be
redeveloping a clone from scratch; we are going to write some ad-hoc
code that will not be as robust.
Introduction to procedural macros
Derive macros are procedural macros. They are implemented as
functions that are dynamically loaded into the compiler, taking
"rust" code as an input & outputting rust code.
This is done through the TokenStream type, which is mostly an
iterator over TokenTree items. TokenTrees are really low level: the
rust compiler has done very little processing on them, they are
mostly the output of lexing the input string. TokenTree is an
enumeration with the following variants:
* Ident: An identifier (i.e. mostly an unquoted sequence of
letters)
* Literal: A string literal, char literal or number literal
* Punct: A symbol (+, =, ,, ...)
* Group: Another TokenStream enclosed by a delimiter ((..), {..} or
[..]).
As you can see we don't have much to go by: we will need to match the
rust syntax ourselves, which is why syn is so useful.
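To make this concrete, here is roughly what the macro receives for a
tiny item (spans and exact formatting omitted):

// Input item:
//
//     struct Foo { x: u32 }
//
// Received token trees (note that keywords arrive as plain identifiers):
//
//     Ident("struct")
//     Ident("Foo")
//     Group(Brace, [Ident("x"), Punct(':'), Ident("u32")])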
A derive macro takes a TokenStream representing the item it was
applied on, and outputs a TokenStream that is appended after the
item. To define a proc-macro we need to create a separate crate:
[package]
name = "sheshat-derive"
version = "0.1.0"
edition = "2021"
[lib]
proc-macro = true
[dependencies]
We can then register our macro:
use proc_macro::TokenStream;

#[proc_macro_derive(Sheshat, attributes(sheshat))]
pub fn sheshat(input: TokenStream) -> TokenStream {
    todo!()
}
This will define a derive macro named Sheshat, which accepts
attributes of the form #[sheshat(...)].
Parsing a struct
In order to simplify our implementation as much as possible we are
going to only support structs with named fields (i.e. struct Foo {
... }). Whenever we encounter an invalid situation we are going to
simply panic!. This generates terrible error messages; in order to
generate good ones we would need to output a TokenStream which emits
compile_error! with the correct span (which is what syn implements).
Doing that is pretty involved, and my implementation is already 750
lines of code, so I will have to cope with the error messages.
To guide our parsing we can look at the rust reference. We can see
that we first need to (optionally) parse a repetition of
OuterAttribute. Those are items of the form:
#[derive(Sheshat)]
#[sheshat(something)]
struct Args {}
We are going to support only one outer attribute:
#[sheshat(borrow(<lifetime>))], where <lifetime> will be a lifetime
that is allowed to borrow the input.
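In terms of tokens, an outer attribute arrives as two token trees,
which makes it easy to check the attribute's name before looking
inside it:

// #[sheshat(...)]
// arrives as:
//     Punct('#')
//     Group(Bracket, [Ident("sheshat"), Group(Paren, ...)])
// so we can match on the first Ident inside the bracket group and skip the
// whole attribute when it is not `sheshat`.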
Warning: When parsing attributes you will see all the attributes on
the item. For example, if the struct also uses serde's attributes you
could encounter those. This is going to be the case in all of the
following sections; you need to take care to ignore attributes that
are not related to your macro.
After handling the required attributes we need to scan (meaning read
the tokens without doing anything with them) the possible visibility
and the struct keyword.
The next token should be the name of the struct, which we must store
as we are going to need to output code of the form impl <trait> for
<struct name> { }.
We now encounter our first tricky bit of parsing: we may need to
parse a group of generic parameters. You may have realized while
reading the introduction that a TokenTree::Group is never delimited
by <..>. Indeed, < and > are TokenTree::Punct. This means that we
must read from a < until the corresponding >, as we could have an
arbitrary stack of nested <, >. This can be done easily by
maintaining a depth counter, and exiting the parsing when we
encounter a > while the depth is 1.
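Here is a sketch of that bookkeeping, assuming tokens is an iterator
over the remaining TokenTrees and the opening < has just been
consumed. The real code also has to keep the tokens around to re-emit
the generics, and to be careful about a > that is part of a -> inside
bounds.

use proc_macro::TokenTree;

fn skip_generics(tokens: &mut impl Iterator<Item = TokenTree>) {
    let mut depth = 1;
    for token in tokens {
        if let TokenTree::Punct(p) = &token {
            match p.as_char() {
                '<' => depth += 1,
                '>' => {
                    depth -= 1;
                    if depth == 0 {
                        // Found the `>` matching the `<` we started from
                        return;
                    }
                }
                _ => {}
            }
        }
    }
    panic!("unclosed generic parameters");
}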
We can now parse the fields of the struct. This starts pretty much
like the struct itself, with attributes & visibility, followed by an
identifier for the name. We then have to parse the type of the field,
and doing this cleanly is pretty involved too, as types can be pretty
complicated. Fortunately we don't care about the internals of the
type; we only need to interact with the full type of the field. This
means that we can read until we find either a , or the end of the
struct. Well, we still need to maintain a depth counter, due to
possible generic parameters that could introduce commas that don't
mean the end of the field's type.
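The same depth-counter idea applies here, this time stopping at a
top-level comma. Again only a sketch, using a peekable iterator so
that the comma itself is left for the caller:

use proc_macro::{TokenStream, TokenTree};

fn field_type(
    tokens: &mut std::iter::Peekable<impl Iterator<Item = TokenTree>>,
) -> TokenStream {
    let mut ty = TokenStream::new();
    let mut depth = 0usize;
    while let Some(token) = tokens.peek() {
        if let TokenTree::Punct(p) = token {
            match p.as_char() {
                '<' => depth += 1,
                '>' => depth = depth.saturating_sub(1),
                // A comma outside any `<...>` ends this field's type
                ',' if depth == 0 => break,
                _ => {}
            }
        }
        ty.extend([tokens.next().unwrap()]);
    }
    ty
}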
For more information on exactly how this is implemented please refer
to the full implementation.
Implementing the trait
We now have the following information:
* Global attributes on the struct
* The name of the struct
* The generic parameters of the struct
* Each field, with the following information:
+ The name of the field
+ The type of the field
+ Attributes that say if the argument is short, long or both
We are going to map fields that are neither short nor long to
positional arguments. Before doing anything else we need to define
the trait we are going to implement:
#[derive(Debug)]
pub enum Error<'a, E, N> {
    Parsing(E),
    /// An error in the argument parsing
    InvalidArgument(args::Error<'a, N>),
    TooManyPositional,
    MissingPositional(&'static str),
    MissingArgument(N),
}

pub trait Sheshat<'a>: Sized {
    type ParseErr;
    type Name;

    fn parse_arguments<T: AsRef<str>>(
        args: &'a [T],
    ) -> Result<Self, Error<'a, Self::ParseErr, Self::Name>>;
}
We can easily generate a Names enum for mapping each short/long
field to a variant.
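Purely as an illustration, for the Args struct from the beginning of
the post the generated enum could look something like this (the
identifiers actually produced by the macro may differ):

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgsNames {
    Switch,
    OtherSwitch,
    XOption,
    YOption,
    Long,
}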
To parse the arguments we are going to need to:
* Build the array of Argument to pass to our argument parsing
Iterator
* Run the argument iterator and temporarily store the obtained
values
* Parse each obtained value if needed (for example u64)
* Validate that all required values have been supplied
* Create the struct instance
The main issue for all those operations is that we want an entirely
different behavior depending on the type:
* bool are flags
* &'a str are passed as is
* ...
Well, we could use traits for this; they are made to abstract over
types, right? Unfortunately the ... hides something really important.
We want to support parsing any type that implements FromStr. This
means that we just asked for something currently impossible from the
compiler: the ability to use a trait for an operation, except for a
few specific types. The name for this is specialization, and it's a
pretty complicated feature that is not ready at all.
So is our derive macro impossible then? No! In macros we can use a
kind of specialization: autoref specialization.
Autoref specialization
Autoref specialization is a pretty neat trick that allows macros to
perform a kind of specialization. I'm not going to explain how it
works (please look at the linked article for that), but I am going to
describe how to obtain all the different behaviours we want.
First we are going to introduce a struct To:
use core::marker::PhantomData;

pub struct To<T>(pub PhantomData<T>);
This struct is going to be the value on which we perform the
specialization by calling a method.
We are then going to introduce a number of markers that represent the
different behaviours we want:
/// For parsing `bool` values
pub struct FlagArg;
/// For parsing `&'a str` values
pub struct IdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `impl FromStr` values
pub struct ParseOptArg<T>(PhantomData<T>);
/// For parsing `Option<&'a str>` values
pub struct OptionalIdOptArg<'a>(PhantomData<&'a ()>);
/// For parsing `Option<impl FromStr>` values
pub struct OptionalParseOptArg<T>(PhantomData<T>);
/// For parsing `impl Extend<&'a str>` values
pub struct SequenceIdArg<'a, S>(PhantomData<(&'a (), S)>);
/// For parsing `impl Extend<impl FromStr>` values
pub struct SequenceParseArg<S, T>(PhantomData<(S, T)>);
We are then going to define an _arg method on To that returns the
appropriate marker for the type! Because we need to forward the
generic type/lifetime to the marker, we need to include a generic
argument to _arg, which is the type of the field we extracted in the
derive macro.
This is the essence of autoref specialization: we will have a
different _arg method for each situation, and the compiler will
choose the first one that matches.
pub trait ViaFlagArg {
    fn _arg(&self) -> FlagArg {
        FlagArg
    }
}
impl ViaFlagArg for &&&&&&To<bool> {}

pub trait ViaIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> IdOptArg<'a> {
        IdOptArg(Default::default())
    }
}
impl<'a> ViaIdOptArg for &&&&&To<&'a str> {}

pub trait ViaOptionalIdOptArg {
    fn _arg<'a, T: 'a>(&self) -> OptionalIdOptArg<'a> {
        OptionalIdOptArg(Default::default())
    }
}
impl<'a> ViaOptionalIdOptArg for &&&&To