Parser Combinators in Haskell

Article by Heitor Toledo Lassarote de Paula
December 20th, 2021

Welcome! If you are reading this, you likely have decided to take on the journey of learning parser combinators. We hope this article will make your adventure smoother and give you a strong foundation for writing your own grammars.

This article is composed of three major parts. In the first part, we will implement a small parser combinator library from scratch, which should help to give a feeling of how industrial-strength parser combinators work. In the second part, we will learn how to use the Megaparsec library to implement a parser for S-expressions. Finally, as a bonus, we will use the power of Template Haskell to implement a quasi-quoter for our parser.

What is a parser combinator and why use them?

In 2001, Daan Leijen and Erik Meijer published a paper titled Parsec: Direct Style Monadic Parser Combinators for the Real World, describing the parsec library, whose design consequently influenced various others, such as megaparsec, attoparsec, trifecta, and even libraries outside the Haskell ecosystem, such as NimbleParsec for Elixir, parsec for Python, and FParsec for F#, among others.

Parser combinators are known to be simple to use without requiring external tools or too many concepts to learn. They are ordinary Haskell values that can easily be combined with and returned from other parsers because of their nature as functions. This, in turn, makes them idiomatic to use, and since they are a popular choice among Haskellers, their ecosystem is well developed. You should have no trouble finding tutorials and documentation on how to use them.

It's important to notice that parser combinators also have their flaws. Parser combinators may be implemented as either LL(1) or LL(∞) parsers, which, elaborating on the points described in Kirill Andreev's How to Implement an LR(1) Parser:

1. Don't support left recursion. That is to say, if there is a rule foo = foo *> bar, you will need to refactor your grammar, otherwise it will loop forever (a sketch of one such refactor appears at the end of this section).
2. Don't resolve conflicts, which we will see as an example later in this article. This means the ordering of rules can influence the output, and you need to be careful when defining rules that can consume the same input.
   + In an LL(1) implementation, it will pick the first alternative that matches the input.
   + In an LL(∞) implementation, it will try the longest possible match, which may be very inefficient.
3. Many parser combinator libraries will backtrack, which will also be exemplified later, meaning that you might need to be careful to avoid performance penalties.
4. The backtracking mechanism may, in turn, lead to exponential time complexities.

This shouldn't, however, discourage you from their use, as their benefits outweigh the drawbacks. Ultimately, the parsing library of your choice should be chosen based on your needs, and parser combinators are a good fit for most use cases.
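To make the left recursion point concrete, here is a minimal sketch of the usual refactor. It is written with Megaparsec (introduced later in this article), and the names P, term, and expr are ours, chosen just for illustration: instead of defining a sum as "a sum followed by a term", we parse one term and then iterate with many, folding as we go.

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type P = Parsec Void String

term :: P Integer
term = L.decimal

-- Left-recursive and therefore looping forever:
--   expr = (+) <$> expr <* char '+' <*> term
-- Refactored: one term, then zero or more "+ term" continuations,
-- folded from the left to keep left associativity.
expr :: P Integer
expr = do
  first <- term
  rest  <- many (char '+' *> term)
  pure (foldl (+) first rest)

With this sketch, parseTest expr "1+2+3" should print 6.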
Implementing a simple parser combinator from scratch

Before we delve into megaparsec, it may be interesting to implement our own parser combinator from scratch. We will make a simple one that should hopefully help you understand how parser combinators work. If you instead prefer to jump right into the action, skip to the Megaparsec tutorial.

Implementation

We will begin by importing the appropriate definitions in a new file called Parser.hs:

module Parser where

import Control.Applicative (Alternative (..))
import Data.List (nub)

We will create two data structures that represent our parser. The first of them represents some error that may happen while parsing. For now, let's have three errors: one that is raised when we are expecting input but there is nothing to consume (EndOfInput), one when we find an unexpected character (Unexpected), and one for any error messages that the user might want to raise (CustomError).

Our second structure is the actual parser. It's represented as a function that takes some input and either returns a list of errors, if Left, or the result of parsing the input together with the rest of the input (that was not parsed), if Right.

data Error i e
  = EndOfInput    -- Expected more input, but there is nothing
  | Unexpected i  -- We didn't expect to find this element
  | CustomError e -- Extra errors the user may want to create
  | Empty         -- Used in the `Alternative` implementation of `empty`
  deriving (Eq, Show)

newtype Parser i e a = Parser
  { runParser :: [i] -> Either [Error i e] (a, [i])
  }

This is the meaning of each of the type variables:

* i: The input stream. For most cases, we will use Char, as [Char] is the same as String. It's interesting to notice that keeping this as a type variable allows us to parse things that are not strings as well.
* e: The type of custom error messages. If we don't have those, we may use Void instead.
* a: The result of our parsing function. It represents the structure parsed from our consumed input.

Let's begin by creating the most primitive function: satisfy. It tests the current character with a predicate, and if it succeeds, it will advance the parser and return the consumed character. We must take care that we are not at the end of the input, however, in which case we should fail.

satisfy :: (i -> Bool) -> Parser i e i
satisfy predicate = Parser $ \input ->
  case input of
    [] -> Left [EndOfInput]
    hd : rest
      | predicate hd -> Right (hd, rest)
      | otherwise -> Left [Unexpected hd]

It's also common to define a function that only succeeds if the input strictly matches some given element.

char :: Eq i => i -> Parser i e i
char i = satisfy (== i)

If you load this in GHCi, you should now be able to test it:

>>> :l Parser
>>> runParser (char 'h') "hello"
Right ('h', "ello")
>>> runParser (char 'h') "greetings"
Left [Unexpected 'g']
>>> runParser (char 'h') ""
Left [EndOfInput]

Let's now create Monad and Alternative instances for our parser. For educational purposes, we will also explicitly implement Functor and Applicative instead of fmap = liftM and (<*>) = ap.

We start with Functor. Since we map over the a type variable obtained in the case of a successful parse, we can only apply f in case of Right.

instance Functor (Parser i e) where
  fmap f (Parser p) = Parser $ \input ->
    case p input of
      Left err -> Left err
      Right (output, rest) -> Right (f output, rest)

Since Either is a monad, we can use do-notation.
Here's an implementation that's a little shorter:

instance Functor (Parser i e) where
  fmap f (Parser p) = Parser $ \input -> do
    (output, rest) <- p input
    pure (f output, rest)

Next up is Applicative. A good intuition behind Applicatives is that pure represents an identity element, while <*> distributes one argument over the other. Think of this like a multiplication, where 0 * a = a * 0 = 0, while 1 * a = a * 1 = a. To illustrate this, observe this table:

                          Left parser failed        Left parser succeeded
  Right parser failed     Error from left parser    Error from right parser
  Right parser succeeded  Error from left parser    Success

Note that even if both parsers fail, we only want the error from the first one, as we execute the second one only if the first one succeeded. This is due to the Either part of the parser, where Left is like 0 and Right is like 1.

pure represents our 1, so it should create a parser that always succeeds without consuming any input. And similarly to Either, we want <*> to fail in case any of the operands result in a failed parse given some input. This implementation tries to run the left operand, which consumes some input and produces a function as well as the unconsumed input. The rest of the input is given to the right operand, and in case of success, it applies the function to the argument.

instance Applicative (Parser i e) where
  pure a = Parser $ \input -> Right (a, input)

  Parser f <*> Parser p = Parser $ \input ->
    case f input of
      Left err -> Left err
      Right (f', rest) ->
        case p rest of
          Left err -> Left err
          Right (output, rest') -> Right (f' output, rest')

Or with Either's monad instance:

instance Applicative (Parser i e) where
  pure a = Parser $ \input -> Right (a, input)

  Parser f <*> Parser p = Parser $ \input -> do
    (f', rest) <- f input
    (output, rest') <- p rest
    pure (f' output, rest')

In our Monad instance, we set return = pure as usual. Our bind is more interesting: we try to run the given parser with the input, and if it succeeds, we give the produced output (of type a) to our continuation k, which produces a parser that is executed with the remainder of the input.

instance Monad (Parser i e) where
  return = pure

  Parser p >>= k = Parser $ \input ->
    case p input of
      Left err -> Left err
      Right (output, rest) ->
        let Parser p' = k output
        in p' rest

Again, using the fact that Either is a monad, we can simplify this implementation:

instance Monad (Parser i e) where
  return = pure

  Parser p >>= k = Parser $ \input -> do
    (output, rest) <- p input
    runParser (k output) rest

With our Applicative and Monad instances, we are now able to parse sequentially. For example, we defined char above to let us parse a single character, but what if we wanted to parse many of them?

string :: Eq i => [i] -> Parser i e [i]
string = traverse char

If traverse is new to you, it's also equivalent to these definitions:

string :: Eq i => [i] -> Parser i e [i]
string [] = pure []
string (x : xs) = (:) <$> char x <*> string xs

Or even in terms of Monad:

string [] = return []
string (x : xs) = do
  y <- char x
  ys <- string xs
  return (y : ys)

Let's try it in GHCi:

>>> runParser (string "Haskell") "Haskell"
Right ("Haskell","")
>>> runParser (string "Haskell") "Halloween"
Left [Unexpected 'l']

Finally, our last instance: Alternative. This may be new for you, so I will explain.
First, take a look at its definition, which can be found in Control.Applicative:

class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a

I mentioned that pure acts as 1, which is a parser that always succeeds without consuming input. empty works like the converse (or 0), that is, a parser that always fails without consuming input.

For (<|>), we want to implement it such that the first parser to succeed is used. That is, we first try the left operand, and failing that, we try the second one. The table below shall guide you:

                          Left parser failed          Left parser succeeded
  Right parser failed     Errors from both parsers    Left parser
  Right parser succeeded  Right parser                Left parser

Here is an implementation of Alternative. We constrain i and e with Eq so we can use nub to remove duplicated errors. Note that this is not the best or most efficient way to merge errors, but for our purposes, it's good enough.

instance (Eq i, Eq e) => Alternative (Parser i e) where
  empty = Parser $ \_ -> Left [Empty]

  Parser l <|> Parser r = Parser $ \input ->
    case l input of
      Left err ->
        case r input of
          Left err' -> Left $ nub $ err <> err'
          Right (output, rest) -> Right (output, rest)
      Right (output, rest) -> Right (output, rest)

Finally, let's test it!

>>> runParser (string "hello" <|> string "greetings") "hello, world"
Right ("hello",", world")
>>> runParser (string "hello" <|> string "greetings") "greetings, world"
Right ("greetings",", world")
>>> runParser (string "hello" <|> string "greetings") "bye, world"
Left [Unexpected 'b']
>>> runParser (empty <|> pure ()) ""
Right ((),"")
>>> runParser (pure () <|> empty) ""
Right ((),"")

We can see our errors composing with a more complicated expression:

>>> runParser ((string "hello" *> string ", globe") <|> string "greetings") "hello, world"
Left [Unexpected 'w',Unexpected 'h']

Hopefully, this should give you an intuition of how parser combinators work. Popular parser combinator libraries are more complex than this, as they tend to keep track of extra state for better error messages, optimization, information about line and column, etc., but the overall operations between them are similar.

The complete code for this section with solutions to the exercises below can be found here.

Exercises

1. Change char so that it returns errors such as Left [Expected 'h' 'g'] ("Expected 'h', but got 'g'").

Solution

Begin by creating a new error:

data Error i e
  = ...
  | Expected i i

Next, we will refactor the code originally in satisfy into token, just so we avoid duplicating code. Then, rewrite satisfy to use this new function, and rewrite char to make use of it as well. The important part is simply that we change the error message.

token :: (i -> Error i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input ->
  case input of
    [] -> Left [EndOfInput]
    hd : rest
      | predicate hd -> Right (hd, rest)
      | otherwise -> Left [mkErr hd]

satisfy :: (i -> Bool) -> Parser i e i
satisfy = token Unexpected

char :: Eq i => i -> Parser i e i
char i = token (Expected i) (== i)

In GHCi you should see:

>>> parse (char 'h') "bye"
Left [Expected 'h' 'b']

---------------------------------------------------------------------

2. Create an eof :: Parser i e () function which succeeds only if we have arrived at the end of the input.
   + runParser (string "hello" <* eof) "hello" should return Right ("hello", "")
   + runParser (string "hello" <* eof) "hello, world" should return Left [ExpectedEndOfFile ','] ("Expected EOF, but got ','")

Solution

First, create a new error:

data Error i e
  = ...
  | ExpectedEndOfFile i

Now create a function that returns Right only if there is no more input, and an error otherwise.

eof :: Parser i e ()
eof = Parser $ \input ->
  case input of
    [] -> Right ((), [])
    hd : _ -> Left [ExpectedEndOfFile hd]

Trying it out in GHCi:

>>> parse (string "bye" <* eof) "bye"
Right ("bye","")
>>> parse (string "bye" <* eof) "byebye"
Left [ExpectedEndOfFile 'b']

---------------------------------------------------------------------

3. Change the definition of Parser so it contains a [i] -> Offset -> Either [Error i e] (Offset, a, [i]), where type Offset = Int. The offset indicates how many positions we have parsed.
   + You may also change Error to store an offset so that you can indicate where an error occurred.
   + runParser (string "hey") "hi" 0 should return Left [Error {erOffset = 1, erError = Expected 'e' 'i'}] ("Expected 'e', but got 'i', at position 1")

Solution

The solution is mostly mechanically replacing definitions where errors occurred. First, we rename Error to ErrorType, and create a new Error type:

data Error i e = Error
  { erOffset :: Offset
  , erError :: ErrorType i e
  } deriving (Eq, Show)

data ErrorType i e
  = ...

If you've done the previous exercises, then token may now be changed to take the offset as an input to the parser's function, and return offset + 1 on success. If an error occurred instead, we now construct Error with the given offset. Our mkErr function should now take an ErrorType instead of an Error.

token :: (i -> ErrorType i e) -> (i -> Bool) -> Parser i e i
token mkErr predicate = Parser $ \input offset ->
  case input of
    [] -> Left [Error offset EndOfInput]
    hd : rest
      | predicate hd -> Right (offset + 1, hd, rest)
      | otherwise -> Left [Error offset $ mkErr hd]

If you haven't done the previous exercises, then change satisfy in a similar manner, taking offset as an input and returning offset + 1 on success, and the error Error offset $ Unexpected hd on failure.

For the remainder of the functions, change the parser function to also take offset as an argument, and just return it unaffected. Take care to chain these offsets as required. For example, this is the new definition of Applicative:

instance Applicative (Parser i e) where
  pure a = Parser $ \input offset -> Right (offset, a, input)

  Parser f <*> Parser p = Parser $ \input offset ->
    case f input offset of
      Left err -> Left err
      Right (offset', f', rest) ->
        case p rest offset' of
          Left err -> Left err
          Right (offset'', output, rest') -> Right (offset'', f' output, rest')

Finally, in parse, use 0 as the initial offset. Keep in mind it's also possible to compute line and column information from the offset when you know you are operating on characters.

>>> parse (string "hey") "hi"
Left [Error {erOffset = 1, erError = Expected 'e' 'i'}]

The reader may also choose to instead refactor the parser type to use a custom-defined State type that contains the input stream and offset. Many improvements are possible, but we choose to keep it simple for educational purposes.

---------------------------------------------------------------------

What's next?

In this section, we've learned how to make a simple parser combinator library. There are still various improvements that can be made. To name a few of them, we can implement more instances, such as Semigroup and Monoid, IsString, MonadFail, and MonadPlus, among others. Error messages can be improved by better keeping track of location (such as line and column), supporting more input types such as Text and ByteString, pretty-printing error messages, etc.
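As a small illustration of how richer combinators fall out of what we already have, here is a hedged sketch of sepBy1 and sepBy. The names mirror combinators you will find in real libraries, but these definitions are ours and only use the Applicative and Alternative instances defined above:

-- Parse one or more occurrences of `p`, separated by `sep`.
sepBy1 :: (Eq i, Eq e) => Parser i e a -> Parser i e s -> Parser i e [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)

-- Like `sepBy1`, but zero occurrences are also accepted.
sepBy :: (Eq i, Eq e) => Parser i e a -> Parser i e s -> Parser i e [a]
sepBy p sep = sepBy1 p sep <|> pure []

In GHCi, runParser (char 'a' `sepBy` char ',') "a,a,a" should return Right ("aaa","").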
We will conclude this part here, and if you are interested, proceed to the next part, where Megaparsec, an industrial-strength parser combinator library, will be introduced with a more practical application than our simple examples in this section.

Megaparsec tutorial

We will now take a look at a practical application of parser combinators: parsing S-expressions. Hopefully, this example will give you a solid foundation for creating parsers for your grammars.

Megaparsec is the go-to parser for industrial projects, so we will also use it in our tutorial. If you wish to see how it compares to other parser combinator libraries in Haskell, you can read its README.md. Keep in mind you should choose the parser library that better suits your needs. If you plan to parse machine-written data, for example, then Attoparsec may be a better choice. Its interface is similar to Megaparsec's, so the knowledge you will learn here will also be useful.

Without any further ado, we will now present how to use Megaparsec to parse some S-expressions.

Representing S-expressions

To begin, create a file called SExp.hs. Also, make sure to get the megaparsec package. We will use the following data structure to represent various S-expressions. Gradually, we will write a parser that represents each of these cases.

module SExp where

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

newtype Identifier = Identifier
  { getId :: String
  } deriving (Show)

data SExp
  = SSExp SExp [SExp] -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
  | SInteger Integer  -- 42
  | SString String    -- "hello, world"
  | SBool Bool        -- false, true
  | SId Identifier    -- foo
  deriving (Show)

We also need an alias for the type which megaparsec uses.

type Parser = Parsec
  -- The type for custom error messages. We have none, so use `Void`.
  Void
  -- The input stream type. Let's use `String` for now, but for
  -- better performance, you might want to use `Text` or `ByteString`.
  String

Booleans

Let's start with the simplest of the parsers: SBool. We use <|>, which tries one parser, and failing that, will try the other one.

bool :: Parser Bool
bool = False <$ string "false" <|> True <$ string "true"

We will also make a parser to represent our SExp-specific variants. As we progress, we will add more parsers to this list.

atom :: Parser SExp
atom = choice
  [ SBool <$> bool
  ]

Here we use choice, which is the same as running <|> between each element of the list: choice [a, b, c] = a <|> b <|> c. It's important to notice that choice, as well as many other useful combinators, is part of the parser-combinators package, which Megaparsec conveniently re-exports for us. Should you fail to find a specific function in Megaparsec's documentation, chances are that it's defined in parser-combinators instead.

Now, fire up GHCi and test that we can parse it! megaparsec provides the parseTest function, which allows us to conveniently test our parsers.

>>> parseTest atom "false"
SBool False
>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false" or "true"

Integers

To parse integers, one strategy is to parse as many digits (chars whose value is between '0' and '9') as possible. This will give us a list of characters, which we can then read. For this parser, we use the some function, which tries to run a given parser 1 or more times. numberChar is exported from Text.Megaparsec.Char and parses any character that matches a digit. In regex, this is equivalent to [0-9]+.
This will return a [Char] (aka String), which we can read to turn it into an Integer.

integer :: Parser Integer
integer = read <$> some numberChar

We now change our atom to also include integers.

atom :: Parser SExp
atom = choice
  [ SBool <$> bool
  , SInteger <$> integer
  ]

Let's also test it in GHCi:

>>> parseTest atom "42"
SInteger 42
>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false", "true", or numeric character

Although the error message is good, wouldn't it be better if it said "integer" instead of "numeric character"? For this, we can use label, which allows us to give a name to a parser. To do this, we can simply change our parser like this:

integer :: Parser Integer
integer = label "integer" $ read <$> some numberChar

If you prefer, it's also possible to use <?>, which does the same as label.

integer :: Parser Integer
integer = read <$> some numberChar <?> "integer"

And now the error message should be slightly improved:

>>> parseTest atom "hey"
1:1:
  |
1 | hey
  | ^^^
unexpected "hey"
expecting "false", "true", or integer

You can also use label "boolean" in our bool parser to change the message to expecting boolean or integer.

Exercise: Can you change this parser so it also accepts negative integers?

Solution

One possible solution using <|>:

integer :: Parser Integer
integer = label "integer" $
  read <$> (some numberChar <|> ((:) <$> char '-' <*> some numberChar))

However, later in the tutorial, we will see an even better way that uses Megaparsec's own mechanisms.

---------------------------------------------------------------------

Strings

To parse a string, the idea is to find everything that is between double quotes. This approach has a weakness, though: we won't be able to parse escaped quotes. Later in this tutorial, we will present a solution to escaped characters using Megaparsec's lexing mechanisms.

To achieve this, we will use two new functions: between and takeWhileP. Using between open close x is equivalent to open *> x <* close, while takeWhileP will read as many characters as possible that satisfy some predicate. Note that takeWhileP also accepts, optionally, a label for each character that it tries to read. For now, we will leave it as Nothing.

str :: Parser String
str = label "string" $ between (char '"') (char '"') (takeWhileP Nothing (/= '"'))
-- Or:
-- str = label "string" $ char '"' *> takeWhileP Nothing (/= '"') <* char '"'

In atom, include SString <$> str.

>>> parseTest atom "\"hey\""
SString "hey"

Identifiers

An identifier will be like a variable in a programming language. We can choose any naming convention for such a thing, but here I chose the one that is frequently used in C-like languages: the first character must be a letter or an underscore, while the remainder may be a letter, digit, or underscore.

Here we use the many function, which is very similar to the some function discussed before. The difference is that many tries to run the given parser 0 or more times. In regex, our identifier parser is equivalent to [a-zA-Z_][a-zA-Z0-9_]*.

identifier :: Parser Identifier
identifier = label "identifier" $ do
  first <- letterChar <|> char '_'
  rest <- many $ alphaNumChar <|> char '_'
  pure $ Identifier $ first : rest
-- Or:
-- identifier = Identifier <$> label "identifier" ((:) <$> (letterChar <|> char '_') <*> many (alphaNumChar <|> char '_'))

Now include SId <$> identifier in atom.

Warning: When parsing with <|> or choice, the order of the parsers matters.
Take this as an example:

>>> parseTest (SBool <$> bool <|> SId <$> identifier) "true"
SBool True
>>> parseTest (SId <$> identifier <|> SBool <$> bool) "true"
SId (Identifier {getId = "true"})

In the first case, we parse a boolean before an identifier, which causes true to be matched by bool. However, in the second example, using identifier before bool caused true to be recognized as an identifier. You should be careful while considering such things. In most cases, you can assume that parsers are greedy. Notice that in the second case, the bool parser will never be executed. In other words, make sure that you parse a boolean in atom before you parse an identifier.

S-expressions

Finally, let's move on to our last parser, which parses S-expressions themselves, such as (foo 42 "hey") or (bar). First, we will do it the naive way, which may come as the first natural solution to someone just learning Megaparsec, and then we will do it in a better manner using extra tools offered by Megaparsec.

S-expressions: the awkward way

The idea here is to parse an atom, followed by 0 or more atoms. Of course, this means that we want to use many to parse such atoms. Additionally, an S-expression is surrounded by parentheses, so we can use between for such a task. This parser, however, is not as simple as the previous ones, so let's first do it the wrong way (which won't work properly), analyze the problem, and come up with a solution to it.

sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $
  between (char '(') (char ')') ((,) <$> atom <*> many atom)
-- Or:
-- sexp = label "S-expression" $ char '(' *> ((,) <$> atom <*> many atom) <* char ')'

Note that (,) <$> atom <*> many atom is pretty similar to some atom. parser-combinators exports a non-empty some that you can use instead, making the parser return Parser (NonEmpty SExp). Megaparsec re-exports the ordinary list version instead, so if you want to use this one, you'll need to import it manually and include a dependency on parser-combinators.

After adding it to atom with uncurry SSExp <$> sexp, we can test it in GHCi:

>>> parseTest atom "(foo)"
SSExp (SId (Identifier {getId = "foo"})) []

Ok, looks good so far. What if we try to apply an argument to foo?

>>> parseTest atom "(foo bar)"
1:5:
  |
1 | (foo bar)
  |     ^
unexpected space
expecting ')', '_', S-expression, alphanumeric character, boolean, identifier, integer, or string

Uh-oh. We can't parse spaces, as the error makes evident. Either we close the parentheses (stop parsing this S-expression), continue the definition of the identifier (_ or alphanumeric character), put another atom (boolean, identifier, integer, or string), or open another S-expression (with a parenthesis).

One solution that could come to mind is to use the function sepBy, which takes a parser and a separator, and use it as atom `sepBy` space. In this case, space is a parser defined by Megaparsec that skips zero or more spaces.

>>> parseTest (between (char '(') (char ')') $ SSExp <$> atom <*> (space *> atom `sepBy` space)) "(foo bar)"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]

Of course, this may raise other questions: what if we have a space character before or after the opening and closing parentheses? While explicitly skipping spaces works, the megaparsec documentation already offers a solution for this kind of problem, namely, a lexer. Indeed, we will see that some of our previous definitions could have been written more elegantly with the help of this module.
The important bit for us now is this:

    Parsing of white space is an important part of any parser. We propose a convention where every lexeme parser assumes no spaces before the lexeme and consumes all spaces after the lexeme; this is what the lexeme combinator does, and so it's enough to wrap every lexeme parser with lexeme to achieve this. Note that you'll need to call space manually to consume any white space before the first lexeme (i.e. at the beginning of the file).

S-expressions: the elegant way

The first thing you should do is import the appropriate lexing module. Megaparsec recommends importing this module qualified.

import qualified Text.Megaparsec.Char.Lexer as L

Now we need to define what whitespace is. Until now, with the usage of space, we have considered that it may be a space, tab, newline, carriage return, form feed, or vertical tab. Megaparsec can go beyond that and also consider line comments (such as -- comment) and block comments (such as {- comment -}) as whitespace. Since we get those for free, what about using ;; for line comments and /* */ for block comments?

Begin by defining skipSpace. We use the space helper function from the lexer module, which expects a function to parse 1 or more spaces, a line comment parser, and a block comment parser.

skipSpace :: Parser ()
skipSpace = L.space
  -- Like `space`, but skips 1 or more space characters.
  space1
  -- Skip from ;; until a newline.
  (L.skipLineComment ";;")
  -- Skip from /* until */. There is also `skipBlockComment`, but it doesn't handle nested comments.
  (L.skipBlockCommentNested "/*" "*/")

And now define a lexeme parser, which indicates how to consume space after the given parser.

lexeme :: Parser a -> Parser a
lexeme = L.lexeme skipSpace

Now you can use lexeme in each of the parsers, and they will automatically skip whitespace after their definitions. Note that we also wrap our opening parenthesis in lexeme, so we can skip space just after the (.

sexp :: Parser (SExp, [SExp])
sexp = label "S-expression" $ lexeme $
  between (lexeme (char '(')) (char ')') ((,) <$> atom <*> many atom)

You may add lexeme to other parsers where it's desirable to skip whitespace after the rule. For instance:

integer :: Parser Integer
integer = label "integer" $ lexeme $ read <$> some numberChar

Note: The reader should do the same for str, bool, and identifier. Alternatively, adding lexeme to atom will also work.

And now we can parse S-expressions properly:

>>> parseTest atom "( foo /* 8-) */ bar ;; look mom, comments! \n )"
SSExp (SId (Identifier {getId = "foo"})) [SId (Identifier {getId = "bar"})]

Our main parser function

Finally, let's make a function that solves the last two outstanding problems. For the first one, as mentioned by the Megaparsec documentation, lexeme will skip whitespace after a parser, but we also need to deal with whitespace at the beginning of the file. For the second one, we want to parse everything until we find an eof, since our parser will currently ignore anything after a complete atom. The two defects are exemplified below:

>>> parseTest atom " 42"
1:1:
  |
1 |  42
  | ^^^^
unexpected " 42"
expecting S-expression, boolean, identifier, integer, or string

>>> parseTest atom "42 foo bar"
SInteger 42

We will create a function that takes our input string and either returns an error message or the actual result. Such a function "bootstraps" our parser with between skipSpace eof atom, which will solve the two defects.
It will also serve as an auxiliary function that can be called by client code, rather than parseTest, which is more useful for working in GHCi.

parseSExp :: String -> Either String SExp
parseSExp input =
  let
    outputE = parse
      -- Skip any whitespace at the beginning, expect the end of input after the atom.
      (between skipSpace eof atom)
      -- Name of the source file, you can give it any name you want. I leave it blank for now.
      ""
      -- The actual string to be parsed.
      input
  in
    -- If we get Left, it will be a `ParseErrorBundle`; let's pretty-print it for now.
    -- If we get Right, do nothing.
    case outputE of
      Left err -> Left $ errorBundlePretty err
      Right output -> Right output

Now in GHCi:

>>> case parseSExp " 42" of { Left e -> putStrLn e; Right r -> print r }
SInteger 42
>>> case parseSExp "42 foo" of { Left e -> putStrLn e; Right r -> print r }
1:4:
  |
1 | 42 foo
  |    ^
unexpected 'f'
expecting end of input

One more case: Double

There is one more caveat that Megaparsec users should know about: backtracking. We show a somewhat contrived but realistic example. To illustrate this, let's add one more case to our SExp:

data SExp
  = ...
  | SDouble Double -- 42f, 3.1415f

We will parse it as such: first, some digits, optionally followed by a . and some more digits, and an f at the end. In regex, this is equivalent to [0-9]+(\.[0-9]+)?f. We achieve the optional part using the optional function, which returns a Parser (Maybe a).

double :: Parser Double
double = label "double" $ lexeme $ read <$> do
  left <- some numberChar
  rightM <- optional $ do
    char '.'
    some numberChar
  char' 'f' -- Like char, but case-insensitive
  pure $ case rightM of
    Nothing -> left
    Just right -> left ++ "." ++ right

Now, we add it to atom: SDouble <$> double. Depending on whether you added double before or after integer, the result may be surprising:

* With double before integer:
  + parseTest atom "42" fails with an error at 1:3: unexpected end of input, expecting '.', 'F', 'f', or numeric character.
  + parseTest atom "42f" prints SDouble 42.0.
* With double after integer:
  + parseTest atom "42" prints SInteger 42.
  + parseTest atom "42f" prints SInteger 42.

Either way, only one of the two cases is parsed correctly. To fix it, one solution would be to use try, which allows Megaparsec to do as much backtracking as it wants.

-- Note: Other cases omitted for brevity.
atom :: Parser SExp
atom = choice
  [ SDouble <$> try double
  , SInteger <$> integer
  ]

And it should fix the problem. However, this is not ideal in terms of performance, as it will cause Megaparsec to consume the entirety of the double, only to later find that there is no f, backtrack, and finally try again with an integer. In this situation the cost is trivial, but care must be taken in more complex examples; for example, what if we wanted to have (foo bar) as well as tuples such as (foo bar, baz quz)? The S-expression could be very long and lead to exponential complexity in the worst-case scenario. Recently, a very similar problem was fixed in one of our projects.

As we described in the introduction of this tutorial, it's important to notice that Megaparsec normally operates as an LL(1) parser, meaning it tries to look at only one character ahead when executing parsers in an alternative. Using try allows us to consume an arbitrary number of characters before backtracking, allowing it to be used as an LL(∞) parser.

In this situation, the best thing to do is to refactor the grammar to share common parts. The code above could be written without try as such:

numeric :: Parser SExp
numeric = label "number" $ lexeme $ do
  -- Left side, before the '.' or 'f'.
  left <- some numberChar
  -- ".0f" or "f"
  rightM <- optional $ choice
    [ do
        char '.'
        right <- some numberChar
        char' 'f'
        pure $ left ++ "." ++ right
    , do
        char' 'f'
        pure left
    ]
  -- If we had the right side, it's a double, otherwise an integer.
  pure $ case rightM of
    Nothing -> SInteger $ read left
    Just right -> SDouble $ read right

And in atom, we simply use numeric instead of the two previous parsers.

Rewriting with Megaparsec's lexer

A lot of these things could be written using the lexer module from Megaparsec. In this section, we will present the simplified versions of the functions that can be improved. Because we will use floatingOrInteger, you should include the scientific package and import it:

import Data.Scientific

And here is our rewritten parser. Note that our new numeric has a quirk: 1.0 is parsed as an integer, since floatingOrInteger treats any integral value as an integer even when written with a decimal point. A resolution to this is left as an exercise for the reader.

-- signed takes a parser describing how to skip space after the sign.
integer :: Parser Integer
integer = label "integer" $ lexeme $ L.signed skipSpace L.decimal

double :: Parser Double
double = label "double" $ lexeme $ L.signed skipSpace L.float <* char' 'f'

-- charLiteral also takes escaped characters into account.
str :: Parser String
str = label "string" $ lexeme $ char '"' *> manyTill L.charLiteral (char '"')

numeric :: Parser SExp
numeric = lexeme $ do
  value <- L.signed skipSpace L.scientific
  case floatingOrInteger value of
    Left d -> SDouble d <$ char' 'f'
    Right i -> do
      f <- optional $ char' 'f'
      pure $ case f of
        -- Note: 1.0 is an integer, but 1.1 is a parser error.
        Nothing -> SInteger i
        Just _ -> SDouble $ fromIntegral i

Note the changes to str as well. manyTill will run L.charLiteral, consuming characters (including escaped characters) until the matching " is found.

As a bonus, we got various improvements:

1. No need to read values.
2. We can use signed to parse negative numbers.
3. We can parse escaped characters in strings.

The functions for integer and double were kept for reference, even though numeric is used instead.

Now let's check it in GHCi:

>>> parseTest atom "(foo -42 \"with \\\" now!\")"
SSExp (SId (Identifier {getId = "foo"})) [SInteger (-42),SString "with \" now!"]

And I hope that with this you can better understand how parser combinators work. Now get to hacking on your own languages! The complete code for this part can be found here.

Bonus: quasi-quotations

This section is better understood if you have some Template Haskell (TH) knowledge. If you haven't learned about it, then I strongly encourage you to check out our Introduction to Template Haskell before trying out this section. Note that this section is not necessary for our knowledge of parser combinators; it's purely a bonus section on how to do more with our S-expressions. For this part, you will need the template-haskell package.

Long story short, Template Haskell is a mechanism that allows GHC to evaluate and analyze code during compile time. One such application is using QuasiQuoters, which allow us to run custom parsers during compile time. Here we will use them for two purposes: to use S-expressions as Haskell expressions and to use S-expressions as Haskell patterns, both of which can be parsed during compilation.

We will create a new file called TH.hs, whose definition is given below.
module TH where

import Control.Monad ((<=<))
import Language.Haskell.TH.Quote (QuasiQuoter (..))
import Language.Haskell.TH.Syntax (Q, dataToPatQ, liftData)

import SExp (SExp, parseSExp)

sexp :: QuasiQuoter
sexp = QuasiQuoter
  { quoteExp = liftData <=< parseOrThrow
  , quotePat = dataToPatQ (const Nothing) <=< parseOrThrow
  , quoteType = const $ fail "No S-expression parser for type"
  , quoteDec = const $ fail "No S-expression parser for declarations"
  }
  where
    parseOrThrow :: String -> Q SExp
    parseOrThrow = either fail pure . parseSExp

Additionally, liftData and dataToPatQ assume that our parsed SExp implements the Data type class. We will change our SExp module to automatically derive this instance for us.

First, enable the DeriveDataTypeable language extension, so GHC can automatically implement Data for us. You also need to import Data from Data.Data.

{-# LANGUAGE DeriveDataTypeable #-}

module SExp where

import Data.Data (Data)

-- The rest of the file goes here...

And now change Identifier and SExp to derive Data:

newtype Identifier = Identifier
  { getId :: String
  } deriving (Data, Show)

data SExp
  = SSExp SExp [SExp] -- (foo 0 "hello" bar), (bar (baz 1)), (foo)
  | SInteger Integer  -- 42
  | SString String    -- "hello, world"
  | SBool Bool        -- false, true
  | SId Identifier    -- foo
  | SDouble Double    -- 42f, 3.1415f
  deriving (Data, Show)

The basic contents of the TH module are summarized as follows:

* Each field of QuasiQuoter provides a String parser for a different piece of Haskell syntax.
* quoteExp creates a Haskell expression.
  + This will allow us to write expressions such as let mySExp = [sexp|(foo 42)|].
  + We use our standard Megaparsec parser, parseSExp, but fail if it can't parse (halting compilation). liftData then transforms our SExp into an Exp.
  + If this is your first time seeing <=<, it's just another way to compose monadic functions. g <=< f is the same as \x -> g =<< f x.
  + If you've read our previous TH tutorial, it creates an Exp just like [e||].
* quotePat creates a Haskell pattern.
  + Example: case mySExp of {[sexp|(foo 42)|] -> True; _ -> False}.
  + Similarly to quoteExp, we use dataToPatQ (const Nothing) to transform our SExp into a Pat.
    o Confusingly, liftData is defined as dataToExpQ (const Nothing).
    o The const Nothing can be replaced to handle type-specific cases. We are fine with the default behavior for now, but we'll change it later.
  + Creates a Pat, like [p||].
* quoteType creates a Haskell type.
  + Since we have no SExp, SInteger, etc. types matching our constructors, we will fail when someone tries to use our parser for types.
  + Creates a Type, like [t||].
* quoteDec creates a Haskell declaration.
  + Similarly to quoteType, we won't define this one.
  + Creates a Dec, like [d||].

Now enable the QuasiQuotes extension in GHCi and you should be able to test it. Notice how failing to parse causes GHCi to fail to interpret the given S-expression (a compilation error).

>>> :set -XQuasiQuotes
>>> [sexp|(foo "bar" 42)|]
SSExp (SId (Identifier {getId = "foo"})) [SString "bar",SInteger 42]
>>> [sexp|no parenthesis|]
<interactive>:1135:7: error:
    * 1:4:
        |
      1 | no parenthesis
        |    ^
      unexpected 'p'
      expecting end of input
    * In the quasi-quotation: [sexp|no parenthesis|]

There are still two improvements we could make, however.
The first one is that Megaparsec could use the actual Haskell location:

>>> let SInteger x = [sexp|.|] in x
(omitted for brevity, error in 1:1)

As for the second one, it would be much better if we could bind S-expression variables into Haskell ones, such as the following:

foo [sexp|(foo $x)|] = Just x
foo _ = Nothing

Using Haskell locations

Let's change our parser runner a bit so it starts with the current line and column. We will create a new auxiliary parseSExpWithPos, which takes a SourcePos and uses it for the initial state. This function was created from our old parseSExp. Additionally, we create a new parseSExp with the default position.

We now call runParser' instead of directly calling parse, as it allows us to give it an initial state, and we build such a state with the default settings. Keep in mind that we're getting into the internal behavior of Megaparsec here, so changes in the package can break this code.

parseSExp :: String -> Either String SExp
parseSExp = parseSExpWithPos (initialPos "")

parseSExpWithPos :: SourcePos -> String -> Either String SExp
parseSExpWithPos pos@(SourcePos file _line _column) input =
  let
    initState = State
      { stateInput = input
      , stateOffset = 0
      , statePosState = PosState
        { pstateInput = input
        , pstateOffset = 0
        , pstateSourcePos = pos
        , pstateTabWidth = defaultTabWidth
        , pstateLinePrefix = ""
        }
      , stateParseErrors = []
      }
    outputE = snd $ runParser'
      -- Skip any whitespace at the beginning, expect the end of input after the atom.
      (between skipSpace eof atom)
      -- The input state with the string to be parsed.
      initState
  in
    -- If we get Left, it will be a `ParseErrorBundle`, let's pretty print it.
    -- If we get Right, do nothing.
    case outputE of
      Left err -> Left $ errorBundlePretty err
      Right output -> Right output

Now we can change our TH module. We will make new imports:

import Language.Haskell.TH.Syntax (Loc (..), Q, dataToPatQ, liftData, location)
import Text.Megaparsec (SourcePos (..), mkPos)

import SExp (SExp, parseSExpWithPos)

And we will change parseOrThrow as follows:

parseOrThrow :: String -> Q SExp
parseOrThrow input = do
  posTH <- location
  let pos = SourcePos
        (loc_filename posTH)
        (mkPos $ fst $ loc_start posTH)
        (mkPos $ snd $ loc_start posTH)
  either fail pure $ parseSExpWithPos pos input

With it, we now use location to query where the splice occurred and put it in the format that Megaparsec expects. Trying it in GHCi gives me:

>>> let SInteger x = [sexp|foo 42|] in x
<interactive>:1730:24: error:
    * <interactive>:1730:28:
         |
    1730 | foo 42
         |
      unexpected '4'
      expecting end of input
    * In the quasi-quotation: [sexp|foo 42|]

For visualization, it may be better if we create a small test file:

module Test where

import TH (sexp)

testQQ = [sexp|foo 42|]

Compiling it results in:

Test.hs:5:16: error:
    * Test.hs:5:20:
        |
      5 | foo 42
        |
      unexpected '4'
      expecting end of input
    * In the quasi-quotation: [sexp|foo 42|]
  |
5 | testQQ = [sexp|foo 42|]
  |          ^^^^^^^^

You may freely change how errors are printed by replacing errorBundlePretty with a custom error printing function, but for this tutorial, this is good enough for us.

Binding variables in patterns

As mentioned, it would be nice if we could bring S-expression bindings into the Haskell scope. For this, let's add a new term into our SExp:

data SExp
  = ... -- other constructors omitted for brevity
  | STH String -- $meta

We may now add a parser to extract it. Let's reuse our identifier for this:

th :: Parser String
th = label "TH variable" $ lexeme $ char '$' *> fmap getId identifier

Don't forget to add it to atom with STH <$> th.
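For reference, here is a sketch of what atom might look like at this point. The exact list depends on which variants you kept along the way (we assume the numeric parser from the lexer section); the important part is that bool still comes before identifier and that the new STH case is included:

atom :: Parser SExp
atom = choice
  [ uncurry SSExp <$> sexp
  , SBool <$> bool
  , numeric
  , SString <$> str
  , STH <$> th
  , SId <$> identifier
  ]

With this, parseTest atom "(foo $x)" should print SSExp (SId (Identifier {getId = "foo"})) [STH "x"].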
We are now ready to add it to our TH module. First, we will need some new imports:

import Data.Typeable (Typeable, cast)
import Language.Haskell.TH.Lib (varP)
import Language.Haskell.TH.Syntax (Loc (..), Name, Pat, Q, dataToPatQ, liftData, location, mkName)
-- no Loc and location if you've skipped the previous item

import SExp (SExp (STH), parseSExpWithPos)
-- or parseSExp if you've skipped the previous item

Typeable is a type class that allows us to try to cast some type into another. Here, we attempt a cast into SExp. We will add a new function to aid us:

sexp = -- omitted for brevity
  where
    antiQuoteVar :: Typeable a => a -> Maybe (Q Pat)
    antiQuoteVar var =
      case cast var of
        Nothing -> Nothing
        Just var' -> case var' of
          STH var'' -> Just $ varP $ mkName var''
          _other -> Nothing

We can now replace const Nothing with antiQuoteVar in our quasi-quoter:

  , quotePat = dataToPatQ antiQuoteVar <=< parseOrThrow

Now check whether it works:

>>> case [sexp|(foo 42)|] of
...   [sexp|(foo $answer)|] -> print answer
...   _ -> putStrLn "oops"
SInteger 42

It's also possible to give a meaning to $answer at the expression level by having it capture a Haskell variable instead. Let's change TH a bit to support it:

antiQuoteVar :: Typeable a => (Name -> Q b) -> a -> Maybe (Q b)
antiQuoteVar mkVar var =
  case cast var of
    Nothing -> Nothing
    Just var' -> case var' of
      STH var'' -> Just $ mkVar $ mkName var''
      _other -> Nothing

And our QuasiQuoter now has:

  { quoteExp = dataToExpQ (antiQuoteVar varE) <=< parseOrThrow
  , quotePat = dataToPatQ (antiQuoteVar varP) <=< parseOrThrow

Don't forget to import dataToExpQ (which now replaces liftData) as well as varE. We can now test our quasi-quoter again:

>>> let x = SInteger 42 in case [sexp|(foo $x)|] of
...   [sexp|(foo $answer)|] -> print answer
...   _ -> putStrLn "oops"
SInteger 42

And that's it! You now support quasi-quotations. The complete code for the parser and quasi-quoter may be found here.

Conclusion

In this post, we've introduced the theory and a practical application of parser combinators. We've discussed what they are as well as their pros and cons, implemented a simple parser combinator from scratch, learned how to write an S-expression parser with Megaparsec, and created a simple quasi-quotation to consume input at compile time.

Some improvements can still be made, and they are left as an exercise to the reader. For example, the S-expression parser operates on Strings, but Megaparsec also supports Text and ByteString as the input stream, which should have better performance. On top of that, if the reader is familiar with type families (otherwise check out our article on Type Families), they may try to improve our parser from scratch to also operate on more data types.

Appendix

Common parsing functions

Below we present some other useful functions and combinators you may want to use. We show their names, types, and a comment about each.

many :: Alternative f => f a -> f [a]
  Parse 0 or more occurrences of the given parser.

some :: Alternative f => f a -> f [a]
  Parse 1 or more occurrences of the given parser.

option :: Alternative m => a -> m a -> m a
  Parse 0 or 1 occurrence of the given parser, with a fallback a if it fails.

optional :: Alternative f => f a -> f (Maybe a)
  Like option, but returns a Maybe instead.

sepBy, sepBy1 :: Alternative m => m a -> m sep -> m [a]
  Parses several occurrences of the given parser separated by the given separator. Functions suffixed with 1 parse 1 or more, while the ones without parse 0 or more.
sepEndBy, sepEndBy1 :: Alternative m => m a -> m sep -> m [a]
  Like the sepBy variants, but additionally allow the separator to optionally appear at the end of the sequence.

endBy, endBy1 :: Alternative m => m a -> m sep -> m [a]
  Like the sepEndBy variants, but the separator at the end is not optional.

char :: (MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s)
  Parses a single character matching the given char.

char' :: (MonadParsec e s m, Token s ~ Char) => Token s -> m (Token s)
  Like char, but case-insensitive.

string :: MonadParsec e s m => Tokens s -> m (Tokens s)
  Parses many characters matching the given string.

string' :: (MonadParsec e s m, FoldCase (Tokens s)) => Tokens s -> m (Tokens s)
  Like string, but case-insensitive.

makeExprParser :: MonadPlus m => m a -> [[Operator m a]] -> m a
  Allows parsing generic expressions, supporting infix, prefix, and postfix operators with precedences.

Further reading

* Parser Combinators in Elixir
* How to Implement an LR(1) Parser
* Parsec: "try a <|> b" considered harmful
* Quasiquotation 101

For more Haskell tutorials, you can check out our Haskell articles or follow us on Twitter or Dev.