write you a parser for okay acceptable!

Written: 2021-01-05 to 2021-02-26
by Norman Liu

Parsec is a beautiful library, but it’s DIFFICULT. Every time I’m working on a problem that requires an actual parser beyond regex, I die a little inside. I used Parsec for my Scheme interpreter, and it was probably the most frustrating part of the project.

This isn’t the fault of the authors, though. Parsers are complicated beasts, and I definitely didn’t know what I was doing the first time around.

There’s an interesting problem on Codewars - given an arbitrary molecule string, return the count of each constituent atom.

CH3CO2H -- acetate

Sounds pretty simple, right?

import Control.Applicative
import Control.Arrow (first, (>>>))
import Control.Monad (ap, liftM, (>=>))
import Data.Char (isDigit, isLower, isUpper)
import Data.Map qualified as Map
import Data.Maybe (fromMaybe, maybeToList)
import Text.Read (readMaybe)

parseMolec :: String -> [(Char, Int)]
parseMolec [] = []
parseMolec [c] = [(c, 1)]
parseMolec (c:cs) = let (ns, cs') = span isDigit cs
                     in (c, (fromMaybe 1 . readMaybe) ns) : parseMolec cs'

>>> parseMolec "CH3CO2H"
[('C',1),('H',3),('C',1),('O',2),('H',1)]

That’s it, we’re done. End of post.

Obviously, we’re not done. Even organic chemists deal with atoms beyond C, H, O, N.

>>> parseMolec "Mg"
[('M', 1), ('g', 1)]

To make matters more complicated, chemists are just like programmers - lazy. For polymeric materials with repeating groups, there’s a handy shorthand for expressing their structure. For example, the formula for PVC is C2H3Cl - I guess that would just be VC - and because it forms arbitrarily long chains, it’s written as (C2H3Cl)n. What a tradeoff - chemists making their lives easier at our expense. Say we had to parse something like

CH3(CH2)2CONH2 -- butyramide

It seems like we can treat a parenthesized group as a subexpression of its own, multiplying every count inside it by the coefficient that comes after it.

You can look at this notation as a form of run-length encoding, where you express repeating substructures of data as pairs (s, n). It’s the simplest form of data compression, and it’s easy to encode an arbitrary string. Decoding it is slightly harder, especially when you consider that these expressions can be nested:

 K4(ON(SO3)2)2 -- Fremy's salt
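Setting nesting aside for a moment, plain run-length encoding over characters really is that simple - a sketch, with names of my own:

```haskell
import Data.List (group)

-- encode: collapse runs of equal characters into (char, count) pairs
rle :: String -> [(Char, Int)]
rle = map (\g -> (head g, length g)) . group

-- decode: expand each pair back out
unrle :: [(Char, Int)] -> String
unrle = concatMap (uncurry (flip replicate))
```

Decoding the nested, chemical flavour of this is the rest of the post.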

Let’s take a step back and write a (roughly) formal grammar for this problem.

atom = uppercase, [lowercase], [number]
molecule = {atom | "(", molecule, ")", [number]}

In EBNF, curly braces denote repetition, and square brackets denote optionals. The regular expression is left as an exercise for the reader (i.e. I don’t want to figure it out.)

We know that these expressions are recursive, so our parser is also going to be recursive. We consume an input string, apply these rules one character at a time, and treat the results of deeper recursive calls as atomic values. We’ll also need to pick up from where the deeper recursive calls left off in the input.

Let’s start off with a simple data type and work from there. We know that a parser is a function that will take in a String, and return some value from consuming that string, along with the remainder of the input. There’s also a chance that our parser might fail, so we’ll reflect this on the type level by wrapping the result in a Maybe. If we were doing something less trivial, we might use Either or ExceptT for more descriptive errors.

newtype Parse a = P { runParse :: String -> Maybe (a, String) } 

Our Parse type is just the State monad with the added possibility of failure. As we consume our string, we need to thread the remaining input along. Let’s turn our parser into a monad so we can do this implicitly.

instance Functor Parse where
    fmap f p = P $ \s -> runParse p s >>= \(a, s') -> Just (f a, s')

The >>= in this case refers to our Maybe context - the \(a, s') function will only be applied if the parser action succeeds. This is where monads really shine, letting you sequence effects in a way that abstracts out the repetition of managing nested contexts.

Any time you find yourself writing a lambda expression just to pass it as the first parameter of a monadic binding, you can usually abstract that away with a Kleisli fish. I mean arrow.

-- (>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
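If the fish is new to you, here’s a tiny Maybe-level example of composing two functions that can fail (halve is a made-up name):

```haskell
import Control.Monad ((>=>))

-- halve succeeds only on even numbers
halve :: Int -> Maybe Int
halve n = if even n then Just (n `div` 2) else Nothing

-- (>=>) pipes one success into the next; any Nothing short-circuits
quarter :: Int -> Maybe Int
quarter = halve >=> halve
```

quarter 12 walks 12 -> 6 -> 3; quarter 10 dies at the second halving.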

instance Functor Parse where
  fmap f p = P $ runParse p >=> \(a, s') -> Just (f a, s')

But since every Monad is an Applicative, we can get all the weaker classes for free. Even better.

instance Functor Parse where
  fmap = liftM 

instance Applicative Parse where
    (<*>) = ap
    pure = return

Now we have to actually write some code. We know that in order to qualify our data type as a Monad, it needs to fulfill the monadic laws and have a lawful implementation of (>>=) and return.

Putting a value into the bare-minimum context of a monad shouldn’t introduce any effects. In our case, return should be a parser that consumes no input, and successfully yields its value regardless.

instance Monad Parse where
  return :: a -> Parse a
  return x = P $ Just . (x , )
  -- return x = P $ \s -> Just (x, s)

The tricky part is figuring out what it means to bind a parser into the next. In order to get sequential evaluation of two parsers, f and g, g should be run on the remaining input from running f. If f fails to parse its input, then g should also fail.

(>>=) :: Parse a -> (a -> Parse b) -> Parse b
p >>= f = P $ runParse p -- apply the function held by p 
            >=> \(a, s') -> -- if it succeeds, it will return a tuple (value, newState)
              let g = f a  -- when `f` is applied to `a`, it will yield a value of type `m b` 
               in runParse g s' -- i.e. (a P holding a function `String -> Maybe (b, String)`)

As usual, we can use arrows to make this a lot cleaner.

(>>=) :: Parse a -> (a -> Parse b) -> Parse b
p >>= f = P $ runParse p >=> (first f >>> uncurry runParse) 

This is a little precarious, though. Parsing is a task that, more often than not, will fail. If a single parser is unable to read its input, we’d like to be able to try an alternative before giving up. Good thing this is already implemented for us. The Alternative instance for Maybe looks something like

instance Alternative Maybe where
  empty = Nothing
  Nothing <|> b = b
  Just a  <|> _ = Just a 

Where we only need to evaluate b if a is empty. You can think of <|> as || lifted to effectful types. If you squint hard enough, it looks like a stickbug. We’ll call it that from here on out, but it’s more like a choice operator if anything. In our case, if one parser succeeds, we don’t need to try the other one. We can take advantage of the fact that functions are Applicative, and use liftA2 to write the equivalent of \s -> runParse a s <|> runParse b s.

instance Alternative Parse where
    empty = P (const Nothing)
    a <|> b = P $ liftA2 (<|>) (runParse a) (runParse b)
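To see the function-Applicative trick on its own, outside the parser (tryBoth and the two helpers are made-up names):

```haskell
import Control.Applicative (liftA2, (<|>))

-- for functions, liftA2 f g h = \s -> f (g s) (h s):
-- the shared argument s is fed to both sides
tryBoth :: (String -> Maybe Char) -> (String -> Maybe Char) -> (String -> Maybe Char)
tryBoth = liftA2 (<|>)

headIfUpper :: String -> Maybe Char
headIfUpper (c:_) | c `elem` ['A'..'Z'] = Just c
headIfUpper _ = Nothing

headIfDigit :: String -> Maybe Char
headIfDigit (c:_) | c `elem` ['0'..'9'] = Just c
headIfDigit _ = Nothing
```

tryBoth headIfUpper headIfDigit accepts a string starting with either, trying the uppercase check first - exactly the shape of our Alternative instance.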

Now that we have a rudimentary state machine, we can write the basis for open-ended parsers. We only really need 3 simple primitives:

-- we need to be able to consume a value from our input
anyChar :: Parse Char
anyChar = P $ \case []     -> Nothing
                    (x:xs) -> Just (x, xs)

-- we need to verify that we've parsed everything
eof :: Parse ()
eof = P $ \case [] -> Just ((), [])
                _  -> Nothing

-- we need to be able to conditionally fail
cond :: (a -> Bool) -> Parse a -> Parse a
cond test parser = do
    res <- parser
    if test res
       then pure res
       else empty

Let’s try this out on some simple inputs now:

>>> runParse anyChar "hello"
Just ('h', "ello")

>>> runParse ( cond (\c -> c == 'h') anyChar ) "xello"
Nothing

A parser that parses unconditionally is boring, and a parser that only parses a single character is useless. Since we’ve proven that our parser is a Monad, Applicative, Functor, and Alternative, we get a ton of functions for free.

We’ll start off by defining something to parse a specific character:

char :: Char -> Parse Char
char c = cond (== c) anyChar

then some more general selectors:

upper, lower, digit :: Parse Char
upper = cond isUpper anyChar
lower = cond isLower anyChar
digit = cond isDigit anyChar
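One combinator the later examples lean on but we never wrote down is string, which matches an exact literal. Since Parse is Applicative, it falls straight out of traverse. A self-contained sketch, condensing the instances from above:

```haskell
newtype Parse a = P { runParse :: String -> Maybe (a, String) }

instance Functor Parse where
  fmap f p = P $ \s -> runParse p s >>= \(a, s') -> Just (f a, s')

instance Applicative Parse where
  pure x = P $ \s -> Just (x, s)
  pf <*> px = P $ \s -> runParse pf s >>= \(f, s') ->
                        runParse px s' >>= \(x, s'') -> Just (f x, s'')

-- match one specific character
char :: Char -> Parse Char
char c = P match
  where match (x:xs) | x == c = Just (x, xs)
        match _               = Nothing

-- match an exact literal: run `char` for each character, in order
string :: String -> Parse String
string = traverse char
```

traverse runs char c for every character of the literal and collects the results; a single mismatched character fails the whole literal.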

Now we have everything we need to parse a simple atom (again). Control.Applicative exports some really useful functions for Alternative types - optional, some, and many - that make our lives a LOT easier. You can think of them as generalized versions of the regex quantifiers ?, +, and *. Instead of transforming the match behaviour of a single pattern, they’ll run an Alternative action zero or one times, one or more times, or zero or more times. You can implement a regex engine using parser combinators and these operators pretty trivially, since monadic parsers are strictly more powerful than regular expressions.

If you’re savvy, you might recognize some and many as the Kleene plus and Kleene star, respectively. When we apply a specific parser, all we’re doing is providing a set of strings that might be matched:

foo :: Parse String
foo = string "foo" <|> string "fooo"
-- /fooo?/

charO :: Parse Char
charO = char 'o'
-- /o/

Transforming a parser with these operators yields a much larger set of matches:

foooo :: Parse String
foooo = do
  f <- char 'f'
  o <- some (char 'o')
  pure $ (f : o)
-- /fo+/

foooo will now match a countably infinite set of strings, in the form:

["fo", "foo", "fooo", "foooo", ...]

Formally, the Kleene star represents the repeated concatenation of a set with the result of cross-concatenating each of its elements together, yielding a new set (including the set’s identity):

lowers :: [Char] 
lowers = ['a' .. 'z']

lowers0 :: [String]
lowers0 = [""]

lowers1 :: [String]
lowers1 = lowers0 <> [[c] | c <- lowers] -- include empty string

lowers2 :: [String]
lowers2 = lowers1 <> [(c:c1) | c <- lowers, c1 <- lowers1]
  -- ["", "a", "b", ... "aa", "ab", "ac" ...]

lowers3 :: [String]
lowers3 = lowers2 <> [(c:c2) | c <- lowers, c2 <- lowers2]
  -- ["", "a", ..., "aa", ... "aaa", "aab", ...]

...

lowersN :: Int -> [String]
lowersN n = [0..n] >>= (\x -> sequence $ replicate x lowers)
  -- [sequence $ replicate x "abcde" | x <- [0..n]]

which is isomorphic to the set of all the n-ary Cartesian powers of the set for n <- 1.., plus the set’s identity element. The Kleene plus just omits the identity element. With this distinction, the Kleene star forms a Monoid, while the Kleene plus is only a Semigroup. In fact, Kleene stars construct a free monoid over their inputs.

This is where regular expressions are truly beautiful - when you write a regex, you’re specifying exactly what should be matched, with regards to the set of all inputs the resulting automaton will succeed on. You could rewrite /foo*/ as /(fo|foo|fooo|foooo|...)/, but that isn’t feasible for people with finite lifetimes. Having these operators on hand allows you to arbitrarily expand the set of accepted inputs for your parser.

In the Alternative context, these operators will run an action repeatedly until they yield empty - if you aren’t passing a state parameter around, your program will likely enter an infinite loop. Just something to keep in mind.

The Alternative class is similar to Monoid - the stickbug operator <|> is an associative binary function between two Alternatives, and empty is an identity element such that:

(a <|> b) <|> c = a <|> (b <|> c)

empty <|> empty = empty
empty <|> alt   = alt
alt   <|> empty = alt
alt1  <|> alt2  = alt1 -- for Maybe: left-biased, the first success wins

Where these types differ is that for Maybe, joining Alternatives is biased towards the first success - it’s a choice, rather than a composition.

>>> Nothing  <>  Nothing
Nothing
>>> Nothing  <|> Nothing
Nothing

>>> Nothing  <>  Just "i"
Just "i"
>>> Nothing  <|> Just "i"
Just "i"

>>> Just "H" <>  Nothing
Just "H"
>>> Just "H" <|> Nothing
Just "H"

>>> Just "H" <>  Just "i"
Just "Hi"
>>> Just "H" <|> Just "i"
Just "H"

Having gained the ability of choice, we also have optionals now - optional takes an Alternative action, runs it, and wraps its result in a Maybe instead of failing with empty. This has the advantage of always succeeding, so we can treat a failure as a slight speedbump instead of an error that has to be propagated forwards. You can also just use <|>:

-- H, O, Mg, Ne, etc.
symbol :: Parse String
symbol = do 
  up <- upper -- single character
  low <- maybeToList <$> optional lower -- either [] or a lowercase
  pure (up : low)
  
num :: Parse Int
num = read <$> some digit -- will only parse 1 or more digits, so we don't *have* to use readMaybe

type Atoms = (String, Int)

-- "H2" -> ("H", 2), etc.
atoms :: Parse Atoms
atoms = do
  sym <- symbol
  n <- num <|> pure 1 -- no annotation = single atom
  pure (sym, n)

Let’s try this out now:

>>> runParse atoms "MgOH2"
Just (("Mg", 1), "OH2")

>>> runParse atoms "H2O2"
Just (("H", 2), "O2")

>>> runParse atoms "Hi2You!!!"
Just (("Hi", 2), "You!!!")

>>> runParse (many atoms) "HaHaUNLeSS10"
Just ([("Ha",1),("Ha",1),("U",1),("N",1),("Le",1),("S",1),("S",10)],"")

Nice. We can parse simple atoms now, and sequences of them at that. We have functions that will incrementally consume streams of input, giving up without munching anything if they fail. Let’s take a deeper look at what’s happening - if we had a parser like

iLike :: Parse String
iLike = string "i liek "
      >> string "mudkipz" <|> string "parsers" <|> string "pasta"
-- /i liek (mudkipz|parsers|pasta)/

Your parsing process internally would look a little like this:

input = "i liek pasta"

Pipeline:
input => [string "i liek "] ---> [string "mudkipz" <|> string "parsers" <|> string "pasta"] => output

input => [string "i liek "] ---> [string "mudkipz" <|> string "parsers" <|> string "pasta"] => output
       ^                       ^    
       |                       |
"i liek pasta"                 |
                            "pasta"

Zooming in:
[string "mudkipz"] = 
  [char 'm'] -> [char 'u'] -> [char 'd'] -> [char 'k'] -> [char 'i'] -> [char 'p'] -> [char 'z'] 
         | FAIL -> Nothing
        "pasta"

[string "parsers"] = 
  [char 'p'] -> [char 'a'] -> [char 'r'] -> [char 's'] -> [char 'e'] -> [char 'r'] -> [char 's']
         | OK          |             |
         "pasta"       | OK          |
                      "asta"         | FAIL -> Nothing
                                    "sta"

[string "pasta"] =  
  [char 'p'] -> [char 'a'] -> [char 's'] -> [char 't'] -> [char 'a']
         | OK          |             |             |             |  
         "pasta"       | OK          |             |             |  
                       "asta"        | OK          |             |  
                                    "sta"          | OK          |  
                                                  "ta"           | OK -> ("pasta", "")
                                                                "a" 

If you were writing a compiler from a regular expression down to a DFA, you might want to combine common paths into a trie-like structure to prevent redundant backtracking. After the initial prefix of the string is consumed by the “i liek” parser, the remaining chunk is passed into the choice between three strings: “mudkipz”, “parsers”, “pasta”. The parser function at that point in the pipeline has “pasta” as its input, so we can essentially “save” our position within the string at this nexus. We’re passing a literal string around (a linked list of characters), but this can easily be done with a pointer to a packed string literal as well for efficiency. Backtracking is implicit between choices.

Let’s recur now and try to tackle the nested repeating groups. It turns out that the problem is even more annoying, because the molecules can use any kind of bracket. K4[ON{SO3}2]2 is fair game. It doesn’t make that much of a difference really, we just have to make sure that brackets are paired correctly.

choice :: (Alternative f) => [f a] -> f a
choice = foldr (<|>) empty
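At the Maybe level, choice just returns the first success - here’s it in isolation, with firstSuccess standing in for choice specialized to Maybe:

```haskell
import Control.Applicative (Alternative (..))

-- fold the alternatives together; empty is the identity, so an
-- empty list of options simply fails
firstSuccess :: [Maybe Int] -> Maybe Int
firstSuccess = foldr (<|>) empty
```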

-- take an arbitrary parser, returning a version that parses between matching brackets
bracket :: Parse a -> Parse a
bracket p = do
  c <- choice (map char "({[") -- any one opening bracket
  p <* case c of '(' -> char ')'
                 '{' -> char '}'
                 '[' -> char ']' 

<* is like the flipped, Applicative version of >>. It runs both actions, ignoring the second’s result but allowing its effects to propagate. If you find yourself doing something like

betweenParens :: Parse a -> Parse a
betweenParens p = do
  char '('
  res <- p
  char ')'
  pure res

it helps to refactor using these const operators:

betweenParens :: Parse a -> Parse a
betweenParens p = char '(' *> p <* char ')'

I generally prefer (*>) to (>>), because you don’t always need the power of a monad. We have both, largely because of historical reasons (Applicative came around after Monad).

Finally, our core, mutually recursive parsers - a functional group (in the chemical sense) is a molecule/atom within a pair of matching brackets, followed by an optional number, like “(OH)2”:

funcGroup :: Parse [Atoms]
funcGroup = do
  atomCounts <- concat <$> bracket (some molecule) -- parse one or more molecules between brackets, flattened
  n <- num <|> pure 1
  pure $ map ((* n) <$>) atomCounts
    -- multiply every coefficient by the outer multiplier

molecule :: Parse [Atoms]
molecule = do
  parsedAtoms <- some ((pure <$> atoms) <|> funcGroup)
    -- wrap atoms up in a list to match types
  pure $ concat parsedAtoms

parse :: Parse a -> String -> Maybe a
parse p = runParse allConsumed >=> pure . fst -- drop the empty remainder
  where allConsumed = p <* eof -- assert input is entirely consumed after parsing

This is the power of parser combinators - you can build entire grammars out of simple primitives, and write as if every parser succeeds according to your specification. Traversing the input stream and error handling is abstracted away from you by the underlying monad, so your code ends up looking almost identical to your initial grammar.

Oh yeah, we need to get the total counts of each atom:

condense :: [Atoms] -> [Atoms]
condense = Map.toList . Map.fromListWith (+)

parseMolecule :: String -> Maybe [Atoms]
parseMolecule s = condense <$> parse molecule s
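Worth noting: Map.fromListWith combines duplicate keys with the supplied function, and Map.toList hands the entries back in ascending key order, so the condensed counts come out sorted by symbol. A quick isolated sketch (tally is a name of my own):

```haskell
import qualified Data.Map as Map

-- sum up the counts for each repeated key; output is sorted by key
tally :: [(String, Int)] -> [(String, Int)]
tally = Map.toList . Map.fromListWith (+)
```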

Testing Fremy’s salt:

> parseMolecule "K4[ON(SO3)2]2" -- Fremy's salt
Just [("K",4),("N",2),("O",14),("S",4)]

The state-threading pattern is the underlying basis of most parser combinator libraries like Parsec. Parsec allows you to integrate its parser monad into a monad transformer stack, and there’s a lot more flexibility with regards to the type of your input stream and accumulating state parameter. It’s way cleaner than manually writing specific parsers, and exponentially simpler than bison/happy. Turns out, parsing doesn’t actually have to be hard.