write you a parser for okay acceptable!
Written: 2021-01-05 to 2021-02-26by Naomi Liu
Parsec is a beautiful library, but it’s DIFFICULT. Every time I’m working on a problem that requires an actual parser beyond regex, I die a little inside. I used Parsec for my Scheme interpreter, and it was probably the most frustrating part of the project.
This isn’t the fault of the authors, though. Parsers are complicated beasts, and I definitely didn’t know what I was doing the first time around.
There’s an interesting problem on Codewars - given an arbitrary molecule string, return the count of each constituent atom.
CH3CO2H -- acetate
Sounds pretty simple, right?
import Data.Char (isDigit)
import Data.List
import Data.Map qualified as Map
import Text.Read (readMaybe)
import Data.Maybe (fromMaybe, maybeToList)
parseMolec :: String -> [(Char, Int)]
= []
parseMolec [] = [(c, 1)]
parseMolec [c] :cs) = let (ns, cs') = span isDigit cs
parseMolec (cin (c, (fromMaybe 1 . readMaybe) ns) : parseMolec cs'
>>> parseMolec "CH3CO2H"
'C',1),('H',3),('C',1),('O',2),('H',1)] [(
That’s it, we’re done. End of post.
Obviously, we’re not done. Even organic chemists deal with atoms beyond C, H, O, N.
>>> parseMolec "Mg"
'M', 1), ('g', 1)] [(
To make matters more complicated, chemists are just like programmers
- lazy. For polymeric materials with repeating groups, there’s a handy
way to express their structure. For example, the formula for PVC is
C2H3Cl
. I guess that would just be VC -
because it forms arbitrarily long chains, it’s written as
(C2H3Cl)n
. What a tradeoff - chemists making their lives
easier at our expense. If we had to parse something like
CH3(CH2)2CONH2 -- butyramide
It seems like we can treat a group of parentheses as a subexpression on its own, multiplying every number by the coefficient that comes after it.
You can look at this notation as a form of run-length encoding, where you express repeating substructures of data as pairs (s, n). It’s the simplest form of data compression, and it’s easy to encode an arbitrary string. Decoding it is slightly harder, especially when you consider that these expressions can be nested:
K4(ON(SO3)2)2 -- Fremy's salt
Let’s take a step back and write a (roughly) formal grammar for this problem.
= uppercase, [lowercase], [number]
atom = {atom | "(", molecule, ")", [number]} molecule
In EBNF, curly braces denote repetition, and square brackets denote optionals. The regular expression is left as an exercise for the reader (i.e. I don’t want to figure it out.)
We know that these expressions are recursive, so our parser is also going to be recursive. We consume an input string, apply these rules one character at a time, and treat the results of deeper recursive calls as atomic values. We’ll also need to continue off from where the deeper recursive calls left off.
Let’s start off with a simple data type and work from there. We know
that a parser is a function that will take in a String
, and return some value from consuming
that string, along with the remainder of the input. There’s also a
chance that our parser might fail, so we’ll reflect this on the type
level by wrapping the result in a Maybe
. If we were doing something less
trivial, we might use Either
or ExceptT
for more descriptive errors.
newtype Parse a = P { runParse :: String -> Maybe (a, String) }
Our Parse
r type is just the State
monad with the added possibility of
failure. As we consume our string, we’re going to thread around the
input as we consume it. Let’s try to turn our parser into a monad so we
can do this implicitly.
instance Functor Parse where
fmap f p = P $ \s -> runParse p s >>= \(a, s') -> Just (f a, s')
The >>=
in this case refers to
our Maybe
context - the
\(a, s')
function will only be applied if the parser action
succeeds. This is where monads really shine, letting you sequence
effects in a way that abstracts out the repetition of managing nested
contexts.
Any time you find yourself writing a lambda expression just to pass it as the first parameter of a monadic binding, you can usually abstract that away with a Kleisli fish. I mean arrow.
-- (>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
instance Functor Parse where
fmap f p = P $ runParse p >=> \(a, s') -> Just (f a, s')
But: since every Monad
is an Applicative
, we can basically get all the
lower classes for free. Even better.
instance Functor Parse where
fmap = liftM
instance Applicative Parse where
<*>) = ap
(pure = return
Now we have to actually write some code. We know that in order to
qualify our data type as a Monad, it needs to fulfill the monadic laws
and have a lawful implementation of (>>=)
and return
.
To put a value in the bare minimum context of a monad it shouldn’t
have any effects per se. In our case, return
should be a
parser that doesn’t consume any input, and successfully yields a value
regardless.
instance Monad Parse where
return :: a -> Parse a
return x = P $ Just . (x , )
-- return x = P $ \s -> Just (x, s)
The tricky part is figuring out what it means to bind
a
parser into the next. In order to get sequential evaluation of two
parsers, f
and g
, g
should be run
on the remaining input from running f
. If f
fails to parse its input, then g
should also fail.
(>>=) :: Parse a -> (a -> Parse b) -> Parse b
>>= f = P $ runParse p -- apply the function held by p
p >=> \(a, s') -> -- if it succeeds, it will return a tuple (value, newState)
let g = f a -- when `f` is applied to `a`, it will yield a value of type `m b`
in runParse g s' -- i.e. (a P holding a function `String -> Maybe (b, String)`)
As usual, we can use arrows to make this a lot cleaner.
(>>=) :: Parse a -> (a -> Parse b) -> Parse b
>>= f = P $ runParse p >=> (first f >>> uncurry runParse) p
This is a little precarious, though. Parsing is a task, that more
often that not, will fail. If a single parser is unable to successfully
read a string, we’d like to be able to try an alternative function
before giving up. Good thing this is already implemented for us. The
Alternative
instance for Maybe looks
something like
instance Alternative Maybe where
= Nothing
empty Nothing <|> b = b
Just a <|> _ = Just a
Where we only need to evaluate b
if a
is
empty
. You can think of <|>
as ||
lifted to effectful types. If you squint hard enough, it looks like a
stickbug. We’ll call it that from here on out, but it’s more like a
choice
operator if anything. In our case, if one parser
succeeds, we don’t need to try the other one. We can take advantage of
the fact that functions are Applicative
, and use liftA2
to
write the equivalent of
\s -> runParse a s <|> runParse b s
.
instance Alternative Parse where
= P (const Nothing)
empty <|> b = P $ liftA2 (<|>) (runParse a) (runParse b) a
Now that we have a rudimentary state machine, we can write the basis for open-ended parsers. We only really need 3 simple primitives:
-- we need to be able to consume a value from our input
anyChar :: Parse Char
= P $ \case [] -> Nothing
anyChar :xs) -> Just (x, xs)
(x
-- we need to verify that we've parsed everything
eof :: Parse ()
= P $ \case [] -> Just ((), [])
eof -> Nothing
_
-- we need to be able to conditionally fail
cond :: (a -> Bool) -> Parse a -> Parse a
= do
cond test parser <- parser
res if test res
then pure res
else empty
Let’s try this out on some simple inputs now:
>>> runParse anyChar "hello"
Just ('h', "ello")
>>> runParse ( cond (\c -> c == 'h') anyChar ) "xello"
Nothing
A parser that parses unconditionally is boring, and a parser that
only parses a single character is useless. Since we’ve proven that our
parser is a Monad
, Applicative
, Functor
, and Alternative
, we get a ton of functions for
free.
We’ll start off by defining something to parse a specific character (we already have):
char :: Char -> Parser Char
= cond (== c) anyChar char c
then some more general selectors:
digit :: Parse Char
upper, lower,= cond isUpper anyChar
upper = cond isLower anyChar
lower = cond isDigit anyChar digit
Now we have everything we need to parse a simple atom (again). Control.Applicative
exports some really useful
functions for Alternative
types, like
optional
, some
, and many
that
makes our lives a LOT easier. You can think of these functions as
generalized versions of the regex qualifiers ?
, +
, and *
. Instead of transforming the match behaviour
of a single pattern, they’ll apply an Alternative
action zero or one times, one or
more times, or an unlimited amount. You can implement a regex engine
using parser combinators and these operators pretty trivially, since
monadic parsers are Turing-complete.
If you’re savvy, you might recognize some
and
many
as the Kleene plus and Kleene star, respectively. When
we apply a specific parser, all we’re doing is providing a set of
strings that might be matched:
foo :: Parse String
= string "foo" <|> string "fooo"
foo -- /foo?/
charO :: Parse Char
= char 'o'
charO -- /o/
Transforming a parser with these operators yields a much larger set of matches:
foooo :: Parse String
= do
foooo <- char 'f'
f <- some (char 'o')
o pure $ (f : o)
-- /fo+/
foooo
will now match a countably infinite set of
strings, in the form:
"fo", "foo", "fooo", "foooo", ...] [
Formally, the Kleene star represents the repeated concatenation of a set with the result of cross-concatenating each of its elements together, yielding a new set (including the set’s identity):
lowers :: [Char]
= ['a' .. 'z']
lowers
lowers0 :: [String]
= [""]
lowers0
lowers1 :: [String]
= lowers0 <> [[c] | c <- lowers] -- include empty string
lowers1
lowers2 :: [String]
= lowers1 <> [(c:c1) | c <- lowers, c1 <- lowers1]
lowers2 -- ["", "a", "b", ... "aa", "ab", "ac" ...]
lowers3 :: [String]
= lowers2 <> [(c:c2) | c <- lowers, c2 <- lowers2]
lowers3 -- ["", "a", ..., "aa", ... "aaa", "aab", ...]
...
lowersN :: Int -> [String]
= [0..n] >>= (\x -> sequence $ replicate x lowers)
lowersN n -- [sequence $ replicate x "abcde" | x <- [0..n]]
which is isomorphic to the set of all the n-ary Cartesian powers of
the set for n <- 1..
, plus the set’s identity element.
The Kleene plus just omits the identity element. With this distinction,
the Kleene star forms a Monoid
, while
the Kleene plus is only a Semigroup
. In
fact, Kleene stars construct a free
monoid over their inputs.
This is where regular expressions are truly beautiful - when you
write a regex, you’re specifying exactly what should be
matched, with regards to the set of all inputs the resulting automaton
will succeed on. You could rewrite /foo*/
as
/(fo|foo|fooo|foooo ...)
, but that isn’t feasible for
people with finite lifetimes. Having these operators on hand allows you
to arbitrarily expand the set of accepted inputs for your parser.
In the Alternative
context, these
operators will run an action repeatedly until they yield empty - if you
aren’t passing a state parameter around, your program will likely enter
an infinite loop. Just something to keep in mind.
The Alternative
class is similar to
Monoid
- the stickbug operator <|>
is an associative binary function
between two Alternative
s, and
empty
is an identity element such that:
<|> b) <|> c = a <|> (b <|> c)
(a
<|> empty = empty
empty <|> alt = alt
empty <|> empty = alt
alt <|> alt2 = alt1 alt1
Where these types differ is that joining Alternative
s is biased towards the left
argument - it’s a choice, rather than a
composition.
>>> Nothing <> Nothing
Nothing
>>> Nothing <|> Nothing
Nothing
>>> Nothing <> Just "i"
Just "i"
>>> Nothing <|> Just "i"
Just "i"
>>> Just "H" <> Nothing
Just "H"
>>> Just "H" <|> Nothing
Just "H"
>>> Just "H" <> Just "i"
Just "Hi"
>>> Just "H" <|> Just "i"
Just "H"
Having gained the ability of choice, we also have
optional
s now - optional
is a function that
takes an Alternative
functor, runs its
action, and returns a Maybe
value
inside the Alternative
instead of
returning empty
at the root context. This has the advantage
of always succeeding, so we can treat a failure as a slight speedbump
instead of an error that has to be propagated forwards. You can also
just use <|>
:
-- H, O, Mg, Ne, etc.
symbol :: Parse String
= do
symbol <- upper -- single character
up <- maybeToList <$> optional lower -- either [] or a lowercase
low pure (up : low)
num :: Parse Int
= read <$> some digit -- will only parse 1 or more digits, so we don't *have* to use readMaybe
num
type Atoms = (String, Int)
-- "H2" -> ("H", 2), etc.
atoms :: Parse Atoms
= do
atoms <- symbol
sym <- num <|> pure 1 -- no annotation = single atom
n pure (sym, n)
Let’s try this out now:
>>> runParse atoms "MgOH2"
Just ("Mg", "H2O4")
>>> runParse atoms "H2O2"
Just (("H", 2), "O2")
>>> runParse atoms "Hi2You!!!"
Just (("Hi", 2), "You!!!")
>>> runParse (many atoms) "HaHaUNLeSS10"
Just ([("Ha",1),("Ha",1),("U",1),("N",1),("Le",1),("S",1),("S",10)],"")
Nice. We can parse simple atoms now, and sequences of them at that. We have functions that will incrementally consume streams of input, giving up without munching anything if they fail. Let’s take a deeper look at what’s happening - if we had a parser like
iLike :: Parse String
= string "i liek "
iLike >> string "mudkipz" <|> string "parsers" <|> string "pasta"
-- /i liek (mudkipz|parsers|pasta)/
Your parsing process internally would look a little like this:
= "i liek penguin"
input
Pipeline:=> [string "i liek "] ---> [string "mudkipz" <|> string "parsers" <|> string "pasta"] => output
input
=> [string "i liek "] ---> [string "mudkipz" <|> string "parsers" <|> string "pasta"] => output
input
^ ^
| |"i liek pasta" |
"pasta"
Zooming in:string "mudkipz"] =
[char 'm'] -> [char 'u'] -> [char 'd'] -> [char 'k'] -> [char 'i'] -> [char 'p'] -> [char 'z']
[
| FAIL -> Nothing"pasta"
string "parsers"] =
[char 'p'] -> [char 'a'] -> [char 'r'] -> [char 's'] -> [char 'e'] -> [char 'r'] -> [char 's']
[
| OK | |"pasta" | OK |
"asta" | FAIL -> Nothing
"sta"
string "pasta"] =
[char 'p'] -> [char 'a'] -> [char 's'] -> [char 't'] -> [char 'a']
[
| OK | | | | "pasta" | OK | | |
"asta" | OK | |
"sta" | OK |
"ta" | OK -> ("pasta", "")
"a"
If you were writing a compiler from a regular expression down to a DFA, you might want to combine common paths into a trie-like structure to prevent redundant backtracking. After the initial prefix of the string is consumed by the “i liek” parser, the remaining chunk is passed into the choice between three strings: “mudkipz”, “parsers”, “pasta”. The parser function at that point in the pipeline has “pasta” as its input, so we can essentially “save” our position within the string at this nexus. We’re passing a literal string around (a linked list of characters), but this can easily be done with a pointer to a packed string literal as well for efficiency. Backtracking is implicit between choices.
Let’s recur now and try to tackle the nested repeating groups. It
turns out that the problem is even more annoying, because the molecules
can use any kind of bracket. K4[ON{SO3}2]2
is fair game. It doesn’t make
that much of a difference really, we just have to make sure that
brackets are paired correctly.
choice :: (Alternative f) => [a] -> f a
= foldr (<|>) empty
choice
-- take an arbitrary parser, returning a version that parses between matching brackets
bracket :: Parse a -> Parse a
= do
bracket p <- choice "({["
c <* case c of '(' -> char ')'
p '{' -> char '}'
'[' -> char ']'
<*
is like the flipped, Applicative
version of >>
. It runs both actions, ignoring the
second’s result but allowing its effects to propagate. If you find
yourself doing something like
betweenParens :: Parse a -> Parse a
= do
betweenParens p '('
char <- p
res ')'
char pure res
it helps to refactor using these const
operators:
betweenParens :: Parse a -> Parse a
= char '(' *> p <* char ')' betweenParens p
I generally prefer (*>)
to (>>)
, because you don’t always
need the power of a monad. We have both, largely because of historical
reasons (Applicative
came around after
Monad
).
Finally, our core, mutually recursive parsers - a functional group (in the chemical sense) is a molecule/atom within a pair of matching brackets, followed by an optional number, like “(OH)2”:
funcGroup :: Parse [Atoms]
= do
funcGroup <- concat $ bracket (some molecule) -- wrap the act of parsing one or more
atomCounts <- num <|> pure 1
n pure $ map ((* n) <$>) atomCounts
-- multiply every coefficient by the outer multiplier
molecule :: Parse [Atoms]
= do
molecule <- some ((pure <$> atoms) <|> funcGroup)
parsedAtoms -- wrap atoms up in a list to match types
pure $ concat parsedAtoms
parse :: Parse a -> String -> Maybe a
= runParse allConsumed >=> pure . fst -- drop the empty remainder
parse p where allConsumed = p <* eof -- assert input is entirely consumed after parsing
This is the power of parser combinators - you can build entire grammars out of simple primitives, and write as if every parser succeeds according to your specification. Traversing the input stream and error handling is abstracted away from you by the underlying monad, so your code ends up looking almost identical to your initial grammar.
Oh yeah, we need to get the total counts of each atom:
condense :: [Atoms] -> [Atoms]
= toList . Map.fromListWith (+)
condense
parseMolecule :: String -> Maybe [Atoms]
= condense <$> parse molecule s parseMolecule s
Testing Fremy’s salt:
> parseMolecule "K4[ON(SO3)2]2" -- Fremy's salt
Just [("K",4),("O",14),("N",2),("S",4)]
The state-threading pattern is the underlying basis of most parser
combinator libraries like Parsec. Parsec allows you to integrate its
parser monad into a monad transformer stack, and there’s a lot more
flexibility with regards to the type of your input stream and
accumulating state parameter. It’s way cleaner than manually writing
specific parsers, and exponentially simpler than
bison
/happy
. Turns out, parsing doesn’t
actually have to be hard.