Parsec is a beautiful library, but it’s DIFFICULT. Every time I’m working on a problem that requires an actual parser beyond regex, I die a little inside. I used Parsec for my Scheme interpreter, and it was probably the most frustrating part of the project.
This isn’t the fault of the authors, though. Parsers are complicated beasts, and I definitely didn’t know what I was doing the first time around.
There’s an interesting problem on Codewars – given an arbitrary molecule string, return the count of each constituent atom.
CH3CO2H -- acetic acid
Sounds pretty simple, right?
import Data.Char (isDigit)
import Data.List
import Data.Map qualified as Map
import Text.Read (readMaybe)
import Data.Maybe (fromMaybe, maybeToList)
parseMolec :: String -> [(Char, Int)]
parseMolec [] = []
parseMolec [c] = [(c, 1)]
parseMolec (c:cs) = let (ns, cs') = span isDigit cs
in (c, (fromMaybe 1 . readMaybe) ns) : parseMolec cs'
>>> parseMolec "CH3CO2H"
[('C',1),('H',3),('C',1),('O',2),('H',1)]
That’s it, we’re done. End of post.
Obviously, we’re not done. Even organic chemists deal with atoms beyond C, H, O, N.
>>> parseMolec "Mg"
[('M', 1), ('g', 1)]
To make matters more complicated, chemists are just like programmers – lazy. For polymeric materials with repeating groups, there’s a handy way to express their structure. For example, the repeating unit of PVC is C2H3Cl (on its own, I guess that would just be VC) – because it forms arbitrarily long chains, the polymer is written as (C2H3Cl)n. What a tradeoff – chemists making their lives easier at our expense. So we have to parse things like
CH3(CH2)2CONH2 -- butyramide
It seems like we can treat a group of parentheses as a subexpression on its own, multiplying every number by the coefficient that comes after it.
You can look at this notation as a form of run-length encoding, where you express repeating substructures of data as pairs (s, n). It’s the simplest form of data compression, and it’s easy to encode an arbitrary string. Decoding it is slightly harder, especially when you consider that these expressions can be nested:
K4(ON(SO3)2)2 -- Fremy's salt
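Encoding the flat case really is that easy. Here’s a quick sketch using group from Data.List (the rle helper is just for illustration, it isn’t part of the solution); decoding the nested version is what the rest of this post is about.
import Data.List (group)
-- run-length encode a flat string: "aaabcc" becomes [('a',3),('b',1),('c',2)]
rle :: String -> [(Char, Int)]
rle = map (\run -> (head run, length run)) . group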
Let’s take a step back and write a (roughly) formal grammar for this problem.
atom = uppercase, [lowercase], [number]
molecule = {atom | "(", molecule, ")", [number]}
In EBNF, curly braces denote repetition, and square brackets denote optionals. The regular expression is left as an exercise for the reader (i.e. I don’t want to figure it out.)
We know that these expressions are recursive, so our parser is also going to be recursive. We consume an input string, apply these rules one character at a time, and treat the results of deeper recursive calls as atomic values. We’ll also need to continue off from where the deeper recursive calls left off.
Let’s start off with a simple data type and work from there. We know that a parser is a function that will take in a String, and return some value from consuming that string, along with the remainder of the input. There’s also a chance that our parser might fail, so we’ll reflect this on the type level by wrapping the result in a Maybe. If we were doing something less trivial, we might use Either or ExceptT for more descriptive errors.
newtype Parse a = P { runParse :: String -> Maybe (a, String) }
Our Parse type is just the State monad with the added possibility of failure. As we consume our string, we’re going to thread the remaining input around. Let’s try to turn our parser into a monad so we can do this implicitly.
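As an aside, this shape is exactly what the transformers library calls StateT over Maybe. A sketch for comparison (nothing later depends on it):
import Control.Monad.Trans.State (StateT (..))
-- StateT s m a wraps a function s -> m (a, s),
-- so our Parse a is the same shape as StateT String Maybe a
type Parse' a = StateT String Maybe a
-- runStateT :: Parse' a -> String -> Maybe (a, String)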
instance Functor Parse where
fmap f p = P $ \s -> runParse p s >>= \(a, s') -> Just (f a, s')
The >>= in this case refers to our Maybe context – the \(a, s') function will only be applied if the parser action succeeds. This is where monads really shine, letting you sequence effects in a way that abstracts out the repetition of managing nested contexts.
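Concretely, at the Maybe level (with toUpper from Data.Char standing in for whatever f happens to be):
>>> Just ('h', "ello") >>= \(a, s') -> Just (toUpper a, s')
Just ('H',"ello")
>>> (Nothing :: Maybe (Char, String)) >>= \(a, s') -> Just (toUpper a, s')
Nothing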
Any time you find yourself writing a lambda expression just to pass it as the first parameter of a monadic binding, you can usually abstract that away with a Kleisli fish. I mean arrow.
-- (>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
instance Functor Parse where
fmap f p = P $ runParse p >=> \(a, s') -> Just (f a, s')
But since every Monad is an Applicative, we can basically get all the lower classes for free. Even better.
instance Functor Parse where
  fmap = liftM -- liftM and ap come from Control.Monad
instance Applicative Parse where
  (<*>) = ap
  pure = return
Now we have to actually write some code. We know that in order to qualify our data type as a Monad, it needs to satisfy the monad laws and have lawful implementations of (>>=) and return.

Putting a value into the bare minimum context of a monad shouldn’t have any effects per se. In our case, return should be a parser that doesn’t consume any input, and successfully yields a value regardless.
instance Monad Parse where
  return :: a -> Parse a
  return x = P $ Just . (x, ) -- a tuple section; needs TupleSections (and InstanceSigs for the signature)
  -- return x = P $ \s -> Just (x, s)
The tricky part is figuring out what it means to bind one parser into the next. In order to get sequential evaluation of two parsers, f and g, g should be run on the remaining input from running f. If f fails to parse its input, then g should also fail.
(>>=) :: Parse a -> (a -> Parse b) -> Parse b
p >>= f = P $ runParse p -- apply the function held by p
>=> \(a, s') -> -- if it succeeds, it will return a tuple (value, newState)
let g = f a -- when `f` is applied to `a`, it will yield a value of type `m b`
in runParse g s' -- i.e. (a P holding a function `String -> Maybe (b, String)`)
As usual, we can use arrows to make this a lot cleaner.
-- first and (>>>) come from Control.Arrow
(>>=) :: Parse a -> (a -> Parse b) -> Parse b
p >>= f = P $ runParse p >=> (first f >>> uncurry runParse)
This is a little precarious, though. Parsing is a task that, more often than not, will fail. If a single parser is unable to successfully read a string, we’d like to be able to try an alternative function before giving up. Good thing this is already implemented for us. The Alternative instance for Maybe looks something like
instance Alternative Maybe where
empty = Nothing
Nothing <|> b = b
Just a <|> _ = Just a
Where we only need to evaluate b if a is empty. You can think of <|> as || lifted to effectful types. If you squint hard enough, it looks like a stickbug. We’ll call it that from here on out, but it’s more like a choice operator if anything. In our case, if one parser succeeds, we don’t need to try the other one. We can take advantage of the fact that functions are Applicative, and use liftA2 to write the equivalent of \s -> runParse a s <|> runParse b s.
instance Alternative Parse where
empty = P (const Nothing)
a <|> b = P $ liftA2 (<|>) (runParse a) (runParse b)
Now that we have a rudimentary state machine, we can write the basis for open-ended parsers. We only really need 3 simple primitives:
-- we need to be able to consume a value from our input
anyChar :: Parse Char
anyChar = P $ \case [] -> Nothing -- \case needs the LambdaCase extension
                    (x:xs) -> Just (x, xs)
-- we need to verify that we've parsed everything
eof :: Parse ()
eof = P $ \case [] -> Just ((), [])
_ -> Nothing
-- we need to be able to conditionally fail
cond :: (a -> Bool) -> Parse a -> Parse a
cond test parser = do
res <- parser
if test res
then pure res
else empty
Let’s try this out on some simple inputs now:
>>> runParse anyChar "hello"
Just ('h', "ello")
>>> runParse ( cond (\c -> c == 'h') anyChar ) "xello"
Nothing
A parser that parses unconditionally is boring, and a parser that only parses a single character is useless. Since we’ve proven that our parser is a Monad, Applicative, Functor, and Alternative, we get a ton of functions for free.
We’ll start off by defining something to parse a specific character (we already have the pieces):
char :: Char -> Parse Char
char c = cond (== c) anyChar
then some more general selectors:
-- isUpper and isLower come from Data.Char as well
upper, lower, digit :: Parse Char
upper = cond isUpper anyChar
lower = cond isLower anyChar
digit = cond isDigit anyChar
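One more building block: some of the examples below use a string parser that matches an exact literal, and it never gets spelled out in this post. Since Parse is an Applicative, a minimal sketch is to traverse char over the literal:
-- match an exact literal, character by character
string :: String -> Parse String
string = traverse char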
Now we have everything we need to parse a simple atom (again). Control.Applicative exports some really useful functions for Alternative types, like optional, some, and many, that make our lives a LOT easier. You can think of these functions as generalized versions of the regex quantifiers ?, +, and *. Instead of transforming the match behaviour of a single pattern, they’ll apply an Alternative action zero or one times, one or more times, or zero or more times. You can implement a regex engine using parser combinators and these operators pretty trivially, since monadic parsers are Turing-complete.
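Here’s how optional, some, and many behave with the selectors we just wrote:
>>> runParse (optional digit) "abc"
Just (Nothing,"abc")
>>> runParse (some digit) "abc"
Nothing
>>> runParse (many digit) "123abc"
Just ("123","abc")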
If you’re savvy, you might recognize some and many as the Kleene plus and Kleene star, respectively. When we apply a specific parser, all we’re doing is providing a set of strings that might be matched:
foo :: Parse String
foo = string "foo" <|> string "fooo"
-- /fooo?/
charO :: Parse Char
charO = char 'o'
-- /o/
Transforming a parser with these operators yields a much larger set of matches:
foooo :: Parse String
foooo = do
f <- char 'f'
o <- some (char 'o')
pure $ (f : o)
-- /fo+/
foooo will now match a countably infinite set of strings, in the form:
["fo", "foo", "fooo", "foooo", ...]
Formally, the Kleene star of a set is the union of all of its finite concatenation powers: every string you can build by concatenating zero or more of its elements, including the empty string (the set’s identity):
lowers :: [Char]
lowers = ['a' .. 'z']
lowers0 :: [String]
lowers0 = [""]
lowers1 :: [String]
lowers1 = lowers0 <> [[c] | c <- lowers] -- include empty string
lowers2 :: [String]
lowers2 = lowers1 <> [(c:c1) | c <- lowers, c1 <- lowers1]
-- ["", "a", "b", ... "aa", "ab", "ac" ...]
lowers3 :: [String]
lowers3 = lowers2 <> [(c:c2) | c <- lowers, c2 <- lowers2]
-- ["", "a", ..., "aa", ... "aaa", "aab", ...]
...
lowersN :: Int -> [String]
lowersN n = [0..n] >>= (\x -> sequence $ replicate x lowers)
-- [sequence $ replicate x "abcde" | x <- [0..n]]
which is the union of all the n-ary concatenation powers of the set for n <- 1.., plus the set’s identity element. The Kleene plus just omits the identity element. With this distinction, the Kleene star forms a Monoid, while the Kleene plus is only a Semigroup. In fact, Kleene stars construct a free monoid over their inputs.
This is where regular expressions are truly beautiful – when you write a regex, you’re specifying exactly what should be matched, with regards to the set of all inputs the resulting automaton will succeed on. You could rewrite /foo*/ as /(fo|foo|fooo|foooo|...)/, but that isn’t feasible for people with finite lifetimes. Having these operators on hand allows you to arbitrarily expand the set of accepted inputs for your parser.
In the Alternative context, these operators will run an action repeatedly until it fails. If the action can succeed without consuming any input (say, because you aren’t threading the state parameter along), your program will enter an infinite loop. Just something to keep in mind.
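The classic trap looks like this (defining it is fine, evaluating it is not):
-- pure 'x' always succeeds without consuming anything,
-- so many keeps succeeding forever and never terminates
loops :: Maybe (String, String)
loops = runParse (many (pure 'x')) "abc"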
The Alternative class is similar to Monoid – the stickbug operator <|> is an associative binary function between two Alternatives, and empty is an identity element such that:
(a <|> b) <|> c = a <|> (b <|> c)
empty <|> empty = empty
empty <|> alt = alt
alt <|> empty = alt
alt1 <|> alt2 = alt1 -- when alt1 succeeds
Where these types differ is that joining Alternatives is biased towards the left argument – it’s a choice, rather than a composition.
>>> Nothing <> Nothing
Nothing
>>> Nothing <|> Nothing
Nothing
>>> Nothing <> Just "i"
Just "i"
>>> Nothing <|> Just "i"
Just "i"
>>> Just "H" <> Nothing
Just "H"
>>> Just "H" <|> Nothing
Just "H"
>>> Just "H" <> Just "i"
Just "Hi"
>>> Just "H" <|> Just "i"
Just "H"
Having gained the ability of choice, we also have optionals now – optional is a function that takes an Alternative functor, runs its action, and returns a Maybe value inside the Alternative instead of returning empty at the root context. This has the advantage of always succeeding, so we can treat a failure as a slight speedbump instead of an error that has to be propagated forwards. You can also just use <|>:
-- H, O, Mg, Ne, etc.
symbol :: Parse String
symbol = do
up <- upper -- single character
low <- maybeToList <$> optional lower -- either [] or a lowercase
pure (up : low)
num :: Parse Int
num = read <$> some digit -- will only parse 1 or more digits, so we don't *have* to use readMaybe
type Atoms = (String, Int)
-- "H2" -> ("H", 2), etc.
atoms :: Parse Atoms
atoms = do
sym <- symbol
n <- num <|> pure 1 -- no annotation = single atom
pure (sym, n)
Let’s try this out now:
>>> runParse atoms "MgOH2"
Just (("Mg",1),"OH2")
>>> runParse atoms "H2O2"
Just (("H", 2), "O2")
>>> runParse atoms "Hi2You!!!"
Just (("Hi", 2), "You!!!")
>>> runParse (many atoms) "HaHaUNLeSS10"
Just ([("Ha",1),("Ha",1),("U",1),("N",1),("Le",1),("S",1),("S",10)],"")
Nice. We can parse simple atoms now, and sequences of them at that. We have functions that will incrementally consume streams of input, giving up without munching anything if they fail. Let’s take a deeper look at what’s happening – if we had a parser like
iLike :: Parse String
iLike = string "i liek "
>> string "mudkipz" <|> string "parsers" <|> string "pasta"
-- /i liek (mudkipz|parsers|pasta)/
Your parsing process internally would look a little like this:
input = "i liek penguin"
Pipeline:
input => [string "i liek "] ---> [string "mudkipz" <|> string "parsers" <|> string "pasta"] => output
^ ^
| |
"i liek pasta" |
"pasta"
Zooming in:
[string "mudkipz"] =
[char 'm'] -> [char 'u'] -> [char 'd'] -> [char 'k'] -> [char 'i'] -> [char 'p'] -> [char 'z']
| FAIL -> Nothing
"pasta"
[string "parsers"] =
[char 'p'] -> [char 'a'] -> [char 'r'] -> [char 's'] -> [char 'e'] -> [char 'r'] -> [char 's']
| OK | |
"pasta" | OK |
"asta" | FAIL -> Nothing
"sta"
[string "pasta"] =
[char 'p'] -> [char 'a'] -> [char 's'] -> [char 't'] -> [char 'a']
| OK | | | |
"pasta" | OK | | |
"asta" | OK | |
"sta" | OK |
"ta" | OK -> ("pasta", "")
"a"
If you were writing a compiler from a regular expression down to a DFA, you might want to combine common paths into a trie-like structure to prevent redundant backtracking. After the initial prefix of the string is consumed by the “i liek” parser, the remaining chunk is passed into the choice between three strings: “mudkipz”, “parsers”, “pasta”. The parser function at that point in the pipeline has “pasta” as its input, so we can essentially “save” our position within the string at this nexus. We’re passing a literal string around (a linked list of characters), but this can easily be done with a pointer to a packed string literal as well for efficiency. Backtracking is implicit between choices.
Let’s recur now and try to tackle the nested repeating groups. It turns out that the problem is even more annoying, because the molecules can use any kind of bracket – K4[ON{SO3}2]2 is fair game. It doesn’t make that much of a difference really, we just have to make sure that brackets are paired correctly.
choice :: (Alternative f) => [f a] -> f a
choice = foldr (<|>) empty
-- take an arbitrary parser, returning a version that parses between matching brackets
bracket :: Parse a -> Parse a
bracket p = do
  c <- choice (map char "({[") -- choice works over parsers, so wrap each opener with char
  p <* case c of '(' -> char ')'
                 '{' -> char '}'
                 '[' -> char ']'
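A quick check that the brackets really have to match:
>>> runParse (bracket symbol) "[Mg]2"
Just ("Mg","2")
>>> runParse (bracket symbol) "[Mg)2"
Nothing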
<* is like the Applicative version of >>, flipped so that it keeps the first result instead of the second. It runs both actions in order, ignoring the second’s result but allowing its effects to propagate. If you find yourself doing something like
betweenParens :: Parse a -> Parse a
betweenParens p = do
char '('
res <- p
char ')'
pure res
it helps to refactor using these const operators:
betweenParens :: Parse a -> Parse a
betweenParens p = char '(' *> p <* char ')'
I generally prefer (*>) to (>>), because you don’t always need the power of a monad. We have both, largely because of historical reasons (Applicative came around after Monad).
Finally, our core, mutually recursive parsers – a functional group (in the chemical sense) is a molecule/atom within a pair of matching brackets, followed by an optional number, like “(OH)2”:
funcGroup :: Parse [Atoms]
funcGroup = do
  atomCounts <- concat <$> bracket (some molecule) -- parse one or more molecules between matching brackets, flattened
  n <- num <|> pure 1
  pure $ map ((* n) <$>) atomCounts
  -- multiply every coefficient by the outer multiplier (fmap on a pair maps over its second element)
molecule :: Parse [Atoms]
molecule = do
parsedAtoms <- some ((pure <$> atoms) <|> funcGroup)
-- wrap atoms up in a list to match types
pure $ concat parsedAtoms
parse :: Parse a -> String -> Maybe a
parse p = runParse allConsumed >=> pure . fst -- drop the empty remainder
where allConsumed = p <* eof -- assert input is entirely consumed after parsing
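A quick sanity check on the PVC repeating unit from earlier:
>>> parse molecule "C2H3Cl"
Just [("C",2),("H",3),("Cl",1)]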
This is the power of parser combinators – you can build entire grammars out of simple primitives, and write as if every parser succeeds according to your specification. Traversing the input stream and error handling is abstracted away from you by the underlying monad, so your code ends up looking almost identical to your initial grammar.
Oh yeah, we need to get the total counts of each atom:
condense :: [Atoms] -> [Atoms]
condense = Map.toList . Map.fromListWith (+)
parseMolecule :: String -> Maybe [Atoms]
parseMolecule s = condense <$> parse molecule s
Testing Fremy’s salt:
>>> parseMolecule "K4[ON(SO3)2]2" -- Fremy's salt
Just [("K",4),("N",2),("O",14),("S",4)]
The state-threading pattern is the underlying basis of most parser combinator libraries like Parsec. Parsec allows you to integrate its parser monad into a monad transformer stack, and there’s a lot more flexibility with regards to the type of your input stream and accumulating state parameter. It’s way cleaner than manually writing specific parsers, and exponentially simpler than bison/happy. Turns out, parsing doesn’t actually have to be hard.