I wanted to develop a very small tokenizer, for the purpose of coloring code. I
started with a very naive approch or cuttings lines and words, the way i
probably would do it poorly in other language. As expected the approch is so
poor, that more than half of what i wanted to highlight were not properly
atomized, since words only works on whitespaces boundaries, which means
that keywords or digits with a punctuation mark just afterwards were wrongly
"worded".
I dreadfully though about a simple solution for this problem. 2 options were
there: regexes, or a parser. In a past, i would probably choose the regexes
option, even though regexes are only good up to a certain point, and are very
hard to read after a while. It would certainly still holds in previous language
where parser means to use a different tools (bison for C, ..), with a different
syntax (BNF), with lots of painful mind trick to the tool to make the grammar
sane to the underlying algorithm.
This time is now gone, parsec is so easy to use, integrate and develop that
even regexes are painful.
First i want to have something really simple. something that takes number and symbol
and have them properly labeled. the data class would be something like:
data Atom =
Symbol String
| Number String
| Other Charthe Other class is there to catch all characters that were not identified either as a symbol
or as a number. The first thing is defining the highlevel things such as:
atomize = manyTill (choice [ symbol, number, other ]) eof
it means atomize is a parser that do: until (manyTill) we reach end of file
(eof), choose (choice) between the symbol parser (symbol), or the number parser
(number) or the other parser (other). choice will iterate over the parser list
until one succeed.
let's start by defining the easiest parser: other. this parser is there to
catch any character. it's not suppose to fail, since it accept anything, and return
an other atom. the definition is extremely simple:
other = anyChar >>= return . Char
Then it's time to define the symbol and number parser. because both parsers can fail if
there's not parsing a symbol or a number respectively, you need to wrap the parser in a
try parsec keyword. It tells the parser to save the input and if the parser following
try fails, just rewind the input to where it was saved.
So a symbol is one or more characters that have those specific constraints: the
first character need to be an alpha character a-z and A-Z and we also allow _.
then following characters, if they exists, can be alpha character, underscore
and digits.
symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]
symbol = try $ do
f <- oneOf symbolFirstChar
ending <- many (oneOf symbolChar)
return $ Symbol (f : ending)
and a number is even simpler, it just one or more digits characters.
number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number
that's it, you have a very simple lexer in 12 lines.
data Atom =
Symbol String
| Number String
| Other Char
atomize = manyTill (choice [ symbol, number, other ]) eof
other = anyChar >>= return . Char
number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number
symbol = try $ do
f <- oneOf symbolFirstChar
ending <- many (oneOf symbolChar)
return $ Symbol (f : ending)
symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]At this point you might say, alright but with regex it would probably fit in this amount
of line too, which is true, but I think it misses the point that not only it's as simple
to parse simple tokens like those, it also possible to extends it easily and naturally,
to extremelly complex cases, because the regex state machine is not able to
really understand things that requires a proper parser. for example the following example
mix comment and string. the comment beggining tag appears in the string, so it should not
be taken as a beggining of a comment:
printf("C comment looks like: /*\n");
printf("followed by the closing tag: */\n");it's hard for a regex state machine to do the right thing (unless you want to
painfully craft a regex that takes all those embedded cases). the truth is most
of the time, people that develop regex parser, always forgot the hard cases.