I wanted to develop a very small tokenizer, for the purpose of coloring code. I started with a very naive approch or cuttings lines and words, the way i probably would do it poorly in other language. As expected the approch is so poor, that more than half of what i wanted to highlight were not properly atomized, since words only works on whitespaces boundaries, which means that keywords or digits with a punctuation mark just afterwards were wrongly "worded".

I dreadfully though about a simple solution for this problem. 2 options were there: regexes, or a parser. In a past, i would probably choose the regexes option, even though regexes are only good up to a certain point, and are very hard to read after a while. It would certainly still holds in previous language where parser means to use a different tools (bison for C, ..), with a different syntax (BNF), with lots of painful mind trick to the tool to make the grammar sane to the underlying algorithm.

This time is now gone, parsec is so easy to use, integrate and develop that even regexes are painful.

First i want to have something really simple. something that takes number and symbol and have them properly labeled. the data class would be something like:

data Atom =
          Symbol String
        | Number String
        | Other Char

the Other class is there to catch all characters that were not identified either as a symbol or as a number. The first thing is defining the highlevel things such as:

atomize = manyTill (choice [ symbol, number, other ]) eof

it means atomize is a parser that do: until (manyTill) we reach end of file (eof), choose (choice) between the symbol parser (symbol), or the number parser (number) or the other parser (other). choice will iterate over the parser list until one succeed.

let's start by defining the easiest parser: other. this parser is there to catch any character. it's not suppose to fail, since it accept anything, and return an other atom. the definition is extremely simple:

other = anyChar >>= return . Char

Then it's time to define the symbol and number parser. because both parsers can fail if there's not parsing a symbol or a number respectively, you need to wrap the parser in a try parsec keyword. It tells the parser to save the input and if the parser following try fails, just rewind the input to where it was saved.

So a symbol is one or more characters that have those specific constraints: the first character need to be an alpha character a-z and A-Z and we also allow _. then following characters, if they exists, can be alpha character, underscore and digits.

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

and a number is even simpler, it just one or more digits characters.

number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number

that's it, you have a very simple lexer in 12 lines.

data Atom =
          Symbol String
        | Number String
        | Other Char
atomize = manyTill (choice [ symbol, number, other ]) eof

other = anyChar >>= return . Char
number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number
symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

At this point you might say, alright but with regex it would probably fit in this amount of line too, which is true, but I think it misses the point that not only it's as simple to parse simple tokens like those, it also possible to extends it easily and naturally, to extremelly complex cases, because the regex state machine is not able to really understand things that requires a proper parser. for example the following example mix comment and string. the comment beggining tag appears in the string, so it should not be taken as a beggining of a comment:

	printf("C comment looks like: /*\n");
	printf("followed by the closing tag: */\n");

it's hard for a regex state machine to do the right thing (unless you want to painfully craft a regex that takes all those embedded cases). the truth is most of the time, people that develop regex parser, always forgot the hard cases.

I've been looking at increasing performance of haskell based software lately, more precisely of the blog. Here's a small benchmark done related to different changes i introduced to the codebase on the main page.

no cache, no bytestring: 94mb memory, 110ms rendering time

no cache, bytestring: 35mb memory, 50ms rendering time

cached, no bytestring: 30mb memory, 35ms rendering time

cached, bytestring: 450kb memory, 10ms rendering time

As you can see introducing bytestring, is a major performance enhancer compare to the unpacked and inefficient haskell String. I'm sure there's more that can be squeezed, since i have to go back to normal string back and forth at 2 differents place; parsec 2 doesn't seems to support bytestring, hstringtemplate cannot instanciate a stringtemplate from a bytestring) next step is to get rid of the unpacking/packing and see which optimisation can be done: need to find howto to do haskell profiling.