After a few hours of code improvement, the blog is now in a much better state. The biggest things missing are the RSS feeds and the wiki text engine, but apart from those, this is shaping up to be a blog that I can finally use.

Given the hundreds of existing codebases for blog software, one might wonder why I needed to write yet another one, or whether it's just a case of NIH syndrome. There are at least 2 features that were necessary before I could actually blog productively:

  • just text files: this is the biggest feature for me, since I want to be able to write my blog posts in vim or a curses interface on local files, then synchronize the whole lot to the server that displays them to the world. I also believe relational databases are not only way overkill for basic usage like a personal website, but also much slower [1] than a custom solution.
  • completely written in Haskell: I'm more and more careful about what is serving things to the outside world, and that's why I'm phasing out all dynamic languages (python, perl, ruby, php), which need to be thoroughly tested and re-tested after every modification. Not only that, but those languages are orders of magnitude slower than Haskell. I'm also still learning Haskell, so this is a really good test bed for me.

Haskell is absolutely amazing at everything I've tried so far. Embedded parsing is a delight thanks to parsec, the static typing increases productivity massively, and the rich libraries and expressiveness make all this software design easy to shape.

This blog software is also one of the building bricks for something bigger.

[1] Whilst it's hard to judge the whole performance issue, I think the main issue with the available databases is the layers they add on top of the filesystem, plus the SQL engine.

Configuring the awesome window manager is not as easy as it should be: most of the time, you have to dig quite deep to find examples or explanations on how to do things, which is not really friendly to casual users. That said, it's usually worth it, since it improves (my) efficiency.

Nonetheless, here is my snippet to add bindings for the multimedia keys usually found on keyboards these days.

In the key bindings section, you need to add something along the lines of:

-- Multimedia keys
awful.key({}, "XF86AudioLowerVolume", function () awful.util.spawn("amixer -q sset pcm 2dB-") end),
awful.key({}, "XF86AudioRaiseVolume", function () awful.util.spawn("amixer -q sset pcm 2dB+") end),
awful.key({}, "XF86HomePage", function () awful.util.spawn(browser) end),
awful.key({}, "XF86Mail", function () awful.util.spawn(mailreader) end),
awful.key({}, "XF86Calculator", function () awful.util.spawn("gnome-calculator") end),
awful.key({}, "XF86AudioPlay", function () awful.util.spawn("rhythmbox-client --play", false) end),
awful.key({}, "XF86Back", function () awful.util.spawn("rhythmbox-client --previous") end),
awful.key({}, "XF86Forward", function () awful.util.spawn("rhythmbox-client --next") end),
awful.key({}, "XF86AudioMute", function () awful.util.spawn("amixer -q sset Master toggle") end),
awful.key({}, "XF86Favorites", function () awful.util.spawn(terminal) end),

There's a , at the end of the block, since I've got other key bindings after this block, but if that's your last block, don't forget to remove it. There are also the following keys that I don't use: XF86Media, XF86Eject, XF86AudioPause.

Most of the articles about parsers never talk about the data feeding API. They're only about what the parser is doing inside (the type of grammar, LR, the algorithm, ..).

Unfortunately, lots of things that do parsing only offer their users the choice of parsing a file, as in:

	object *parse_file(const char *filename);
	object *parse_file(FILE *file);

What if your source is available in memory only? There's no easy way to feed the data to this API, short of dumping everything to disk and reading it back, or creating a pipe to yourself. Pretty convoluted.

Sometimes you have the choice to parse an in-memory string, which is slightly more useful than being limited to parsing a file. Whilst it's a nice improvement over file or file descriptor only, it's not as nice as it could be.

Imagine your data is several gigabytes, or that you have major memory constraints that just prevent you from loading the whole data in memory: how can you use this string API? That's correct, you just can't, and most of the time people fall back to the file feeding API.

What happens if your data is not available as a file descriptor either, there's no way to store the data, and having the data in memory is just not going to work? At this stage you really want to be able to feed data incrementally, so that you only need to have one character at a time (or a small string) in memory.

	char c;
	while (c = read_next_char()) {
		parse_data(c);
	}

Incremental parsers are very simple to build when the underlying technology is a state machine. Each time the parsing function returns to the caller, the state is kept (either by the caller, to have something reentrant, or inside the parser itself, which is less recommended), and when the parser is called again the state is passed back to it.
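To make the caller-held state variant concrete, here is a minimal sketch. The grammar (comma-separated unsigned integers) and all the names (parser_state, parser_init, parser_feed, parser_finish) are assumptions made up for the illustration; they don't come from any real library.

```c
/* Hypothetical incremental parser for comma-separated unsigned integers.
 * All the state lives in a struct owned by the caller, so the parser is
 * reentrant and can be fed one character at a time. */
enum parser_mode { BETWEEN_NUMBERS, IN_NUMBER, FAILED };

struct parser_state {
	enum parser_mode mode;
	long current; /* value of the number currently being read */
	long sum;     /* sum of all the completed numbers */
	int count;    /* how many numbers were completed */
};

void parser_init(struct parser_state *st) {
	st->mode = BETWEEN_NUMBERS;
	st->current = 0;
	st->sum = 0;
	st->count = 0;
}

/* Feed one character. Returns 0 on success, -1 once the input is invalid. */
int parser_feed(struct parser_state *st, char c) {
	if (st->mode == FAILED)
		return -1;
	if (c >= '0' && c <= '9') {
		st->current = st->current * 10 + (c - '0');
		st->mode = IN_NUMBER;
		return 0;
	}
	if (c == ',' && st->mode == IN_NUMBER) {
		st->sum += st->current;
		st->count++;
		st->current = 0;
		st->mode = BETWEEN_NUMBERS;
		return 0;
	}
	st->mode = FAILED;
	return -1;
}

/* Signal the end of input, so that the last number is accounted for.
 * Returns 0 if the input ended on a complete number, -1 otherwise. */
int parser_finish(struct parser_state *st) {
	if (st->mode != IN_NUMBER)
		return -1;
	st->sum += st->current;
	st->count++;
	st->current = 0;
	st->mode = BETWEEN_NUMBERS;
	return 0;
}
```

Since the state survives between calls, the input can arrive in arbitrary pieces: feeding "12,3", then "4,", then "5" gives the same three numbers (12, 34 and 5) as feeding "12,34,5" in one go.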

This is the holy grail, since every other API can be developed on top of this very simple API, and even better, the wrappers are completely trivial (a couple of lines each):

	file API:

	void parse_fd(int fd) {
		int c; /* int, not char, so that EOF can be told apart from data */
		while ((c = read_fd_next_char(fd)) != EOF) {
			parse_data(c);
		}
	}
	
	string API:

	void parse_string(const char *s) {
		int i;
		for (i = 0; s[i]; i++)
			parse_data(s[i]);
	}

In more abstract languages, parsers usually take streams, which are abstract APIs that hide where the data is coming from and when the data is available, resulting in more or less the same behavior as the incremental API. However you lose control of the parser execution, meaning it is hard to stop the parser short of injecting unexpected data through the input stream.

A simple example would be cancelling the parsing of some arbitrary data because the user requested it. This is easy simply because you're processing the data in small chunks, and the caller is in charge of scheduling the parsing function.

	void parse(const char *s) {
		int i;
		for (i = 0; s[i]; i++) {
			if (user_cancel)
				break;
			parse_data(s[i]);
		}
	}

As a user of parsing functions, you should ask for nothing less than an incremental parser. As a developer of parsing functions, you should offer nothing less.

I wanted to develop a very small tokenizer, for the purpose of coloring code. I started with a very naive approach of cutting the input into lines and words, the way I would probably do it, poorly, in other languages. As expected, the approach is so poor that more than half of what I wanted to highlight was not properly atomized: words only works on whitespace boundaries, which means that keywords or digits with a punctuation mark just afterwards were wrongly "worded".

I thought with dread about a simple solution for this problem. There were 2 options: regexes, or a parser. In the past, I would probably have chosen the regexes option, even though regexes are only good up to a certain point, and are very hard to read after a while. That would certainly still hold in my previous languages, where writing a parser means using a different tool (bison for C, ..), with a different syntax (BNF), and lots of painful mind tricks to make the grammar sane for the tool's underlying algorithm.

That time is now gone: parsec is so easy to use, integrate and develop with that even regexes feel painful.

First I want to have something really simple: something that takes numbers and symbols and has them properly labeled. The data type would be something like:

data Atom =
          Symbol String
        | Number String
        | Other Char

The Other constructor is there to catch all characters that were not identified as either a symbol or a number. The first thing is to define the high-level things, such as:

atomize = manyTill (choice [ symbol, number, other ]) eof

It means atomize is a parser that does the following: until (manyTill) we reach the end of file (eof), choose (choice) between the symbol parser (symbol), the number parser (number) or the other parser (other). choice will iterate over the parser list until one succeeds.

Let's start by defining the easiest parser: other. This parser is there to catch any character. It's not supposed to fail, since it accepts anything, and it returns an Other atom. The definition is extremely simple:

other = anyChar >>= return . Other

Then it's time to define the symbol and number parsers. Because both parsers can fail if they're not parsing a symbol or a number respectively, you need to wrap each parser in parsec's try combinator. It tells the parser to save the input and, if the parser following try fails, just rewind the input to where it was saved.

So a symbol is one or more characters with these specific constraints: the first character needs to be an alpha character (a-z and A-Z), and we also allow _. The following characters, if they exist, can be alpha characters, underscores and digits.

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

And a number is even simpler: it's just one or more digit characters.

number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number

That's it, you have a very simple lexer in about a dozen lines (plus an import and type signatures):

import Text.ParserCombinators.Parsec

data Atom =
          Symbol String
        | Number String
        | Other Char

atomize :: Parser [Atom]
atomize = manyTill (choice [ symbol, number, other ]) eof

other, number, symbol :: Parser Atom
other = anyChar >>= return . Other
number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number
symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

At this point you might say: alright, but with regexes it would probably fit in this number of lines too. Which is true, but I think it misses the point: not only is it as simple to parse simple tokens like those, it's also possible to extend the parser easily and naturally to extremely complex cases, where a regex state machine is not able to really understand things that require a proper parser. For example, the following code mixes comments and strings. The comment opening tag appears inside a string, so it should not be taken as the beginning of a comment:

	printf("C comment looks like: /*\n");
	printf("followed by the closing tag: */\n");

It's hard for a regex state machine to do the right thing (unless you want to painfully craft a regex that takes all those embedded cases into account). The truth is that most of the time, people who develop regex-based parsers forget the hard cases.
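To show what "understanding" the string/comment case takes, here is a rough sketch of a hand-written scanner that tracks whether it is inside a string or inside a comment. It's written in C rather than with parsec, the names (scan_c_source, scan_result) are made up for the illustration, and it deliberately ignores // comments and character literals, so take it as a sketch of the idea and not a real C lexer.

```c
/* Hypothetical scanner: counts the C comments and string literals found in
 * a source text. Simplifications: line comments and character literals
 * are not handled. */
enum scan_state { CODE, SLASH, COMMENT, COMMENT_STAR, STRING, STRING_ESCAPE };

struct scan_result {
	int comments;
	int strings;
};

struct scan_result scan_c_source(const char *src) {
	struct scan_result r = { 0, 0 };
	enum scan_state st = CODE;

	for (; *src; src++) {
		char c = *src;
		switch (st) {
		case CODE:
			if (c == '/') st = SLASH;
			else if (c == '"') st = STRING;
			break;
		case SLASH: /* seen a '/', it opens a comment only if followed by '*' */
			if (c == '*') st = COMMENT;
			else if (c == '"') st = STRING;
			else if (c != '/') st = CODE;
			break;
		case COMMENT: /* quotes are just text in here */
			if (c == '*') st = COMMENT_STAR;
			break;
		case COMMENT_STAR: /* seen a '*' inside a comment */
			if (c == '/') { r.comments++; st = CODE; }
			else if (c != '*') st = COMMENT;
			break;
		case STRING: /* comment tags are just text in here */
			if (c == '\\') st = STRING_ESCAPE;
			else if (c == '"') { r.strings++; st = CODE; }
			break;
		case STRING_ESCAPE: /* skip whatever follows a backslash */
			st = STRING;
			break;
		}
	}
	return r;
}
```

On the two printf lines above, this scanner finds 2 strings and 0 comments, because the comment tags are only ever seen while in the STRING state.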

I've been looking at increasing the performance of Haskell-based software lately, more precisely of the blog. Here's a small benchmark, measured on the main page, of the different changes I introduced to the codebase.

no cache, no bytestring: 94 MB memory, 110 ms rendering time

no cache, bytestring: 35 MB memory, 50 ms rendering time

cached, no bytestring: 30 MB memory, 35 ms rendering time

cached, bytestring: 450 KB memory, 10 ms rendering time

As you can see, introducing bytestring is a major performance enhancer compared to the unpacked and inefficient Haskell String. I'm sure there's more that can be squeezed out, since I have to convert back and forth to normal strings in 2 different places (parsec 2 doesn't seem to support bytestring, and hstringtemplate cannot instantiate a stringtemplate from a bytestring). The next step is to get rid of the unpacking/packing and see which optimisations can be done: I need to find out how to do Haskell profiling.

I used to buy Apple hardware (1), because the hardware usually ended up being the slickest. Not only that, but Apple is not scared to impose changes when there's a clear benefit to the user, instead of keeping legacy interfaces/ports for ages.

As a firm believer in opensource and open specifications, I never ended up using any Apple media, software, or even operating system. However, my buying of Apple hardware is coming to an end.

As Apple moves more and more towards designing hardware that suits its own needs, running opensource software on that hardware is proving more difficult. My own problem was with my last Apple laptop, the macbook air version 2, which happens to fail in lots of annoying ways.

Most of the time, booting with the Apple cdrom drive (which happens not to follow the USB spec regarding power usage) would lead to bricking the only USB port. The first time, I sent the hardware back under warranty, imagining a hardware defect. However, as the second machine did exactly the same, the probability that it was a hardware defect as well was quite low.

The only solution to unbrick the USB port was to disconnect the internal battery. So fortunately it seems the brickage is software based, and can be reversed at the cost of losing some stored values (the USB state seems to be stored close to lots of other state, like wifi passwords) and at the greater cost of the warranty going away as soon as you open the machine.

The iphone has the same characteristic of closedness. Without itunes, there's no way to synchronize the device or even update it. Not only is itunes closed software, it doesn't even run on linux, making this a Mac OS X or Windows only operation.

And now the iPad definitely has the same characteristics, which will make me stay away from it. First, the Apple A4 chip: I can only imagine that the cpu will be fairly open, being an ARM cpu, but the gpu part will probably stay so closed that even if someone gets to boot the hardware, there's a good chance that getting the gpu to actually display anything would be a new challenge. And the synchronisation probably has the same story as the iphone.

Lots of people don't see it yet, but Apple is the wolf in the sheepfold. This is a threat to the opensource world.

So no, I won't be buying any Apple hardware anymore. However, I'll be on the market for new slick hardware: a new phone and a new laptop for a start.

(1): 1 ipod, 1 iphone, 1 ibook, 2 macbooks, 1 macbook air

Comments on this blog haven't been implemented yet, however I'm close to finishing the design of how they will work. Long story short, the comments and trackbacks will not change anything on the server side, but will be sent to a queue for moderation/acceptance that can be processed locally on the machine where I write the actual posts.

The benefits of this approach are mainly:

  • server-side data stays read-only from the webserver point of view.
  • all comments/trackbacks have to be approved before actually being posted: the net effect is a reduction in spam.
  • duplicate and off-topic comments can be filtered out.
  • comments related to a post update (something missing in the post, for example) don't have to be posted, but just acted on.

As a simple prototype, my initial queue design is going to be built on top of SMTP and email. Each time a comment or trackback is received, instead of modifying a file or a database on the server side, a simple email will be sent to a mailbox after basic checks are done (spam checking mostly).

At this point I can just process the queue of comments with my favorite $MUA. The acceptance process is simply piping the email through a special comment-adding binary, which will add the comment to my local data. All this is just a simple $MUA macro that does the piping with a simple and swift keybinding. At this point the comment is not visible yet, but will be as soon as I push the data back to the server.

Eventually, new queueing systems can be developed on top of XMPP or AMQP. Also, the queue processing could be done by a simple bot when the data doesn't have to be moderated.

Do you legally agree to license terms that are not published yet? No?

Then, do not put a """version 2/3 or more""" in your license text. Doing so means you agree that any future version of this license that may be created can be chosen by those using what you distribute, even if the license becomes something completely different from what you agreed to at the time you chose it.

That's it, I finally released my first Haskell module.

It's a wrapper for git that provides low-level operations and some high-level ones too. It's not the best Haskell code ever, especially since this is one of my first pieces of useful Haskell code, and I'm going to improve it slowly over time.

It's available here: http://github.com/vincenthz/hs-libgit