Homemade Stuff100_homemade_rss_feedWed Jul 7 16:38:51 BST 2010Vincent Hanquezpinboard -- blog software in haskell(2010,7,6,"haskell_cryptohash")cryptohash in haskell2010-7-6T00:00:00+00:00Vincent Hanquez<p>There's a bunch of different haskell projects that implements some cryptohashes (sha1, md5, ..).
However they are very differently constructed, with unconsistent interface overall. Also most of them
has poor performance, and expose only the high level operation: hash data in one go, giving back a digest.</p>
<p>Also most of them implement really common cryptohashes like sha1, sha2 and md5, but it's hard to find anything
out there for less common cryptohashes (md2, md4, ripemd comes to mind).</p>
<p>To rectify this problem, here come the hs-cryptohash package; it's a bundle of 9 differents algorithms implementing
all common cryptohash (sha1, sha2 family, md5), but also some less common one (md2, md4, ripemd160). Each interfaces
of thoses modules are completely similar, which make it simple to replace one hash algorithm by another.</p>
<p>Under the hood, all thoses algorithms are implemented in C, with a
relatively well optimized for most of them. Some of them come close to openssl,
one of the fastest library for cryptohash, whereas most of them are around the
same ball park as the sum programs available in GNU coreutils.</p>
<p>Over the hood, everything is pure despite having IO related calls because of
the C FFI. There's 2 mains interfaces exposed: the one-go interface, and the incremental interface.</p>
<p>The one-go interface exposes 2 calls hash and hashlazy, that takes a strict bytestring and lazy bytestring respectively, and gives back a digest bytestring</p>
<p>The incremental interface exposes 3 calls: init, update and finalize. This
is on par with what most cryptohash algorithm exposes as interface. This is
very useful to use thoses calls when data comes through multiples chunks and
you don't want to "store" all the chunks before hashing. This interface is slighly slower
than the one-go one, for the simple reason that to expose a pure interface, the
context cannot be updated in place.</p>
<p>Here's goes some numbers comparing debian coreutils digest binaries (sha1sum,
sha256sum ..) with openssl and then with new haskell cryptohash modules in
incremental and then one pass mode. each run is processing a 500mb random file,
then averaged on 4 trials.</p>
<pre>
coreutils cryptohash(op) cryptohash(in) openssl
sha1 2.97500 2.65250 3.64250 1.90250
sha224 5.36000 5.78250 6.78250 4.30500
sha256 5.33500 5.76000 6.77750 4.26750
sha384 3.65500 3.65750 4.69000 2.78500
sha512 3.66000 3.66500 4.70750 2.79750
md5 1.65750 1.75750 2.70750 1.50250
ripemd N/A 3.80000 4.77000 3.83
</pre>
<p>
And now some numbers with some other hackage library (digesting a 500mb file):
</p>
<pre>
SHA PureMD5 cryptohash speedup
sha1 23.1s N/A 2.65s x 8.7
sha256 44.8s N/A 5.76s x 7.7
sha512 25.4s N/A 3.55s x 7.1
md5 N/A 9.42 1.76s x 5.3
</pre>
(2010,3,24,"web_color")web page colors2010-3-24T00:00:00+00:00Vincent Hanquez<p>Dear Web Developpers,</p><p>when you are setting web page colors, please set also input related object background colors.
Because you set usually set text color, through the * CSS object, to be black
without setting the input background to be light, people with black themes end
up with a black on black text which is quite impractical.</p><p>setting input colors is not really hard, so please do. OK ? Thanks Bye !</p>(2010,3,20,"liferea_eating_space")liferea is eating my disk space2010-3-20T00:00:00+00:00Vincent Hanquez<p>Liferea is a news aggregator that i've been using for quite a while. The
interface is simple, and it was fast enough. but lately its behavior became
more erratic, so I start looking at the reason of that behavior.</p><p>First things first, the feed's data are all stuffed in a sqlite3 database.
unfortunately it seems that the caching policy is just mirroring everything
locally. couple of weeks ago the database was already a whopping 120Mb, now it
just reach 195Mb. I didn't add any new feeds since that last time, so obviously
something is really wrong. does liferea delete any things in its database ?</p><p>Over looking at the database with the sqlite3 command line utility indicate
that the items table is taking the most space, followed by the itemsets table.</p><p>It's also got this annoying behavior that everything that is copy and pasted in
the liferea windows is considered as a new news feed. For people reading with
mouse selection, which is something i'm trying to quit, this is really
annoying. specially since sometimes it can create multiples (20 was my top
score) news feed subscription with some random text selection which are
obviously not web links. Worst things, you can't do multiples selection in the
feed selection view, so removing bad feeds is a one by one process.</p><p>Looking at liferea-1.6.3 source, with 54086 lines in .c and .h files, the task
of patching it seems daunting. I did a hack to disable copy and paste
completly, but changing the way liferea stores data, seems a too long project.</p><p>So i've decided to take the matter in my own hand, and make a small and fast
news aggregator. one that has a storage well designed, so that reading a new
items in a feed doesn't make the system crawl for IO to the always-growing
database. It will be probably similar to some of those rss2imap projects,
except it will be in haskell and it won't sucks.</p>(2010,3,10,"haskell_libgit")haskell libgit module available2010-3-10T00:00:00+00:00Vincent Hanquez<p>That's it i finally released my first haskell module.</p><p>it's a wrapper for git, that provides lowlevel operations and some highlevel ones too.
It's not the best haskell code ever, specially since this is one of my first piece
of useful haskell code, and i'm going to improve it slowly over time.</p><p>it's available here <a href="http://github.com/vincenthz/hs-libgit">http://github.com/vincenthz/hs-libgit</a></p>(2010,3,9,"license_text")license text2010-3-9T00:00:00+00:00Vincent Hanquez<p>Do you legally agree to license terms that are not published yet ? no ?</p><p>Then, do not put a """version 2/3 or more""" in your license text. Doing so
means that you agree that all future version of this license that may be
created can be chosen when using what you distribute. even if the
license became something completly different than what you did agree on at
the license choosing time.</p>(2010,2,23,"blog_comment")blog comments design2010-2-23T00:00:00+00:00Vincent Hanquez<p>Comments on this blog hasn't been implemented yet, however I'm close to finish
the design of how they work. Long story short, the comments and tracebacks will
not change anything on the server side, but will be sent to a queue for
moderation/acceptation that can be processed locally on the machine where I do
the actual posts.</p><p>The benefit for this approch are mainly:</p><ul><li>server-side data stays read-only from the webserver point of view.</li><li>all comments/tracebacks are to be approved before actually posted: net effect on reducing spams.</li><li>duplicated and OT comments can be filtered out.</li><li>comments related to post update (something missing in the post for example) doesn't have to be posted, but just acted on.</li></ul><p>As a simple prototype, my initial queue design is going to be on top of SMTP
and email. Each time a comment or traceback is receive, instead of modifying
a file or a database on the server side, a simple email will be sent to a
mailbox after basic checked where issued (spam checking mostly).</p><p>As this point i can just process the queue of comments with favorite $MUA.
The acceptation process is simply piping the email through a special adding
comment binary, that will add the comment to my local data. All this just a
simple $MUA macro that does the piping with a simple and swift keybinding. At
this point the comment is not visible but will be as soon as I push the data
back to the server.</p><p>Eventually, new queueing system can be developped on top of XMPP, AMQP.
Also the queue processing could be done by a simple bot when the data
doesn't have to be moderated.</p>(2010,1,28,"apple_tablet")why i don't buy apple hardware anymore2010-1-28T00:00:00+00:00Vincent Hanquez<p>I used to buy apple hardware (1), because the hardware usually ended up to be the
most slick. Not only that, but apple is not scared to imposes changes when
there's a clear to the user benefit instead of keeping legacy interface/ports
for ages.</p><p>I never ended up using any Apple media, software, or even operating system. As
an firm believer of opensource, and open specifications. However this is coming
to an end.</p><p>As Apple move more and more in designing hardware that suit their needs, the
ability to run opensource software on the hardware is proving more difficult.
My own problem was with my last apple laptop, the macbook air version 2, that
happens to fail in lots of various annoying way.</p><p>Most of the time, booting with the apple cdrom (which happens not to follow
the USB spec regarding power usage), would lead to bricking the only usb port.
The first time i sent the hardware back to warranty imagining a hw defect,
however as the second model did exactly the same, the probability that it was
hw defect as well, was quite low.</p><p>The only solution to unbrick the USB port, was to disconnect the internal
battery. So fortunately its seems the brickage is software based, and can be
reversed at the cost of loosing some stored values (the usb state seems to be
stored close to lot of other states, like wifi passwords) and at the greater
cost of the warranty going away as soon as you open the machine.</p><p>The iphone got the same characterics of closeness. Without itunes, there's no
way to synchronize the device or even update it. itunes is closed software, but
it doesn't even on linux, rendering the operation a macOsX or windows only
operation.</p><p>And now, the iPad definitely got the same characteric that would make me stays
away from it. First the Apple A4 chip, I can only imagine that the cpu will be
fairly open beeing an ARM cpu, but the gpu part will probably stay so close,
that even if someone get to boot the hardware, there's a good chance that
having the gpu actually displaying anything would be a new challenge. and the
synchronisation has probably the same story as the iphone.</p><p>While lots of people doesn't see it yet, but Apple is the wolf in the
sheepfold. This is a threat to the opensource world.</p><p>So no, i won't be buying any apple hardware anymore; However I'll be on the
market for new slick hardware: a new phone and a new laptop for a start.</p><p>(1): 1 ipod, 1 iphone, 1 ibook, 2 macbooks, 1 macbook air</p>(2010,1,20,"haskell_performance")some haskell performance enhancement2010-1-20T00:00:00+00:00Vincent Hanquez<p>I've been looking at increasing performance of haskell based software lately, more precisely
of the blog. Here's a small benchmark done related to different changes i introduced to the
codebase on the main page.</p><p>no cache, no bytestring: 94mb memory, 110ms rendering time</p><p>no cache, bytestring: 35mb memory, 50ms rendering time</p><p>cached, no bytestring: 30mb memory, 35ms rendering time</p><p>cached, bytestring: 450kb memory, 10ms rendering time</p><p>As you can see introducing bytestring, is a major performance enhancer compare to
the unpacked and inefficient haskell String. I'm sure there's more that can be
squeezed, since i have to go back to normal string back and forth at 2
differents place; parsec 2 doesn't seems to support bytestring, hstringtemplate
cannot instanciate a stringtemplate from a bytestring) next step is to get rid
of the unpacking/packing and see which optimisation can be done: need to find howto to
do haskell profiling.</p>(2010,1,10,"atomizing_parsec")code tokenization2010-1-10T00:00:00+00:00Vincent Hanquez<p>I wanted to develop a very small tokenizer, for the purpose of coloring code. I
started with a very naive approch or cuttings lines and words, the way i
probably would do it poorly in other language. As expected the approch is so
poor, that more than half of what i wanted to highlight were not properly
atomized, since <b>words</b> only works on whitespaces boundaries, which means
that keywords or digits with a punctuation mark just afterwards were wrongly
"worded".</p><p>I dreadfully though about a simple solution for this problem. 2 options were
there: regexes, or a parser. In a past, i would probably choose the regexes
option, even though regexes are only good up to a certain point, and are very
hard to read after a while. It would certainly still holds in previous language
where parser means to use a different tools (bison for C, ..), with a different
syntax (BNF), with lots of painful mind trick to the tool to make the grammar
sane to the underlying algorithm.</p><p>This time is now gone, parsec is so easy to use, integrate and develop that
even regexes are painful.</p><p>First i want to have something really simple. something that takes number and symbol
and have them properly labeled. the data class would be something like:</p><pre>
<span style="color: blue;">data</span> Atom =
Symbol String
<span style="color: blue;">|</span> Number String
<span style="color: blue;">|</span> Other Char</pre><p>the Other class is there to catch all characters that were not identified either as a symbol
or as a number. The first thing is defining the highlevel things such as:</p><pre>
atomize = manyTill (choice [ symbol, number, other ]) eof</pre><p>it means atomize is a parser that do: until (manyTill) we reach end of file
(eof), choose (choice) between the symbol parser (symbol), or the number parser
(number) or the other parser (other). choice will iterate over the parser list
until one succeed.</p><p>let's start by defining the easiest parser: other. this parser is there to
catch any character. it's not suppose to fail, since it accept anything, and return
an other atom. the definition is extremely simple:</p><pre>
other = anyChar >>= return . Char</pre><p>Then it's time to define the symbol and number parser. because both parsers can fail if
there's not parsing a symbol or a number respectively, you need to wrap the parser in a
<b>try</b> parsec keyword. It tells the parser to save the input and if the parser following
try fails, just rewind the input to where it was saved.</p><p>So a symbol is one or more characters that have those specific constraints: the
first character need to be an alpha character a-z and A-Z and we also allow _.
then following characters, if they exists, can be alpha character, underscore
and digits.</p><pre>
symbolFirstChar = [ <span style="color: red;">'a'</span>..<span style="color: red;">'z'</span> ] ++ [ <span style="color: red;">'A'</span>..<span style="color: red;">'Z'</span> ] ++ [ <span style="color: red;">'_'</span> ]
symbolChar = symbolFirstChar ++ [ <span style="color: red;">'0'</span>..<span style="color: red;">'9'</span> ]
symbol = try $ <span style="color: blue;">do</span>
f <span style="color: blue;"><-</span> oneOf symbolFirstChar
ending <span style="color: blue;"><-</span> many (oneOf symbolChar)
return $ Symbol (f : ending)</pre><p>and a number is even simpler, it just one or more digits characters.</p><pre>
number = try $ <span style="color: blue;">do</span> many1 (oneOf [ <span style="color: red;">'0'</span>..<span style="color: red;">'9'</span> ]) >>= return . Number</pre><p>that's it, you have a very simple lexer in 12 lines.</p><pre>
<span style="color: blue;">data</span> Atom =
Symbol String
<span style="color: blue;">|</span> Number String
<span style="color: blue;">|</span> Other Char
atomize = manyTill (choice [ symbol, number, other ]) eof
other = anyChar >>= return . Char
number = try $ <span style="color: blue;">do</span> many1 (oneOf [ <span style="color: red;">'0'</span>..<span style="color: red;">'9'</span> ]) >>= return . Number
symbol = try $ <span style="color: blue;">do</span>
f <span style="color: blue;"><-</span> oneOf symbolFirstChar
ending <span style="color: blue;"><-</span> many (oneOf symbolChar)
return $ Symbol (f : ending)
symbolFirstChar = [ <span style="color: red;">'a'</span>..<span style="color: red;">'z'</span> ] ++ [ <span style="color: red;">'A'</span>..<span style="color: red;">'Z'</span> ] ++ [ <span style="color: red;">'_'</span> ]
symbolChar = symbolFirstChar ++ [ <span style="color: red;">'0'</span>..<span style="color: red;">'9'</span> ]</pre><p>At this point you might say, alright but with regex it would probably fit in this amount
of line too, which is true, but I think it misses the point that not only it's as simple
to parse simple tokens like those, it also possible to extends it easily and naturally,
to extremelly complex cases, because the regex state machine is not able to
really understand things that requires a proper parser. for example the following example
mix comment and string. the comment beggining tag appears in the string, so it should not
be taken as a beggining of a comment:</p><pre>
printf(<span style="color: red;">"C comment looks like: /*\n"</span>);
printf(<span style="color: red;">"followed by the closing tag: */\n"</span>);</pre><p>it's hard for a regex state machine to do the right thing (unless you want to
painfully craft a regex that takes all those embedded cases). the truth is most
of the time, people that develop regex parser, always forgot the hard cases.</p>(2010,1,8,"parser")superior parser API2010-1-8T00:00:00+00:00Vincent Hanquez<p>most of the articles found related to parser never talk about the feeding data
API. it's only about what's the parser is doing inside (the type of gramar LR,
the algorithm, ..)</p><p>Unfortunately lots of things that do parsing, offers to the user of the facility
the choice of parsing a file as in:</p><pre>
object *parse_file(<span style="color: yellow;">const</span> <span style="color: yellow;">char</span> *filename);
object *parse_file(FILE *file);</pre><p>What if you source is available in memory only; there's no easy way to feed the
data to this API short of dumping everything to disk and reading it back, or creating a pipe
between self data. pretty convulated.</p><p>Sometimes you have the choice to parse a in memory string that is slighlty more
useful than beeing limited to parsing a file. whilst it's a nice improvement over file or
file descriptor only, it's not as nice as it could be.</p><p>imagine your data is several gigabytes or that you have major memory constraint that just prevent you from loading the whole data in memory, how can you use this string api. that's correct
you just can't, and most of the time people fallback to the file feeding api.</p><p>what happens if your data is neither available as a file descriptor, there's no way to store
the data, and having the data in memory is just not going to work. At this stage you really
want to be able to feed data incrementally so that you just need to have in memory only
character at a time (or a small string).</p><pre>
<span style="color: yellow;">char</span> c;
<span style="color: yellow;">while</span> (c = read_next_char()) {
parse_data(c);
}</pre><p>incremental parser are very simple to have when the underlying technology is a state machine.
each time the parsing function returns to the caller, the state is kept (either
by the caller to have something reentrant or inside the parser itself which is
less recommended), and when recalling the parser the state is passed back to it.</p><p>this is the holy grail, since every other API can be develop on top of this very simple API,
and even better, the wrapper are completly trivial (couple of lines each):</p><pre>
file API:
parse_fd(<span style="color: yellow;">int</span> fd) {
<span style="color: yellow;">while</span> ((c = read_fd_next_char(fd)) != EOF) {
parse_data(c);
}
}
string API:
parse_string(<span style="color: yellow;">char</span> *s) {
<span style="color: yellow;">int</span> i;
<span style="color: yellow;">for</span> (i = <span style="color: red;">0</span>; s[i]; i++)
parse_data(s[i]);
}</pre><p>In more abstracted language, parser usually take stream which are abstracted API to hide
where's data is coming from, and when data is available. resulting in more or less
the same behavior as the incremental api. however you lose control of the parser execution,
meaning this is hard to stop the parser short of injecting unexpected data through the input
stream.</p><p>A simple example, would be cancelling the parsing of some arbitrary data
because the user requested it. just by the fact that you're processing the data
by small chunk, and that's the caller is in charge of the scheduling of the
parsing function.</p><pre>
parse(<span style="color: yellow;">char</span> *s) {
<span style="color: yellow;">int</span> i;
<span style="color: yellow;">for</span> (i = <span style="color: red;">0</span>; s[i]; i++) {
<span style="color: yellow;">if</span> (user_cancel)
<span style="color: yellow;">break</span>;
parse_data(s[i]);
}
}</pre><p>As a user of the parsing functions, you should ask for nothing less than incremental
parser. as a developper of the parsing functions, should offer nothing less.</p>