precedence is the ability to give more or less priority to symbols. for example the expression 5 + 4 * 10 can be understood in two different way. the first one, (5 + 4) * 10 which give you the value 90, and the second one 5 + (4 * 10) which give you the value 45. basic mathematics taught us that the second one is the right one, and that * has priority over +. in this case * has higher precedence than +.

associativity is the way to associate same thing together. for example the expression 8 / 4 * 4 can again be understood in two different way. the first one (8 / 4) * 4 which give you the value 8, and the second one 8 / (4 * 4) which gives you the value 0.5. * and / got the same precedence so you can't determine only with this criteria. on a left-to-right association, expression associate from left to right. so 1 * 2 * 3 * 4 is actually understood as ( (1 * 2) * 3 ) * 4. on a right-to-left association with associate term on the opposite way. 1 * 2 * 3 * 4 is understood as 1 * (2 * ( 3 * 4 ) ).

common parser generator, using LALR(1) type of parsing, work with a stack of symbol and only do two operations repetitively: reduce or shift. shift means that we take the symbol we looking at and we put it on the stack. reducing means that part of (or all) the stack can be replace by some rule.

the following grammar contains conflict with the + and * terms. conflicts means that the parser has multiple options as to how things can be parsed.

E -> E * E (rule 1)
  -> E + E (rule 2)

when parsing the following expression E + E * E, the following actions are performed:

Stack           Input           Action
                E + E * E       shift
E               + E * E         shift
E +             E * E           shift
E + E           * E             shift/reduce conflict

At this point the parser can choose either to replace the stack elements E + E by E through rule 2, or it can choose to shift the following elements through rule 1. by default usually, they choose to shift which lead into the following

Stack           Input           Action
E + E           * E             shift
E + E *         E               shift
E + E * E                       reduce (rule 1)
E + E                           reduce (rule 2)
E                               accept

that means the expression has been parsed E + (E * E).

going back to the shift choosing, and replacing it by a reduce will lead into the following:

Stack           Input           Action
E + E           * E             reduce (rule 2)
E               * E             shift
E *             E               shift
E * E                           reduce (rule 1)
E                               accept

which is (E + E) * E

in this specific case, the grammar is very easily written to get rid of the conflicts between the 2 operators. however when the number of symbols grows this is getting more and more tedious to do.

Fortunately, this is possible to actually influence the parser to do an educated guess at what to do when there's a shift/reduce conflict instead of defaulting to some action. this is done with precedence and associativity of terminal symbols.

When the precedence of the next symbol is higher than your current stack precedence, you want to shift, and when it's lower, reduce. when precedence is the same, for example * and /, associativity come into action. left-to-right associativity will means reduce, and right-to-left shift.

practically with bison or ocamlyacc, precedence and associativity are defined together at the beginning of the grammar file. using the %left %right %nonassoc directives.

%left directive means that the following symbols will associate from left-to-right. %right directive means right-to-left. and the %nonassoc directive means that instead of associating symbols either from left-to-right or right-to-left, we want the parser to error-out on those case, which means that the user has to write the way it's associated explicitly (using parenthesis for examples).

The precedence is implicitely defined with the same directives using the order in which they appears in the file. the first of those directives will be the lowest precedence, and the last of those directives will be the highest precedence.

for example a subset of the C operators precedence will be written the following way:

%left ,
%right = += -= *= ...
%left == !=
%left <= < > >=
%left << >>
%left + -
%left * / %

note that sometimes some symbol have different precedence in different context. -4 * 5 need to be understood has (- 4) * 5. using the previous precedence this is not possible since the precedence of the stack after shifting "- 4" is the precedence of -, which is lower than the precedence of *. hence the parser will shift to give priority to *.

the %prec directive is useful to artificially change the precedence of the stack. in our case we want to define a dummy symbol that going to have higher precedence than *. First you need to tweak the precedence directives adding

%left + -
%left * / %
%left DUMMY_MINUS

and in your grammar the "E -> - E" rules will be rewritten "E -> - E %prec DUMMY_MINUS".

most of the articles found related to parser never talk about the feeding data API. it's only about what's the parser is doing inside (the type of gramar LR, the algorithm, ..)

Unfortunately lots of things that do parsing, offers to the user of the facility the choice of parsing a file as in:

	object *parse_file(const char *filename);
	object *parse_file(FILE *file);

What if you source is available in memory only; there's no easy way to feed the data to this API short of dumping everything to disk and reading it back, or creating a pipe between self data. pretty convulated.

Sometimes you have the choice to parse a in memory string that is slighlty more useful than beeing limited to parsing a file. whilst it's a nice improvement over file or file descriptor only, it's not as nice as it could be.

imagine your data is several gigabytes or that you have major memory constraint that just prevent you from loading the whole data in memory, how can you use this string api. that's correct you just can't, and most of the time people fallback to the file feeding api.

what happens if your data is neither available as a file descriptor, there's no way to store the data, and having the data in memory is just not going to work. At this stage you really want to be able to feed data incrementally so that you just need to have in memory only character at a time (or a small string).

	char c;
	while (c = read_next_char()) {
		parse_data(c);
	}

incremental parser are very simple to have when the underlying technology is a state machine. each time the parsing function returns to the caller, the state is kept (either by the caller to have something reentrant or inside the parser itself which is less recommended), and when recalling the parser the state is passed back to it.

this is the holy grail, since every other API can be develop on top of this very simple API, and even better, the wrapper are completly trivial (couple of lines each):

	file API:

	parse_fd(int fd) {
		while ((c = read_fd_next_char(fd)) != EOF) {
			parse_data(c);
		}
	}
	
	string API:

	parse_string(char *s) {
		int i;
		for (i = 0; s[i]; i++)
			parse_data(s[i]);
	}

In more abstracted language, parser usually take stream which are abstracted API to hide where's data is coming from, and when data is available. resulting in more or less the same behavior as the incremental api. however you lose control of the parser execution, meaning this is hard to stop the parser short of injecting unexpected data through the input stream.

A simple example, would be cancelling the parsing of some arbitrary data because the user requested it. just by the fact that you're processing the data by small chunk, and that's the caller is in charge of the scheduling of the parsing function.

	parse(char *s) {
		int i;
		for (i = 0; s[i]; i++) {
			if (user_cancel)
				break;
			parse_data(s[i]);
		}
	}

As a user of the parsing functions, you should ask for nothing less than incremental parser. as a developper of the parsing functions, should offer nothing less.

I wanted to develop a very small tokenizer, for the purpose of coloring code. I started with a very naive approch or cuttings lines and words, the way i probably would do it poorly in other language. As expected the approch is so poor, that more than half of what i wanted to highlight were not properly atomized, since words only works on whitespaces boundaries, which means that keywords or digits with a punctuation mark just afterwards were wrongly "worded".

I dreadfully though about a simple solution for this problem. 2 options were there: regexes, or a parser. In a past, i would probably choose the regexes option, even though regexes are only good up to a certain point, and are very hard to read after a while. It would certainly still holds in previous language where parser means to use a different tools (bison for C, ..), with a different syntax (BNF), with lots of painful mind trick to the tool to make the grammar sane to the underlying algorithm.

This time is now gone, parsec is so easy to use, integrate and develop that even regexes are painful.

First i want to have something really simple. something that takes number and symbol and have them properly labeled. the data class would be something like:

data Atom =
          Symbol String
        | Number String
        | Other Char

the Other class is there to catch all characters that were not identified either as a symbol or as a number. The first thing is defining the highlevel things such as:

atomize = manyTill (choice [ symbol, number, other ]) eof

it means atomize is a parser that do: until (manyTill) we reach end of file (eof), choose (choice) between the symbol parser (symbol), or the number parser (number) or the other parser (other). choice will iterate over the parser list until one succeed.

let's start by defining the easiest parser: other. this parser is there to catch any character. it's not suppose to fail, since it accept anything, and return an other atom. the definition is extremely simple:

other = anyChar >>= return . Char

Then it's time to define the symbol and number parser. because both parsers can fail if there's not parsing a symbol or a number respectively, you need to wrap the parser in a try parsec keyword. It tells the parser to save the input and if the parser following try fails, just rewind the input to where it was saved.

So a symbol is one or more characters that have those specific constraints: the first character need to be an alpha character a-z and A-Z and we also allow _. then following characters, if they exists, can be alpha character, underscore and digits.

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

and a number is even simpler, it just one or more digits characters.

number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number

that's it, you have a very simple lexer in 12 lines.

data Atom =
          Symbol String
        | Number String
        | Other Char
atomize = manyTill (choice [ symbol, number, other ]) eof

other = anyChar >>= return . Char
number = try $ do many1 (oneOf [ '0'..'9' ]) >>= return . Number
symbol = try $ do
	f <- oneOf symbolFirstChar
	ending <- many (oneOf symbolChar)
	return $ Symbol (f : ending)

symbolFirstChar = [ 'a'..'z' ] ++ [ 'A'..'Z' ] ++ [ '_' ]
symbolChar = symbolFirstChar ++ [ '0'..'9' ]

At this point you might say, alright but with regex it would probably fit in this amount of line too, which is true, but I think it misses the point that not only it's as simple to parse simple tokens like those, it also possible to extends it easily and naturally, to extremelly complex cases, because the regex state machine is not able to really understand things that requires a proper parser. for example the following example mix comment and string. the comment beggining tag appears in the string, so it should not be taken as a beggining of a comment:

	printf("C comment looks like: /*\n");
	printf("followed by the closing tag: */\n");

it's hard for a regex state machine to do the right thing (unless you want to painfully craft a regex that takes all those embedded cases). the truth is most of the time, people that develop regex parser, always forgot the hard cases.