Maximum Munch Principle

Environment: Compilers

Most STL users are already aware that they need to put in a extra space when taking a template as a template argument to any class. For example, when you want to make a vector from a complex number, which is also a template class, you have to put an extra space between the two > operators.

std::vector<std::complex<int> > vecComp;

If you forgot to put in the extra space, the compiler will treat it as a logical shift operator, >>. It is recommended that you extend the language to understand the meaning of >>, depending on its context [VAN03].

To better understand what is going behind the scenes and why we need it, we have to look a little bit into the workings of a compiler and its construction. The simplest definition of a compiler is that it is a program that translates one language into another language. The C++ compiler translates the C++ source program into the required platform-specific machine language.

The first phase of any compiler is to read the source program, the input stream, and convert it into tokens after removing the white spaces and comments, which are used by parser. This phase is called Lexical Analysis or scanner [AHO86]. It is possible to make a language that doesn’t have any comments and the Lexical Analyzer doesn’t remove white space, but in this case it is the responsibility of the parser and the other compiler phase to handle white spaces. Also, you have to rewrite its grammar to handle white spaces; this is not a reasonable and easy task.

The Lexical Analyzer usually stores line numbers for better error information, and may also store a copy of the source program with error information. But, in the case of multi-line comments or complete comments lines in the source program, storing of line numbers will not be same after removing the comments lines.

Tokens are a logically cohesive sequence of characters of collective meaning. In more technical terms, tokens are usually terminal symbols in grammar [AHO86]. Usually, token types are keywords, operators, identifiers, constants, strings, digits, punctuation symbols, and so forth. The character sequence, derived from the input stream, creates a token called lexeme.

One typical C++ scanner may generate the following token of a given C++ statement:

int a = b + 10;

This statement generates seven tokens. First, it generates the token keyword for int. Then there is an identifier, a tokenized first search into the symbol table’s data structure to store information about identifiers. If it is not found in the symbol table, enter its name “a” into the symbol table and generate a token type identifier. Then, the punctuation token is generated, followed by identifier b, with the same rule of symbol tables, which was used with identifier a, then the plus token, a digit token of value 10, and finally the punctuation token for a semicolon. Tokens are usually generated with their attributes. In the case of an identifier, the pointer to its symbol table entry is stored as an attribute of that token. If we write the pairs of tokens and its attribute of the above statement, it would be something like this:


{Keyword integer, NULL}
{Identifier, Pointer to symbol table entry of “a”}
{Assignment Operator, NULL }
{Identifier, Pointer to symbol table entry of “b”}
{Plus Operator, NULL}
{Digit, 10}
{Terminator symbol, NULL}

Compilers usually create tokens from the longest lexeme; in the case of >> in C++, the compiler creates a token of the shift right operator rather than two greater than operators [HOL90]. This is also called Maximum Munch Tokenization Principle, or simply Maximum Munch Principle, that the C++ implementation has to consider as many characters as possible to create a token.

Other examples of Maximum Munch Principles can be seen with a digraph. Take a look at the following code.

std::list<::x> lst;

Here <: is a digraph symbol and treated as [. So, the above code is compiled as

std::list[:x> lst;

You have to put extra space between the < and colon. The correct code is written as:

std::list< ::x> lst;

And the same is true in the following example.

int x = y%::z;

Here %: is again a digraph. Consider # and this statement becomes

int x = y#:z;

You have to put extra space here too to compile it correctly.

int x = y % ::z;

It is interesting to note the exact reverse of this when the Maximum Munch Principle is not applied, but readers of the code think it is. For example, someone once asked Bjarne Stroustrup to overload or create the ** operator, which exists in COBOL and some other languages. There is no ** operator in C++ and there isn’t any way to create any new operator in C++. But in C++, the * operator has two meanings, multiplication and dereference operator. You can overload both operators; the only difference is the number of parameters. The multiplication operator needs two parameters; however, the dereference operator needs only one. Stroustrup used the nice technique to overload both operators to give the illusion of an ** operator.


// overloading ** operator
#include <cmath>
#include <ciostream>

struct Index {
double d;
Index(double dd) :d(dd) { }
};

struct II {
double d;
II(double dd) :d(dd) { }
};

struct III {
double d;
III(double ddd) : d(ddd) { }
};

II operator*(Index i) {
return II(i.d);
}

double operator*(double d , II i) {
return pow(d,i.d);
}

int main () {
Index i = 3;
std::cout << 2**i << “n”;
}

Here, the statement 2**i gives the illusion of an ** operator, but in fact it is two * operators. One is a multiplication operator and the other is the dereference operator.

References


  1. [AHO98] Compilers Principles, Techniques, and Tools. Alfred V. Aho, Ravi Sehi, and Jeffrey D. Ullman 1986

  2. [HOL90] Compiler Design in C. Allen I. Holub 1990

  3. [VAN03] C++ Templates, The Complete Guide. David Vandevoorde, Nicolai M. Josuttis 2003

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read